[ckan-dev] Specification of data processing web services.

David Raznick kindly at gmail.com
Thu Jun 14 11:36:14 UTC 2012


Hello All

This is a very rough specification for web services whose main purpose
is processing data to be used with ckan, but which can also be used as
standalone services in their own right. These services can be split
into two types: "long running" and "synchronous".

An example of a long running service is the datastorer, which can fetch
delimited data, parse it and send it to the ckan datastore.  It would
be good if people could use this service independently of ckan and get
data into ckan's datastore (or any Elasticsearch endpoint) just by
choosing the CSV file location and a datastore location.  This would
allow interfaces on top of these services to offer more control over
the process. This service would also be used by ckan itself.

An example of a synchronous service is one that parses a CSV file and
returns a JSON representation of it in the same response; this could
be used, for example, by the datastorer service above to do its
parsing.

Having all services follow a single base specification would be very
useful for interoperability and ease of use.


**GET endpoints**

/status
Service level information, used to see if the service is up and running
and which task types it has available.
e.g. {"name": "datastorer", "task_types": ["csv_parse", "store"],
"version": "1.0"}

/task/task_id
Return information on the particular task. The status is one of
"pending", "error" or "completed".
e.g.
  {
    "task_type": "csv_parse",
    "task_id": "fdfdsfsafdsfafdfa",
    "status": "pending",
    "requested_timestamp": "2013-01-01 19:12:33",
    "sent_data": {"source_url": "www.somedata.org/data.csv",
                  "target_url": "www.thedatahub.io/api/data/abf3131-3213-312-321"}
  }

The service has no obligation to store its tasks after they have been
completed; however, it would be useful for long running tasks to be
kept for a period, especially ones that have errors.

/task
Return a list of fully qualified URLs of the currently stored task
information, ordered by the most recently requested task first.
e.g.
  ["http://datastorer.datahub.io/task/fdsafdsaffdsafsd",
   "http://datastorer.datahub.io/task/my_task_id"]
At some point this URL should take query parameters to help search for
or filter particular tasks, and should support limits to stop too many
tasks being returned.
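
To make the GET side concrete, here is a minimal client sketch in
Python using the requests library; the service base URL is purely
illustrative, not part of the spec.

  import requests

  BASE = "http://datastorer.datahub.io"  # hypothetical service location

  # Check the service is up and see which task types it offers.
  status = requests.get(BASE + "/status").json()
  print(status["name"], status["task_types"])

  # Walk the list of stored tasks and print their current state.
  for task_url in requests.get(BASE + "/task").json():
      info = requests.get(task_url).json()
      print(info["task_id"], info["status"])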

**POST endpoints**

/task/task_id
Start a new task with a specified task_id.

/task
Start a new task. (The service should generate a UUID for the task_id
of long running tasks.)

**Task data sent to services**

The task data posted to a service should consist of the following:

data: the data to be posted.

result_url (optional): the URL where the results should be sent.
This should be specified for long running tasks. e.g.
www.thedatahub.io/api/action/task_status.

api_key (optional): the API key needed to post the results to the
result_url. e.g. frewoitryu398wtrhw

task_type (optional): if the service has multiple task_types, specify
which one you are using. e.g. datastorer.  Services with multiple
task_types should say which are available in their '/status' call.

e.g.
  {
    "data": {"source_url": "www.somedata.org/data.csv",
             "target_url": "www.thedatahub.io/api/data/abf3131-3213-312-321"},
    "result_url": "www.thedatahub.io/api/action/task_status",
    "api_key": "fdsafsaffasdsaf",
    "task_type": "csv_parse"
  }
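
As a rough sketch only (the service URL is hypothetical and the field
values are just the ones from the example above), posting such a task
from Python and then polling for its status might look like this:

  import requests

  SERVICE = "http://datastorer.datahub.io"  # hypothetical service location

  task = {
      "data": {"source_url": "www.somedata.org/data.csv",
               "target_url": "www.thedatahub.io/api/data/abf3131-3213-312-321"},
      "result_url": "www.thedatahub.io/api/action/task_status",
      "api_key": "fdsafsaffasdsaf",
      "task_type": "csv_parse",
  }

  # POST /task starts the task; a long running service replies with a task_id.
  task_id = requests.post(SERVICE + "/task", json=task).json()["task_id"]

  # The caller can then poll GET /task/task_id until the status changes
  # from "pending" to "completed" or "error".
  info = requests.get(SERVICE + "/task/" + task_id).json()
  print(info["status"])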


**Data returned from services**

The tasks should return:

* A result JSON if the task is not a long running task.
* A long running task should just return the task_id, like the
following: {"task_id": "the_task_id"}
* If there is an error, the service should return the appropriate HTTP
status code with a JSON body describing the error.

If a result_url is specified, the result JSON or the error JSON should
also be sent to that URL.

The result JSON should be of the form:

  {
    "task_type": "csv_parse",
    "task_id": "the_task_id",
    "requested_timestamp": "2013-01-01 19:12:33",
    "completed_timestamp": "2013-01-01 20:12:33",
    "sent_data": {"source_url": "www.somedata.org/data.csv",
                  "target_url": "www.thedatahub.io/api/data/abf3131-3213-312-321"},
    "data": {"some": "data"}
  }


If there is an error, the JSON should be of the form:

  {
    "task_type": "csv_parse",
    "task_id": "fdfdsfsafdsfafdfa",
    "requested_timestamp": "2013-01-01 19:12:33",
    "completed_timestamp": "2013-01-01 20:12:33",
    "sent_data": {"source_url": "www.somedata.org/data.csv",
                  "target_url": "www.thedatahub.io/api/data/abf3131-3213-312-321"},
    "error": "error info"
  }
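
The spec above does not say how the api_key should be presented when
posting to the result_url; the sketch below assumes it goes in an
Authorization header (which ckan's API accepts), so treat that header
name as an assumption rather than part of the spec.

  import requests

  def post_result(result_url, api_key, result):
      # Send the result (or error) JSON back to the caller's result_url.
      # Assumes the api_key is accepted in the Authorization header; other
      # receivers may expect it elsewhere.
      headers = {"Authorization": api_key} if api_key else {}
      response = requests.post(result_url, json=result, headers=headers)
      response.raise_for_status()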

**Durability**

Each service can manage the tasks it receives however it wants. It
should always try to return something, either an error or a result.
It is up to each service how it queues up long running tasks.
It is up to the sender to retry if the service is not available.

**Implementation**

There is no restriction on the framework or language used to implement
each service. It is advised that each service be as independent as
possible.  There should be no centralised queue, and if services share
a database they should not be able to look at the tables that other
services have created.

I would like to go ahead and make a very simple Flask based
implementation of this.  I think that using a simple embedded thread
based scheduler (e.g. http://packages.python.org/APScheduler/) to
queue the tasks, with a database table to store them, would be
sufficient, along the lines of the sketch below.
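
A minimal sketch of that shape (using a plain thread and an in-memory
dict in place of APScheduler and a database table, purely to keep the
example short; the endpoints follow the spec above, everything else is
illustrative):

  import datetime
  import threading
  import uuid

  from flask import Flask, jsonify, request, url_for

  app = Flask(__name__)
  TASKS = {}  # stand-in for the database table that would store tasks

  def run_task(task_id):
      # Stand-in for the real work, e.g. fetching the CSV named in the
      # task's sent_data, parsing it and pushing it to the target_url.
      task = TASKS[task_id]
      try:
          task["data"] = {"note": "processing would happen here"}
          task["status"] = "completed"
      except Exception as exc:
          task["error"] = str(exc)
          task["status"] = "error"
      task["completed_timestamp"] = datetime.datetime.utcnow().isoformat()

  @app.route("/status")
  def status():
      return jsonify({"name": "datastorer",
                      "task_types": ["csv_parse", "store"],
                      "version": "1.0"})

  @app.route("/task")
  def task_list():
      # Most recently requested task first, as suggested above.
      return jsonify([url_for("task_info", task_id=tid, _external=True)
                      for tid in reversed(list(TASKS))])

  @app.route("/task/<task_id>")
  def task_info(task_id):
      if task_id not in TASKS:
          return jsonify({"error": "task not found"}), 404
      return jsonify(TASKS[task_id])

  @app.route("/task", methods=["POST"])
  @app.route("/task/<task_id>", methods=["POST"])
  def task_start(task_id=None):
      body = request.get_json(force=True)
      task_id = task_id or str(uuid.uuid4())
      TASKS[task_id] = {
          "task_id": task_id,
          "task_type": body.get("task_type"),
          "status": "pending",
          "requested_timestamp": datetime.datetime.utcnow().isoformat(),
          "sent_data": body.get("data"),
      }
      threading.Thread(target=run_task, args=(task_id,)).start()
      return jsonify({"task_id": task_id})

  if __name__ == "__main__":
      app.run(port=5001)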

Please tell me what you think of this, or whether I am reinventing the
wheel here.  Before going ahead with any work on this there needs to be
some discussion.

Thanks

David



