[ckan-dev] Specification of data processing web services.

Toby Dacre toby.okfn at gmail.com
Thu Jun 14 12:20:12 UTC 2012


On 14 June 2012 12:36, David Raznick <kindly at gmail.com> wrote:

> Hello All
>
> This is a very rough specification for web services whose main purpose
> is processing data for use with CKAN, but which can also be used as
> standalone services in their own right. These services can be split
> into two types: "long running" and "synchronous".
>
> An example of a long running service is the datastorer, which can
> fetch delimited data, parse it, and send it to the CKAN datastore.
> It would be good if people could use this service independently of
> CKAN and get data into CKAN's datastore (or any Elasticsearch
> endpoint) just by choosing a CSV file location and a datastore
> location. This would allow interfaces on top of these services to
> offer more control over the process. This service would also be used
> by CKAN itself.
>
> An example of a synchronous service is one that parses a CSV file and
> returns a JSON representation of it in the same response. This could
> be used, for example, by the datastorer service above to do its
> parsing.
>
> Making all services follow a single base specification would be very
> useful for interoperability and ease of use.
>
>
> **GET endpoints**
>
> /status
> Service-level information, used to see whether the service is up and
> running and which task types it has available.
> e.g. {"name": "datastorer", "task_types": ["csv_parse", "store"],
> "version": "1.0"}
>
> /task/task_id
> Return information on a particular task.
> e.g.
>  {
>  "task_type": "csv_parse",
>  "task_id": "fdfdsfsafdsfafdfa",
>  "status": "pending",  # one of pending, error or completed
>
Is "aborted" a useful status for user-aborted tasks?

>  "requested_timestamp": "2013-01-01 19:12:33"
>  "sent_data": {"source_url": "www.somedata.org/data.csv",
> "target_url": "www.thedatahub.io/api/data/abf3131-3213-312-321"},
>  }
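
As a sketch, polling a task from Python with the requests library (host
and id are the example values used above, not real endpoints):

    import requests

    # Fetch the task document shown above and check its status.
    resp = requests.get("http://datastorer.datahub.io/task/fdfdsfsafdsfafdfa")
    task = resp.json()
    if task["status"] == "pending":
        print("still running...")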
>
> The service has no obligation to store its tasks after they have been
> completed; however, for long running tasks it would be useful to keep
> them around for a period, especially ones that have errors.
>
> /task
> Return a list of fully qualified URLs of the currently stored task
> information, ordered by most recently requested task first.
> e.g. ["http://datastorer.datahub.io/task/fdsafdsaffdsafsd",
>    "http://datastorer.datahub.io/task/my_task_id"]
> At some point this URL should take query params to help search or
> filter for particular tasks, and should support limits to stop too
> many tasks being returned.
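
For example, hypothetical filter parameters (not part of the spec yet;
"status" and "limit" are made-up names for illustration) might be
passed like this:

    import requests

    resp = requests.get("http://datastorer.datahub.io/task",
                        params={"status": "error", "limit": 10})
    task_urls = resp.json()  # list of fully qualified task URLs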
>
> **POST endpoints**
>
> /task/task_id
> Start a new task with a specified task_id.
>

I don't like this, as task_id may not be unique. What is the point of
this? Maybe setting a user reference would meet the same need.

>
> /task
> Start a new task. (This should generate a uuid for long running tasks.)
>
Should the uuid not be universal, even if it is just thrown away by the
service?

>
> **Task data sent to services**
>
> The task data posted to the services should consist of the
> following:
>
> data: data to be posted
>
> result_url (optional): the URL that the results should be sent to.
> This should be specified by long running tasks. e.g.
> www.thedatahub.io/api/action/task_status
>

I don't like the name, but I can't think of a better one at the moment.

>
> api_key (optional): the API key needed to post the results to the
> result_url. e.g. frewoitryu398wtrhw
>
This seems too limiting. Do we want something more like:

  "send_result": {
    "method": "POST",
    "url": "...",
    "params": {"api_key": "...", "something_else": "..."},
    "http_headers": [{"X-API-KEY": "..."}, ...]
  }

> task_type (optional): if the service has multiple task types, specify
> which one you are using. e.g. datastorer. Services with multiple
> task types should say which are available in their '/status' call.
>
No idea what you are saying here.

> e.g.
>  {
>  "data": {"source_url": "www.somedata.org/data.csv", "target_url":
> "www.thedatahub.io/api/data/abf3131-3213-312-321" },
>  "result_url": "www.thedatahub.io/api/action/task_status",
>  "api_key":  "fdsafsaffasdsaf",
>  "task_type": "csv_parse"
>  }
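
As a sketch, posting this task from Python; for a long running task the
service would come back immediately with just the task_id (the host is
the example one used earlier):

    import requests

    task = {
        "data": {"source_url": "www.somedata.org/data.csv",
                 "target_url": "www.thedatahub.io/api/data/abf3131-3213-312-321"},
        "result_url": "www.thedatahub.io/api/action/task_status",
        "api_key": "fdsafsaffasdsaf",
        "task_type": "csv_parse",
    }

    resp = requests.post("http://datastorer.datahub.io/task", json=task)
    task_id = resp.json()["task_id"]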
>
>
> **Data returned from services**
>
> The tasks should return:
>
> * A result JSON if the task is not long running.
> * A long running task should just return the task_id, like the
> following: {"task_id": "the_task_id"}
> * If there is an error, the service should return the appropriate
> HTTP status code with a JSON body describing the error.
>
> If a result_url is specified, the result JSON or the error JSON
> should be sent to that URL.
>
> The result JSON should be of the form:
>
>  {
>  "task_type": "csv_parse",
>  "task_id": "the_task_id",
>  "requested_timestamp": "2013-01-01 19:12:33",
>  "completed_timestamp": "2013-01-01 20:12:33",
>  "sent_data": {"source_url": "www.somedata.org/data.csv",
> "target_url": "www.thedatahub.io/api/data/abf3131-3213-312-321"},
>  "data": {"some": "data"}
>  }
>
>
> If there is an error, the JSON should be of the form:
>  {
>  "task_type": "csv_parse",
>  "task_id": "fdfdsfsafdsfafdfa",
>  "requested_timestamp": "2013-01-01 19:12:33",
>  "completed_timestamp": "2013-01-01 20:12:33",
>  "sent_data": {"source_url": "www.somedata.org/data.csv",
> "target_url": "www.thedatahub.io/api/data/abf3131-3213-312-321"},
>  "error": "error info"
>  }
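
On the service side, delivering the result (or error) JSON to the
result_url might look like the sketch below. Sending the api_key as an
X-API-KEY header is just one option; the send_result proposal above
would make exactly this configurable:

    import requests

    def deliver_result(result, result_url, api_key):
        # Push the result (or error) json back to the caller once the
        # task has finished.
        requests.post(result_url, json=result,
                      headers={"X-API-KEY": api_key})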
>
> **Durability**
>
> Each service can manage the tasks it receives however it wants. It
> should always try to return something, either an error or a result.
> It is up to each service how it queues up long running tasks.
> It is up to the sender to retry if the service is not available.
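
A simple sender-side retry, as a sketch (attempt counts and delays are
arbitrary):

    import time
    import requests

    def post_with_retry(url, payload, attempts=3, delay=5):
        # The sender is responsible for retrying when the service is
        # unavailable; the service itself makes no delivery guarantees.
        for _ in range(attempts):
            try:
                return requests.post(url, json=payload, timeout=30)
            except requests.ConnectionError:
                time.sleep(delay)
        raise RuntimeError("service unavailable after %d attempts" % attempts)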
>
> **Implementation**
>
> There is no restriction on the framework or language used to
> implement each service. It is advised that each service be as
> independent as possible. There should be no centralised queue, and
> if services share a database they should not be able to look at the
> tables that other services have created.
>
> I would like to go ahead and make a very simple Flask based
> implementation of this. I think that using a simple embedded thread
> based scheduler (e.g. http://packages.python.org/APScheduler/) to
> queue the tasks, with a database table to store them, would be
> sufficient for this.
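
For illustration, a rough sketch of that idea using Flask and
APScheduler's background scheduler (3.x API; the in-memory dict stands
in for the database table, and run_task is a made-up placeholder):

    import uuid
    from apscheduler.schedulers.background import BackgroundScheduler
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    scheduler = BackgroundScheduler()
    scheduler.start()

    tasks = {}  # stand-in for a database table holding task state

    def run_task(task_id, payload):
        # The actual work (csv parsing, storing, ...) would happen here.
        tasks[task_id]["status"] = "completed"

    @app.route("/task", methods=["POST"])
    def create_task():
        payload = request.get_json()
        task_id = str(uuid.uuid4())
        tasks[task_id] = {"status": "pending", "sent_data": payload["data"]}
        # Queue the job on the embedded thread-based scheduler; with no
        # trigger given, add_job runs it once, as soon as possible.
        scheduler.add_job(run_task, args=[task_id, payload])
        return jsonify(task_id=task_id)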
>
> Please tell me what you think of this, or whether I am reinventing
> the wheel here. Before going ahead with any work on this there needs
> to be some discussion.
>

What about authentication and things like that?

>
> Thanks
>
> David
>
> _______________________________________________
> ckan-dev mailing list
> ckan-dev at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/ckan-dev
>