[ckan-dev] Specification of data processing web services.

Thu Jun 14 19:49:01 UTC 2012

Hi all,

this is an interesting discussion, cool to see that CKAN services are
happening. A few comments:

(1) I would distinguish between a task and a run. A task (currently
task_type, I guess) is a process, a run is an instance of the process.
This is pretty standardized lingo, I think, so if you don't want task
multiplexing, let's just rename the whole thing to run.

(2) I don't get the point of synchronous tasks, really. They're just
web calls, even if you can have some discovery stuff through the API
(not sure who would ever really use that). The relevant part of this
proposal is definitely async responses and the handling of webhooks,
and that doesn't apply to sync processing.

(3) Really can't get my head around the (security) implications of the
webhook headers option - are you absolutely sure that would be sane?

(4) In the task status, I would allow for log output. This could also
be an extra endpoint which yields a list of dicts, each at least with
timestamp, level, error type and message. Example: the datastorer
could offer a list of all rows that are malformed in an input file.
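
To make (4) a bit more concrete, here is a rough sketch of what such log
entries could look like. The four keys are the ones suggested above;
everything else (the function names, the "MalformedRow" error type, the
column-count check) is made up for illustration and is not how the
datastorer actually works.

```python
from datetime import datetime, timezone


def make_log_entry(level, error_type, message):
    """Build one log entry with the four fields suggested in (4)."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "error_type": error_type,
        "message": message,
    }


def check_rows(rows, expected_columns):
    """Return one log entry for every row whose column count is wrong.

    Hypothetical stand-in for whatever validation a real service does.
    """
    log = []
    for line_number, row in enumerate(rows, start=1):
        if len(row) != expected_columns:
            log.append(make_log_entry(
                "error",
                "MalformedRow",  # assumed error type name
                "line %d: expected %d columns, got %d"
                % (line_number, expected_columns, len(row)),
            ))
    return log
```

A log endpoint could then just return the result of check_rows() as a JSON
list, so a client can show the user exactly which rows were rejected.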

Regarding the task API I would also consider the trade-offs of trying
to define a generic async REST convention vs. making something that is
more specific to CKAN.

CKAN is a store for a set of artifacts. Some of those are references,
but increasingly they are actually stored there (and should, IMHO, be
immutable). A CKAN processing framework would define controlled
transitions between an existing artifact (or artifacts) and a newly
generated artifact. Such semantics were implemented via the
relationships API, but it really needs to tie into some processing
backend - such as the one proposed here.

(cf. http://open-biomed.sourceforge.net/opmv/ns.html#sec-desc)
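
As a minimal sketch of what I mean by a controlled transition: a run of
some task takes existing artifacts as input and produces exactly one new,
immutable artifact, and the record of that run is what links them. All
class and field names here are hypothetical; this is neither the
relationships API nor a faithful rendering of the vocabulary linked above.

```python
import uuid
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)  # frozen: artifacts cannot be changed once created
class Artifact:
    id: str
    url: str


@dataclass(frozen=True)
class Run:
    """One execution of a task (the 'run' from point 1)."""
    id: str
    task: str                      # name of the process, e.g. "datastorer"
    inputs: Tuple[Artifact, ...]   # artifacts the run consumed
    output: Artifact               # artifact the run generated


def record_run(task, inputs, output_url):
    """Create the new output artifact and the run record that links it
    to its inputs. Nothing is stored; this only builds the records."""
    output = Artifact(id=str(uuid.uuid4()), url=output_url)
    return Run(id=str(uuid.uuid4()), task=task,
               inputs=tuple(inputs), output=output)
```

Given such records, "where did this resource come from?" becomes a lookup
of the run whose output is that artifact, followed by its inputs.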

Why all the blah-blah? Offering both the async API and a structured
model for talking about the processes executed in it would help make
CKAN into an ETL controller of some sort. There are plenty of bits
missing for this, but I'd be interested to hear if you think this would
be useful to discuss (and what kind of use cases you imagine).

Cheers,

 - Friedrich

On Thu, Jun 14, 2012 at 6:33 PM, Toby Dacre <toby.okfn at gmail.com> wrote:
>
>
> On 14 June 2012 17:20, David Raznick <kindly at gmail.com> wrote:
>>
>> > Any sensible automated task would *never* send a uuid as, as you say the
>> > service could throw an error so why risk it?
>> >
>>
>> Knowing the id upfront is very useful (you can record that you sent
>> something out before you send it), and I thought the point of the
>> uuid was that the chance of a collision is so low that it is worth the
>> risk. It could be worth the requester namespacing the uuid with
>> its domain name to make a collision even more unlikely, but that is
>> most likely overkill.  I think the complication of 2 references is not
>> worth the tradeoff.
>
>
> it's nice for me to give a reference for something maybe it's my table id or
> for the use you suggest
>
> anyhow I think that's useful but for a prototype it's not an issue just get
> it right before you release a non-beta version
>
>
>
>>
>> _______________________________________________
>> ckan-dev mailing list
>> ckan-dev at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/ckan-dev