[ckan-dev] Specification of data processing web services.

Fri Jun 15 20:35:57 UTC 2012

On Thu, Jun 14, 2012 at 8:49 PM, Friedrich Lindenberg
<friedrich.lindenberg at okfn.org> wrote:
> Hi all,
>
> this is an interesting discussion, cool to see that CKAN services are
> happening. A few comments:
>
> (1) I would distinguish between a task and a run. A task (currently
> task_type, I guess) is a process, a run is an instance of the process.
> This is pretty standardized lingo, I think, so if you don't want task
> multiplexing, let's just rename the whole thing to run.

A "run" sounds very odd as a URL and I am pretty sure that is not
standard.  A list of "runs" is pretty weird English.

Perhaps a "job"?

>
> (2) I don't get the point of synchronous tasks, really. They're just
> web calls, even if you can have some discovery stuff through the API
> (not sure who would ever really use that). The relevant part of this
> proposal is definitely async responses and the handling of webhooks,
> and that doesn't apply to sync processing.
>
> (3) Really can't get my head around the (security) implications of the
> webhook headers option - are you absolutely sure that would be sane?

It's not in the header, it is in the request.  The thing that seems a
problem to me is that the service can post data anywhere you tell it
to, which is pretty bad and makes it act like an open relay.  However,
the service could add a whitelist of domains that it is allowed to
forward things to.
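
To make the whitelist idea concrete, here is a minimal sketch of the
check a service might run before posting to a webhook.  The host names
and the function name are made up for illustration; nothing like this
is in the spec yet.

```python
from urllib.parse import urlsplit

# Hypothetical whitelist; a real service would load this from its config.
ALLOWED_WEBHOOK_HOSTS = {"ckan.example.org", "hooks.example.org"}


def is_allowed_webhook(url, allowed_hosts=ALLOWED_WEBHOOK_HOSTS):
    """Return True only if url is http(s) and its host is whitelisted."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # Malformed URLs (e.g. a broken IPv6 literal) are rejected outright.
        return False
    if parts.scheme not in ("http", "https"):
        return False
    # .hostname is lower-cased and excludes any port or userinfo part,
    # so "user@host:port" tricks do not bypass the comparison.
    host = parts.hostname
    return host is not None and host in allowed_hosts
```

The service would call this on the webhook URL in the request and
refuse the job (or skip the callback) when it returns False.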

>
> (4) In the task status, I would allow for log output. This could also
> be an extra endpoint which yields a list of dicts, each at least with
> timestamp, level, error type and message. Example: the datastorer
> could offer a list of all rows that are malformed in an input file.

I imagined the results/status could be as complicated and rich as you
liked.  I think that specifying what could go in them now is a bit
much.
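
Purely as an illustration, and not as a proposal for the spec, a status
carrying the kind of log entries Friedrich describes could look like
the sketch below.  Only the four log fields come from his mail; the
"status" and "log" keys, the "MalformedRow" name and the helper are
assumptions.

```python
import json
from datetime import datetime, timezone


def log_entry(level, error_type, message, when=None):
    """Build one log record with the four fields suggested in the thread."""
    when = when or datetime.now(timezone.utc)
    return {
        "timestamp": when.isoformat(),
        "level": level,
        "error_type": error_type,
        "message": message,
    }


# Hypothetical status document for a datastorer job.
status = {
    "status": "error",
    "log": [
        log_entry("error", "MalformedRow", "row 17: expected 5 columns, got 3"),
    ],
}

# The whole structure is plain dicts and lists, so it serialises as JSON.
print(json.dumps(status, indent=2))
```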

>
> Regarding the task API I would also consider the trade-offs of trying
> to define a generic async REST convention vs. making something that is
> more specific to CKAN.
>
> CKAN is a store for a set of artifacts. Some of those are references,
> but increasingly they are actually stored there (and should, IMHO, be
> immutable). A CKAN processing framework would define controlled
> transitions between an existing artifact (or artifacts) and a newly
> generated artifact. Such semantics were implemented via the
> relationships API, but it really needs to tie into some processing
> backend - such as the one proposed here.
>
> (cf. http://open-biomed.sourceforge.net/opmv/ns.html#sec-desc)

Interesting.  The point of the services and the distributed web in
general is that these things could grow organically.

I think we would need another service completely, one that has a tree
as its core model (like any ETL software), to do this.  This is not
CKAN and I am not sure we should stretch it to be one.
I like this approach as CKAN would just be a component in that (ETL)
service and does not always have to be the centre.  We may need to add
some support in CKAN for feedback, issues etc. and potentially embed
that service's interface in it.  I see CKAN's job as just publishing
data.

The general immutable artifact approach seems correct though.
>
> Why all the blah-blah? Offering both the async API and a structured
> model for talking about the processes executed in it would help make
> CKAN into an ETL controller of some sort. There are plenty of bits
> missing for this, but I'd be interested to hear if you think this would
> be useful to discuss (and what kind of use cases you imagine).
>
> Cheers,
>
>  - Friedrich
>
>
>
>
> On Thu, Jun 14, 2012 at 6:33 PM, Toby Dacre <toby.okfn at gmail.com> wrote:
>>
>>
>> On 14 June 2012 17:20, David Raznick <kindly at gmail.com> wrote:
>>>
>>> > Any sensible automated task would *never* send a uuid as, as you say the
>>> > service could throw an error so why risk it?
>>> >
>>>
>>> Knowing the id upfront is very useful (you can store that you sent
>>> something out before you send it), and I thought the point of the
>>> uuid was that the chance of a collision is so low that it is worth the
>>> risk. It could be worth the requester namespacing the uuid with
>>> its domain name to make a collision even more unlikely, but that is
>>> most likely overkill.  I think the complication of 2 references is not
>>> worth the tradeoff.
>>
>>
>> It's nice for me to be able to give a reference for something; maybe it's
>> my table id, or for the use you suggest.
>>
>> Anyhow, I think that's useful, but for a prototype it's not an issue; just
>> get it right before you release a non-beta version.
>>
>>
>>
>>>
>>> _______________________________________________
>>> ckan-dev mailing list
>>> ckan-dev at lists.okfn.org
>>> http://lists.okfn.org/mailman/listinfo/ckan-dev