[ckan-discuss] API For Package Name

William Waites ww at eris.okfn.org
Wed Dec 1 10:25:46 GMT 2010

* [2010-11-28 23:12:33 +0000] John Bywater <john.bywater at appropriatesoftware.net> writes:

] >Yes, the RDF/JSON is basically used to avoid needing to parse RDF. 
] >Apart from that it is just another serialisation, the data model stays
] >the same.
] >
] You mean parse RDF/XML? Does RDF/JSON not have exactly the same 
] complexity? How is it adequate otherwise? Questions for myself perhaps...

Yes, I meant parsing RDF/{XML,N3,NT}. RDF/JSON is equivalent, but it is
natively JSON, so a JSON program can just eval() it.
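Concretely, the RDF/JSON shape is just nested JSON objects (subject URI, then predicate URI, then a list of value objects), so an ordinary JSON parser is all you need. A minimal sketch, with a made-up package URI and title:

```python
import json

# A minimal RDF/JSON document: subject URI -> predicate URI -> list of
# value objects. The package URI and title here are illustrative.
doc = """
{
  "http://ckan.net/package/example": {
    "http://purl.org/dc/terms/title": [
      {"value": "Example Package", "type": "literal"}
    ]
  }
}
"""

graph = json.loads(doc)  # no RDF parser needed, just the JSON one
for subject, predicates in graph.items():
    for predicate, objects in predicates.items():
        for obj in objects:
            print(subject, predicate, obj["value"])
```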

] Would any other tools be able to take advantage of it? It seems to be 
] similar to CSV. CSV also doesn't "have a concept of" URLs. I've always 
] liked the idea of using URLs as identifiers in the API, but I'm just 
] trying identify exactly what we gain by introducing URLs into the data 
] formats.

I guess it means that third parties can write statements about the
packages, because the URI gives them a way to refer to them. So merely
by using URIs we make it possible for others to write RDF about our
data. A good example of this is the licenses: if the license field
were a URI, I could, for example, go and model the license
independently and it would still join up.
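To make the "joins up" point concrete, here is a sketch of two RDF/JSON fragments written independently that merge mechanically because both use the same license URI as an identifier. All URIs below are illustrative, not CKAN's actual ones:

```python
# A CKAN-side description pointing at a license by URI...
from_ckan = {
    "http://ckan.net/package/example": {
        "http://purl.org/dc/terms/license": [
            {"value": "http://www.opendefinition.org/licenses/cc-by", "type": "uri"}
        ]
    }
}

# ...and an independently-written model of that license, keyed by the
# same URI.
license_model = {
    "http://www.opendefinition.org/licenses/cc-by": {
        "http://purl.org/dc/terms/title": [
            {"value": "Creative Commons Attribution", "type": "literal"}
        ]
    }
}

# Merging is just combining the two maps subject by subject.
merged = {}
for graph in (from_ckan, license_model):
    for subject, predicates in graph.items():
        merged.setdefault(subject, {}).update(predicates)

# The package's license URI now resolves, within the merged graph, to
# the independent description of that license.
lic = from_ckan["http://ckan.net/package/example"]["http://purl.org/dc/terms/license"][0]["value"]
print(merged[lic])
```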

] Sounds good! At the same time, it occurs to me that many of the other 
] interfaces I've seen over the last few days present the different 
] content types at the same locations. I can't remember seeing one that 
] redirects to a different domain. Is that a common or expected thing
] to do?

Good question. There's definitely nothing wrong with redirecting to a
different location, but it might be unusual. One circumstance where
it happens commonly is where people use services like
http://purl.org/, which are purely redirect services. There are
definitely ways we could work around this: one is by putting the RDF
generation directly in CKAN, another is by playing games with
the web server config to hook certain URLs before they get to the
CKAN application.
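One way to do that hooking without touching the web server config at all would be a small piece of WSGI middleware in front of the CKAN app. A sketch, where both apps and the path convention are hypothetical:

```python
# Middleware that intercepts RDF-ish URLs before they reach the main
# application. The suffix convention and the two stub apps are made up
# for illustration.
def rdf_hook(app, serve_rdf):
    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path.endswith((".rdf", ".n3", ".nt")):
            return serve_rdf(environ, start_response)
        return app(environ, start_response)
    return middleware

def ckan_app(environ, start_response):
    # Stand-in for the real CKAN WSGI application.
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"CKAN"]

def rdf_app(environ, start_response):
    # Stand-in for whatever serves the RDF serialisations.
    start_response("200 OK", [("Content-Type", "application/rdf+xml")])
    return [b"<rdf:RDF/>"]

app = rdf_hook(ckan_app, rdf_app)
```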

] I don't know (think you might have told me once...) what exactly the 
] cohesive mechanism is behind the RDF service. Whatever it is, perhaps we 
] could at least consider calling the service from CKAN's API controllers 
] and have the RDF content returned directly? We could also support the 
] extensions .json .rdf .n3 so content type can be specified in the locator.

The way it is implemented now, there is a nightly cron job that crawls
the API and writes out flat files. A very simple little content
auto-negotiation script handles serving out the different
serialisations. I would eventually like to change this to listen to
the queue or RSS feed and only regenerate the ones that have changed...
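A toy version of what that auto-negotiation script has to do: map the request's Accept header to one of the flat-file serialisations. Real Accept handling also involves q-value preference ordering; this sketch ignores that, and the media-type-to-extension table is illustrative:

```python
# Which flat file to serve for which media type (illustrative table).
SERIALISATIONS = {
    "application/rdf+xml": ".rdf",
    "text/n3": ".n3",
    "application/json": ".json",
}

def pick_extension(accept_header, default=".rdf"):
    """Return the flat-file extension for the first acceptable media type."""
    for part in accept_header.split(","):
        mime = part.split(";")[0].strip()  # drop q-value parameters
        if mime in SERIALISATIONS:
            return SERIALISATIONS[mime]
        if mime == "*/*":
            return default
    return default

print(pick_extension("text/n3,application/rdf+xml;q=0.8"))  # prints .n3
```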

One reason to keep this outside of CKAN is the extensions, of which
more below.

] >It already does: http://semantic.ckan.net/sparql and above links...
] That's really great. I hadn't visited those pages before. You'll have to 
] explain to me how it works. I'm guessing (remembering?) there is a triple 
] store and you update it from CKAN's API?

Thanks :)

The cron job above uses rdflib to build the RDF representation. In
the catalogue.data.gov.uk case it just uses a memory store as
temporary storage before writing out the flat files. These get tarred
up and rsynced to the same machine as semantic.ckan.net and then dumped
into the Virtuoso store for querying. For semantic.ckan.net it just
uses the Virtuoso store instead of the memory store, so there is no
need for the extra step. When you request an individual .rdf or .n3
file it just comes off the disk, though.

] >There's always room for improvement of course, most immediately
] >separating out extension descriptions (e.g. so that Richard can
] >generate voiD separately and have it pulled into the store) and doing
] >something similar for the other instances, but I think CKAN already
] >has the proverbial 5 stars.
] >
] That's great. What do you mean by "separating out extension descriptions"?

Right. So there is some generic information about each package that we
can express with DCAT/DC and friends. Then there are the packages in
the LODCloud group. They have extra metadata that only makes sense for
RDF datasets (SPARQL endpoints, example resources, counts of triples,
links to other datasets, etc.). There is a vocabulary for this,
called voiD [0]. A void:Dataset could be understood as a subclass of
a dcat:Dataset (I'm not sure if this has been explicitly declared to
be so; there was some discussion about it).

So now, the ckanrdf scripts [1] generate the DCAT stuff, look at the
extras and tags, and also generate voiD if they look RDF-related. This
is fine as far as it goes, but sooner or later it will lead to a
maintenance nightmare. What we really want is for the curators of
groups, who have a much better idea what their extras conventions are
and how they might map to RDF, to be able to generate these extra bits
of description themselves, and to have a mechanism for contributing
them back.
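As a sketch of what such a per-group generator might look like: a function mapping a package's extras to voiD statements as N-Triples strings. The extras keys and the helper are hypothetical, not the actual ckanrdf code; only the void: namespace and its sparqlEndpoint/triples terms are real:

```python
VOID = "http://rdfs.org/ns/void#"

def void_triples(dataset_uri, extras):
    """Emit voiD N-Triples for the extras this (made-up) convention knows."""
    triples = []
    if "sparql_endpoint" in extras:
        triples.append("<%s> <%ssparqlEndpoint> <%s> ." %
                       (dataset_uri, VOID, extras["sparql_endpoint"]))
    if "triples" in extras:
        triples.append('<%s> <%striples> "%s" .' %
                       (dataset_uri, VOID, extras["triples"]))
    return triples

for t in void_triples("http://ckan.net/package/example",
                      {"sparql_endpoint": "http://example.org/sparql",
                       "triples": "1000000"}):
    print(t)
```

A group curator could maintain just the extras-to-vocabulary mapping in a script like this, in whatever language they like, and contribute the resulting triples back.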

Another example is data.gov.uk. Our generic dataset-describing script
shouldn't need to know the (surprisingly intricate) details of how to
determine and reference a UK public body, so we want a separate
process to do that.

Yet another is the library linked data group, which will probably end
up with a bunch of library data in RDF-related extras and tags that
are really irrelevant to the other datasets.

By separating these things out we also enable the community to take on
some of the work for the datasets that they care about and to do so in
whichever programming idiom they are most comfortable (maybe they like
PHP or Ruby more than Python).

[0] http://vocab.deri.ie/void
[1] http://bitbucket.org/ww/ckanrdf

] Is the separate "semantic" hostname particularly desirable? If so, would 
] it be desirable to have a "semantic." companion for each CKAN site?

In general I don't really think so. It would be nice to have them
together in some way. We definitely want to arrange things so that all
their catalogues end up in one triplestore, so you can easily do
queries across them, but that's a slightly separate issue. We could
just have some space within semantic.ckan.net/{ca,de,ie,...}/...; that
would be one option. Putting it directly in {ca,de,ie,...}.ckan.net
means we have to manage harvesting of community-provided descriptions
within those installations in some way, though...

] If RDF was returned by the API, it could be returned by resources such as:
] http://catalogue.data.gov.uk/api
] What's '{"version": "1"}' in RDF? :-)

Strictly speaking, this bit of JSON is underspecified: "version 1" of
what? It seems obvious if you (as a human) look at that URI, break it
apart, do some background research, discover that the site runs CKAN
and that CKAN has an API, and infer that that's probably what that bit
of JSON refers to (and not the version of CKAN, for instance)...

There is some way of expressing software package versions (see DOAP,
which you may be familiar with from PyPI).

So maybe you have something like,

<http://catalogue.data.gov.uk/api> :supports ckan:APIv1 .

ckan:API rdfs:label "CKAN API";
         foaf:homepage <http://ckan.org/wiki/page/describing/etc>;
         dcterms:hasVersion ckan:APIv1.

ckan:APIv1 rdfs:label "CKAN APIv1";
         foaf:homepage <http://ckan.org/wiki/page/describing/v1>.

(I made up the :supports predicate; there might be an existing one
that is good to use, or we might have to invent our own. We would also
need to map the ckan: prefix to a namespace where we describe CKAN.)

] Whatever the service architecture, given there are so many possibilities 
] in between that appear to offer little but agony, we're in great shape. 
] I think we could very usefully document these different interfaces (the 
] Web Interface, the Semantic Interface, and the Domain Model Interface) 
] as a coherent multi-channel provision. I know the Web UI package details 
] page presents links to package resources in different 
] formats. But we could make something more of the range of different 
] service capabilities. Now that we've identified the rather different 
] worlds each addresses, perhaps we could document the different 
] engineering purposes?
] Or am I just catching up with what everybody already knows? :-)

No, certainly as always there is a general lack of documentation for
all of this. It's discoverable in a "follow your nose" sense but we
really should work to make it more obvious. We really do need to work
out our conventions for the other CKAN instances in terms of minting
URIs and decide on things like moving some of the logic up into the
webserver config, etc.


William Waites
9C7E F636 52F6 1004 E40A  E565 98E3 BBF3 8320 7664
