[ckan-discuss] CKAN feature roadmap. Support VOID files and SPARQL service description

Timothy Lebo lebot at rpi.edu
Wed Jul 10 17:23:39 BST 2013


Jerven,

On Jul 10, 2013, at 9:39 AM, Jerven Bolleman <me at jerven.eu> wrote:

> Hi Tim,
> 
> This shows that not much work is needed for CKAN to support VoID files as a source for dataset descriptors.

I'm glad it helps point the way.

> However, to be honest, it does not solve what I see as the problem with the CKAN/datahub.io approach:
> a push model from providers is not sustainable in the long run.
> What is needed instead is an information pull model, i.e. crawling for dataset descriptors on the web and
> regular updates.


I am certain that the code I point to embraces your pull model. As I said, I do *one* HTTP dereference to your data site to obtain a single VoID representation, then I walk it and stuff it into CKAN.
That's pull. And anybody can do it, not necessarily the original data provider. So, a CKAN instance could do it as part of its "harvesting" (which, I must admit, I have no direct experience with, so perhaps a CKANer can fill in some details there).
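
To make the "dereference once, walk it, stuff it into CKAN" idea concrete, here is a minimal sketch in Python. It is not the add-metadata.py script; it assumes the VoID comes back as plain RDF/XML with void:Dataset written as a typed element directly under rdf:RDF, and the names fetch_void, void_to_ckan, SAMPLE_VOID, and the choice of "extras" keys are all made up for illustration. A real walker would use a proper RDF parser.

```python
import urllib.request
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "void": "http://rdfs.org/ns/void#",
    "dcterms": "http://purl.org/dc/terms/",
}
RDF_RESOURCE = "{%s}resource" % NS["rdf"]

# Made-up VoID description, used only so the sketch runs offline.
SAMPLE_VOID = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:void="http://rdfs.org/ns/void#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <void:Dataset rdf:about="http://example.org/void#example">
    <dcterms:title>Example dataset</dcterms:title>
    <dcterms:description>A made-up dataset for illustration.</dcterms:description>
    <void:triples>12345</void:triples>
    <void:sparqlEndpoint rdf:resource="http://example.org/sparql"/>
  </void:Dataset>
</rdf:RDF>
"""


def fetch_void(url):
    """The single HTTP dereference, asking for RDF/XML (not called below)."""
    request = urllib.request.Request(url, headers={"Accept": "application/rdf+xml"})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def void_to_ckan(rdf_xml, name):
    """Walk the first void:Dataset and build a CKAN-style package dict."""
    root = ET.fromstring(rdf_xml)
    dataset = root.find("void:Dataset", NS)
    if dataset is None:
        raise ValueError("no void:Dataset element found")
    package = {
        "name": name,
        "title": dataset.findtext("dcterms:title", default="", namespaces=NS),
        "notes": dataset.findtext("dcterms:description", default="", namespaces=NS),
        "extras": [],      # illustrative key names, not a fixed convention
        "resources": [],
    }
    triples = dataset.findtext("void:triples", namespaces=NS)
    if triples is not None:
        package["extras"].append({"key": "triples", "value": triples.strip()})
    endpoint = dataset.find("void:sparqlEndpoint", NS)
    if endpoint is not None and endpoint.get(RDF_RESOURCE):
        package["resources"].append(
            {"url": endpoint.get(RDF_RESOURCE), "format": "api/sparql"}
        )
    return package


package = void_to_ckan(SAMPLE_VOID, "example")
print(package)
```

Anyone can run something like this against a published /void URI; the data provider only has to keep the VoID itself correct.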

> 
> I currently need to bend my infrastructure and spend time for CKAN registration for no real benefit to me.

The only work that you should do with my proposal is "get the VoID description right", which I would hope should be within your interests anyway.


> And I need to do this for all other dataset aggregators.

I agree that they, too, should be aggregating based on your VoID, as you mention.

> This does not scale, which is why the UniProt ckan 
> data is badly out of date and not structured the way I would like it.

Perhaps I can invoke my add-metadata.py script against your VoID, and we can see how it goes?

> 
> To solve this problem in the long run, CKAN needs to start pulling data from provided structured sources

yes, as VoID :-)

> instead of 
> making me and everyone else push information into CKAN all the time.

Agreed. Just publish your VoID and let anyone run my script.
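
For the "stuff it into CKAN" half, a rough sketch of what the update call could look like, assuming the target instance exposes the CKAN action API's package_update endpoint and accepts an API key in the Authorization header. The helper name build_package_update, the example URL, and the key are hypothetical, and this is not necessarily how add-metadata.py talks to CKAN. The request is only built here, never sent.

```python
import json
import urllib.request


def build_package_update(ckan_url, api_key, package):
    """Build (but do not send) a POST to the CKAN action API."""
    body = json.dumps(package).encode("utf-8")
    return urllib.request.Request(
        ckan_url.rstrip("/") + "/api/3/action/package_update",
        data=body,  # supplying data makes urllib issue a POST
        headers={
            "Content-Type": "application/json",
            "Authorization": api_key,  # hypothetical key, see lead-in
        },
    )


request = build_package_update(
    "http://ckan.example.org/",   # placeholder CKAN instance
    "hypothetical-api-key",
    {
        "name": "example",
        "title": "Example dataset",
        "extras": [{"key": "triples", "value": "12345"}],
    },
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

One caveat worth checking against the CKAN version in use: package_update is generally described as replacing the whole dataset record, so a nightly job should send the complete description derived from the VoID rather than only the changed fields.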

Best,
Tim

> 
> Regards,
> Jerven
> 
> 
> 
> On Jul 5, 2013, at 3:53 PM, Timothy Lebo wrote:
> 
>> Jerven,
>> 
>> I have a python (optional SADI-based) script [1] that will walk a "good" VoID file and lower the descriptions into the CKAN representation.
>> I call this nightly with cron and feed it the VoID that resolves from my data site's /void URI.
>> 
>> e.g., you can see the daily updates as more vocabularies are used and an example URI is added:
>> http://datahub.io/dataset/history/ichoose
>> 
>> Datasets in the http://datahub.io/group/prizms group do this based on my Prizms linked data integration and publication platform [2].
>> 
>> HTH.
>> 
>> Regards,
>> Tim Lebo
>> 
>> 
>> [1] https://github.com/timrdf/DataFAQs/blob/master/services/sadi/ckan/add-metadata.py
>> [2] https://github.com/timrdf/prizms/wiki
>> 
>> 
>> 
>> 
>> On Jul 4, 2013, at 10:03 AM, Mark Wainwright <mark.wainwright at okfn.org> wrote:
>> 
>>> This is interesting, though I'm not sure how it would work in
>>> practice. E.g. would it be sufficient for you to have a tool you could
>>> run (by invoking something like "voidckanupdate void.rdf
>>> http://datahub.io") which automatically extracted the information you
>>> wanted to record from the VoID file, and updated the Datahub via the
>>> API?
>>> 
>>> Mark
>>> 
>>> 
>>> On 04/07/2013, Jerven Bolleman <me at jerven.eu> wrote:
>>>> The number of triples, number of links to other datasets, last update,
>>>> etc...
>>>> 
>>>> Mainly we need one point for maintaining this kind of data that is pulled.
>>>> Instead of the current approach of
>>>> visit datahub.io make changes manually
>>>> visit identifiers.org make changes manually
>>>> visit biodbcore make changes manually
>>>> etc...
>>>> 
>>>> i.e. currently, as a large data provider, we need to visit quite a lot of
>>>> sites of this kind to fill in and maintain all dataset metadata.
>>>> This is not sustainable which is why I am happy that the other sites are
>>>> looking into parsing VoID files.
>>>> I hope that the datahub.io can do so as well.
>>>> 
>>>> Regards,
>>>> Jerven
>>>> 
>>>> 
>>>> 
>>>> On Thu, Jul 4, 2013 at 3:32 PM, Mark Wainwright
>>>> <mark.wainwright at okfn.org> wrote:
>>>> 
>>>>> Hmm, I guess the common use case is for metadata that doesn't change
>>>>> every month (address, type, description, licence, etc). What is it
>>>>> you're updating monthly? What specific functionality on the Datahub
>>>>> are you suggesting?
>>>>> 
>>>>> Mark
>>>>> 
>>>>> On 04/07/2013, Jerven Bolleman <me at jerven.eu> wrote:
>>>>>> Hi All,
>>>>>> 
>>>>>> This is a desired feature to remove manual overhead of maintaining the same
>>>>>> dataset information in many different databases of databases.
>>>>>> 
>>>>>> For example, the UniProt SPARQL endpoint has metadata in its service
>>>>>> description, which you can retrieve here:
>>>>>> 
>>>>>> wget --header="Accept:application/rdf+xml"
>>>>>> "http://beta.sparql.uniprot.org/"
>>>>>> (Expect major improvements to this output in the coming months)
>>>>>> 
>>>>>> Or the attached void file.
>>>>>> 
>>>>>> Instead of us updating all this information manually every month, we
>>>>>> would rather generate a single VoID file that other tools and lists besides
>>>>>> datahub could use as well.
>>>>>> 
>>>>>> Regards,
>>>>>> Jerven
>>>>>> 
>>>>>> PS. now with gzipped void file.
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Jerven Bolleman
>>>>>> me at jerven.eu
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Business development and user engagement manager
>>>>> The Open Knowledge Foundation
>>>>> Empowering through Open Knowledge
>>>>> http://okfn.org/  |  @okfn  |  http://ckan.org  |  @CKANproject
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Jerven Bolleman
>>>> me at jerven.eu
>>>> 
>>> 
>>> 
>>> -- 
>>> Business development and user engagement manager
>>> The Open Knowledge Foundation
>>> Empowering through Open Knowledge
>>> http://okfn.org/  |  @okfn  |  http://ckan.org  |  @CKANproject
>>> 
>>> _______________________________________________
>>> ckan-discuss mailing list
>>> ckan-discuss at lists.okfn.org
>>> http://lists.okfn.org/mailman/listinfo/ckan-discuss
>>> Unsubscribe: http://lists.okfn.org/mailman/options/ckan-discuss
>>> 
>> 
> 
> 



