[ckan-discuss] CKAN feature roadmap. Support VOID files and SPARQL service description

Ross Jones ross at servercode.co.uk
Wed Jul 10 15:03:35 BST 2013


Jerven,

Might this be something that could be implemented as an extension to the harvester at https://github.com/okfn/ckanext-harvest?
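For what it's worth, most of the work in such an extension would be the VoID-to-CKAN field mapping. A standalone sketch of that mapping, assuming a simple RDF/XML VoID file; the function name `harvest_void` and the extras keys are illustrative only, and this does not use the actual ckanext-harvest plugin interface (which defines gather/fetch/import stages on an IHarvester plugin):

```python
# Illustrative sketch: map void:Dataset descriptions onto CKAN-style
# package dicts. Uses only the standard library (xml.etree on RDF/XML);
# a real harvester would use a proper RDF parser and the IHarvester API.
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
VOID = "{http://rdfs.org/ns/void#}"
DCT = "{http://purl.org/dc/terms/}"

SAMPLE_VOID = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:void="http://rdfs.org/ns/void#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <void:Dataset rdf:about="http://example.org/void#uniprot">
    <dcterms:title>UniProt</dcterms:title>
    <void:triples>12345</void:triples>
    <void:sparqlEndpoint rdf:resource="http://beta.sparql.uniprot.org/"/>
  </void:Dataset>
</rdf:RDF>"""

def harvest_void(rdf_xml):
    """Turn each void:Dataset in an RDF/XML document into a package dict."""
    root = ET.fromstring(rdf_xml)
    packages = []
    for ds in root.iter(VOID + "Dataset"):
        title = ds.findtext(DCT + "title")
        triples = ds.findtext(VOID + "triples")
        endpoint = ds.find(VOID + "sparqlEndpoint")
        packages.append({
            "name": title.lower() if title else None,
            "title": title,
            "extras": {
                "triples": int(triples) if triples else None,
                "sparql_endpoint": endpoint.get(RDF + "resource")
                                   if endpoint is not None else None,
            },
        })
    return packages
```

The pull model Jerven asks for would then just be this function run on a schedule against each provider's published /void URI.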

Ross


On 10 Jul 2013, at 14:39, Jerven Bolleman <me at jerven.eu> wrote:

> Hi Tim,
> 
> This shows that not much work is needed for CKAN to support VoID files as a source for dataset descriptors.
> However, to be honest, it does not address what I see as the real problem in the CKAN/datahub.io approach:
> a push model from providers is not sustainable in the long run.
> What is needed instead is an information pull model, i.e. crawling the web for dataset descriptors and
> updating them regularly.
> 
> I currently need to bend my infrastructure and spend time on CKAN registration for no real benefit to me.
> And I need to do the same for every other dataset aggregator. This does not scale, which is why the UniProt CKAN
> data is badly out of date and not structured the way I would like it.
> 
> To solve this problem in the long run, CKAN needs to start pulling data from structured sources provided by publishers, instead of
> making me and everyone else push information into CKAN all the time.
> 
> Regards,
> Jerven
> 
> 
> 
> On Jul 5, 2013, at 3:53 PM, Timothy Lebo wrote:
> 
>> Jerven,
>> 
>> I have a python (optional SADI-based) script [1] that will walk a "good" VoID file and lower the descriptions into the CKAN representation.
>> I call this nightly with cron and feed it the VoID that resolves from my data site's /void URI.
>> 
>> e.g., you can see the daily updates as more vocabularies are used and an example URI is added:
>> http://datahub.io/dataset/history/ichoose
>> 
>> Datasets in the http://datahub.io/group/prizms group do this based on my Prizms linked data integration and publication platform [2].
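[For readers of the archive: the CKAN side of a nightly push like Tim's is a POST to the CKAN action API. A minimal stdlib sketch, building but not sending the request; the host, API key, and dataset fields are placeholders, and this is not excerpted from add-metadata.py itself:]

```python
# Sketch: build a CKAN action-API request (package_update) of the kind a
# cron-driven updater would send. /api/3/action/package_update is the
# standard CKAN endpoint; CKAN reads the API key from the Authorization
# header. The request is constructed but deliberately not sent here.
import json
import urllib.request

def build_update_request(host, api_key, dataset):
    """Build (but do not send) a CKAN package_update request."""
    body = json.dumps(dataset).encode("utf-8")
    return urllib.request.Request(
        url=host.rstrip("/") + "/api/3/action/package_update",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": api_key,
        },
        method="POST",
    )

req = build_update_request(
    "http://datahub.io", "my-api-key",
    {"name": "ichoose", "extras": [{"key": "triples", "value": "12345"}]},
)
# urllib.request.urlopen(req) would perform the actual update
```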
>> 
>> HTH.
>> 
>> Regards,
>> Tim Lebo
>> 
>> 
>> [1] https://github.com/timrdf/DataFAQs/blob/master/services/sadi/ckan/add-metadata.py
>> [2] https://github.com/timrdf/prizms/wiki
>> 
>> 
>> 
>> 
>> On Jul 4, 2013, at 10:03 AM, Mark Wainwright <mark.wainwright at okfn.org> wrote:
>> 
>>> This is interesting, though I'm not sure how it would work in
>>> practice. E.g. would it be sufficient for you to have a tool you could
>>> run (by invoking something like "voidckanupdate void.rdf
>>> http://datahub.io") which automatically extracted the information you
>>> wanted to record from the VoID file, and updated the Datahub via the
>>> API?
>>> 
>>> Mark
>>> 
>>> 
>>> On 04/07/2013, Jerven Bolleman <me at jerven.eu> wrote:
>>>> The number of triples, the number of links to other datasets, the last
>>>> update, etc...
>>>> 
>>>> Mainly, we need a single point for maintaining this kind of data, from
>>>> which it can be pulled, instead of the current approach:
>>>> - visit datahub.io and make changes manually
>>>> - visit identifiers.org and make changes manually
>>>> - visit biodbcore and make changes manually
>>>> - etc.
>>>> 
>>>> i.e. currently, as a large data provider, we need to visit quite a lot of
>>>> sites like these to fill in and maintain all our dataset metadata.
>>>> This is not sustainable which is why I am happy that the other sites are
>>>> looking into parsing VoID files.
>>>> I hope that the datahub.io can do so as well.
>>>> 
>>>> Regards,
>>>> Jerven
>>>> 
>>>> 
>>>> 
>>>> On Thu, Jul 4, 2013 at 3:32 PM, Mark Wainwright
>>>> <mark.wainwright at okfn.org>wrote:
>>>> 
>>>>> Hmm, I guess the common use case is for metadata that doesn't change
>>>>> every month (address, type, description, licence, etc). What is it
>>>>> you're updating monthly? What specific functionality on the Datahub
>>>>> are you suggesting?
>>>>> 
>>>>> Mark
>>>>> 
>>>>> On 04/07/2013, Jerven Bolleman <me at jerven.eu> wrote:
>>>>>> Hi All,
>>>>>> 
>>>>>> This is a desired feature: it would remove the manual overhead of maintaining the same
>>>>>> dataset information in many different databases of databases.
>>>>>> 
>>>>>> For example, the UniProt SPARQL endpoint has metadata in its service
>>>>>> description, which you can retrieve with:
>>>>>> 
>>>>>> wget --header="Accept: application/rdf+xml" \
>>>>>>      "http://beta.sparql.uniprot.org/"
>>>>>> (Expect major improvements to this output in the coming months)
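[Archive note: the wget call above is plain HTTP content negotiation; the Accept header asks the endpoint for RDF/XML rather than its HTML query form. The equivalent request in Python's stdlib, built but not sent since the endpoint output is still changing:]

```python
# Content negotiation against a SPARQL endpoint: request RDF/XML explicitly.
import urllib.request

req = urllib.request.Request(
    "http://beta.sparql.uniprot.org/",
    headers={"Accept": "application/rdf+xml"},
)
# urllib.request.urlopen(req).read() would return the service description
```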
>>>>>> 
>>>>>> Or the attached void file.
>>>>>> 
>>>>>> Instead of us updating all this information manually every month, we would
>>>>>> rather generate a single VoID file that other tools and lists besides
>>>>>> datahub could use as well.
>>>>>> 
>>>>>> Regards,
>>>>>> Jerven
>>>>>> 
>>>>>> PS. now with gzipped void file.
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Jerven Bolleman
>>>>>> me at jerven.eu
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Business development and user engagement manager
>>>>> The Open Knowledge Foundation
>>>>> Empowering through Open Knowledge
>>>>> http://okfn.org/  |  @okfn  |  http://ckan.org  |  @CKANproject
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Jerven Bolleman
>>>> me at jerven.eu
>>>> 
>>> 
>>> 
>>> -- 
>>> Business development and user engagement manager
>>> The Open Knowledge Foundation
>>> Empowering through Open Knowledge
>>> http://okfn.org/  |  @okfn  |  http://ckan.org  |  @CKANproject
>>> 
>>> _______________________________________________
>>> ckan-discuss mailing list
>>> ckan-discuss at lists.okfn.org
>>> http://lists.okfn.org/mailman/listinfo/ckan-discuss
>>> Unsubscribe: http://lists.okfn.org/mailman/options/ckan-discuss
>>> 
>> 
> 
> 


