[open-linguistics] Adding resources in bulk to the datahub

Sebastian Hellmann hellmann at informatik.uni-leipzig.de
Tue Oct 8 18:24:22 UTC 2013


Dear Datahub-discuss and CKAN-Dev list (cc'ing OWLG and wg-coord),

I have some questions regarding http://datahub.io.
Is it possible to add the resources of a dataset in bulk?

Actually, we have a two-part problem:

1. Adding and updating datasets with a large number of resources:
e.g. for http://datahub.io/dataset/olia
there are quite a few resources available at http://purl.org/olia (maybe 
170).
Can we add them all to this one dataset?

For NIF I recently made a very simple vocab to describe resources:
http://persistence.uni-leipzig.org/nlp2rdf/ontologies/dev/misc/resources.ttl
http://persistence.uni-leipzig.org/nlp2rdf/resources.json

Ideally, users could provide and upload such a file (or a similar one), 
and the dataset and its resources would then be updated accordingly.
Does this exist?
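
To illustrate the kind of bulk update meant here, a minimal Python sketch 
follows. It assumes that the CKAN action API call `resource_create` is 
available on the datahub and that the file list is a JSON array of 
objects with `url`, `name` and `format` keys. That layout, the helper 
name `build_requests`, the example.org URLs and the placeholder API key 
are assumptions for illustration only, not the actual structure of 
resources.json. The sketch only builds the requests; sending them is 
left commented out.

```python
import json
import urllib.request

# Assumed endpoint of the CKAN action API on the datahub.
API_URL = "http://datahub.io/api/3/action/resource_create"


def build_requests(file_list_json, dataset_id, api_key):
    """Build one resource_create request per entry of the file list."""
    requests = []
    for entry in json.loads(file_list_json):
        payload = {
            "package_id": dataset_id,  # dataset the resource is attached to
            "url": entry["url"],       # link only, the data is not uploaded
            "name": entry.get("name", entry["url"]),
            "format": entry.get("format", ""),
        }
        req = urllib.request.Request(
            API_URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={
                "Content-Type": "application/json",
                "Authorization": api_key,
            },
        )
        requests.append(req)
    return requests


# Hypothetical file list; the real resources.json may look different.
EXAMPLE = json.dumps([
    {"url": "http://example.org/olia/model-a.owl", "name": "Model A", "format": "OWL"},
    {"url": "http://example.org/olia/model-b.owl", "name": "Model B", "format": "OWL"},
])

if __name__ == "__main__":
    for req in build_requests(EXAMPLE, "olia", "YOUR-API-KEY"):
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
        # urllib.request.urlopen(req)  # uncomment to actually send it
```

With roughly 170 resources this is one HTTP request per file, so whether 
a single call per dataset exists is still part of the question.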

I am asking because we also have a whole lot of other resources here, 
which we cannot add manually any more.
There are 144 in total; some have been added already: 
https://docs.google.com/spreadsheet/ccc?key=0AlMk5ouIspH1dGx1R1Rnd1ZXX0xmLXppSWFrcm0wNFE&usp=drive_web#gid=0

So the core of this first question is: what is the best practice for 
uploading datasets in bulk?

2. Do you have a definition of, or guidelines for, what metadata we 
should have in datahub.io?

So should we add metadata about all individual files? We don't want to 
host the data, just the metadata.

One example of this is probably this:
http://downloads.dbpedia.org/3.9/

It is one dataset, but people also want to download the resources and 
need links to the individual files.
So we would need to keep a file list somewhere. Does anybody know a best 
practice?
Maybe this: http://sw.deri.org/2007/07/sitemapextension/scschema.xsd
or this: http://wiki.dbpedia.org/sitemap
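
As an illustration of the sitemap option, here is a small Python sketch 
that reads the file list out of a sitemap using the extension schema 
linked above. The element names `sc:dataset`, `sc:datasetLabel` and 
`sc:dataDumpLocation` are taken from that extension as far as I know it; 
the example document, its example.org URLs and the function name 
`list_dump_files` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the lookups below.
NS = {
    "sc": "http://sw.deri.org/2007/07/sitemapextension/scschema.xsd",
}

# Hypothetical sitemap; a real one would list the actual dump files.
EXAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd">
  <sc:dataset>
    <sc:datasetLabel>Example dataset</sc:datasetLabel>
    <sc:dataDumpLocation>http://example.org/dumps/part1.nt.bz2</sc:dataDumpLocation>
    <sc:dataDumpLocation>http://example.org/dumps/part2.nt.bz2</sc:dataDumpLocation>
  </sc:dataset>
</urlset>
"""


def list_dump_files(sitemap_xml):
    """Return a dict mapping each dataset label to its dump file URLs."""
    root = ET.fromstring(sitemap_xml.strip())
    files = {}
    for dataset in root.findall("sc:dataset", NS):
        label = dataset.findtext(
            "sc:datasetLabel", default="(unlabelled)", namespaces=NS
        )
        urls = [
            el.text.strip()
            for el in dataset.findall("sc:dataDumpLocation", NS)
            if el.text
        ]
        files.setdefault(label, []).extend(urls)
    return files


if __name__ == "__main__":
    for label, urls in list_dump_files(EXAMPLE_SITEMAP).items():
        print(label)
        for url in urls:
            print("  " + url)
```

A list extracted this way would contain only links to the files, which 
matches the wish to keep the metadata without hosting the data.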

We identified two use cases already:

Use Case 1:
We would need to implement something that allows you to download and 
load the data automatically.
Do you know the LOD2 Stack? 
https://dl.dropboxusercontent.com/u/375401/tmp/dataViaApt-Get.png
It allows you to download and install CKAN datasets into a triple store 
on a Unix system automatically:
http://stack.lod2.eu/blog/wp-content/lod2-stack/attachments/983342/3736389.pdf

Use Case 2:
Creation of a Linguistic Linked Data Cloud. John McCrae and Christian 
Chiarcos implemented a Python program that already produces a custom 
cloud:
http://lists.okfn.org/pipermail/open-linguistics/attachments/20130921/529aa37d/attachment-0001.png
https://github.com/jmccrae/llod-cloud.py

We need a tool to keep the metadata for these two use cases. I really 
hope that CKAN or the datahub can do this. I think that in total we are 
talking about 20,000 datasets (over the next five years).
Well, that depends on how you count them: maybe 5,000 datasets (also 
potentially including LRE Map) and 200k URLs to individual files.

All the best,
Sebastian


-- 
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Events:
* NLP & DBpedia 2013 (http://nlp-dbpedia2013.blogs.aksw.org)
Come to Germany as a PhD student: http://bis.informatik.uni-leipzig.de/csf
Projects: http://nlp2rdf.org , http://linguistics.okfn.org , 
http://dbpedia.org/Wiktionary , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org