[open-linguistics] Adding resources in bulk to the datahub
Sebastian Hellmann
hellmann at informatik.uni-leipzig.de
Tue Oct 8 18:24:22 UTC 2013
Dear Datahub-discuss and CKAN-Dev list (cc'ing OWLG and wg-coord),
I have some questions regarding http://datahub.io
Is it possible to add the resources of a dataset in bulk?
Actually, we have a two-part problem:
1. Adding and updating datasets with a large number of resources:
e.g. for http://datahub.io/dataset/olia
there are quite a few resources available at http://purl.org/olia (maybe
170).
Can we add them all to this one dataset?
For NIF I recently made a very simple vocab to describe resources:
http://persistence.uni-leipzig.org/nlp2rdf/ontologies/dev/misc/resources.ttl
http://persistence.uni-leipzig.org/nlp2rdf/resources.json
Ideally, users could provide and upload such a file (or a similar one),
and the dataset and its resources would then be updated accordingly.
Does this exist?
I am asking because we also have a whole lot of other resources here
that we cannot add manually any more:
144 in total, some of which have been added already:
https://docs.google.com/spreadsheet/ccc?key=0AlMk5ouIspH1dGx1R1Rnd1ZXX0xmLXppSWFrcm0wNFE&usp=drive_web#gid=0
So the core of this first question is: what is the best practice for
uploading datasets and their resources in bulk?
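To make the idea concrete: CKAN's action API provides a resource_create action, so a bulk upload could in principle be scripted against it. The following is only a minimal sketch. The layout of the input list (dicts with a "url" key and optional "name" and "format" keys), the helper names, and the API key are assumptions for illustration; they do not describe the actual structure of the resources.json linked above.

```python
import json
import urllib.request


def build_resource_payloads(dataset_id, entries):
    """Turn file-list entries into payloads for CKAN's resource_create.

    `entries` is assumed (hypothetical layout) to be a list of dicts with a
    'url' key and optional 'name' and 'format' keys.
    """
    payloads = []
    seen = set()
    for entry in entries:
        url = entry["url"]
        if url in seen:  # skip duplicate URLs in the input list
            continue
        seen.add(url)
        payloads.append({
            "package_id": dataset_id,
            "url": url,
            # fall back to the last path segment when no name is given
            "name": entry.get("name", url.rsplit("/", 1)[-1]),
            "format": entry.get("format", ""),
        })
    return payloads


def create_resource(site_url, api_key, payload):
    """POST one payload to the action API (needs network and a valid key)."""
    request = urllib.request.Request(
        site_url.rstrip("/") + "/api/3/action/resource_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": api_key},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling create_resource once per payload would add the resources one by one. Whether this scales to a few hundred resources per dataset, and how updates to already existing resources should be handled, is exactly the open question.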
2. Do you have a definition of, or guidelines for, what metadata we
should have in datahub.io?
So should we add metadata about all individual files? We don't want to
host the data, just the metadata.
One example of this is probably this:
http://downloads.dbpedia.org/3.9/
It is one dataset, but people also want to download the resources and
need links to the individual files.
So we would need to keep a file list somewhere. Does anybody know a best
practice?
Maybe this: http://sw.deri.org/2007/07/sitemapextension/scschema.xsd
or this: http://wiki.dbpedia.org/sitemap
We identified two use cases already:
Use Case 1:
We would need to implement something that allows you to download and
load the data automatically.
Do you know the LOD2 Stack?
https://dl.dropboxusercontent.com/u/375401/tmp/dataViaApt-Get.png
It allows you to download and install CKAN datasets into a triple store
on a Unix system automatically:
http://stack.lod2.eu/blog/wp-content/lod2-stack/attachments/983342/3736389.pdf
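As a rough illustration of what automatic download needs from the metadata: if every individual file is registered as a resource, a client can read the file list from CKAN's package_show action. The sketch below is an assumption-laden example, not a description of how the LOD2 Stack works; the helper names and the format filter are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def resource_urls(package_show_response, wanted_formats=None):
    """Extract resource URLs from a decoded package_show response.

    `wanted_formats` is an optional set of lowercase format names;
    None means that all resources are returned.
    """
    if not package_show_response.get("success"):
        raise ValueError("CKAN reported a failed request")
    urls = []
    for res in package_show_response["result"].get("resources", []):
        fmt = (res.get("format") or "").lower()
        if wanted_formats is not None and fmt not in wanted_formats:
            continue
        if res.get("url"):  # ignore resources without a link
            urls.append(res["url"])
    return urls


def fetch_package(site_url, dataset_id):
    """Fetch dataset metadata via the action API (needs network access)."""
    query = urllib.parse.urlencode({"id": dataset_id})
    endpoint = site_url.rstrip("/") + "/api/3/action/package_show?" + query
    with urllib.request.urlopen(endpoint) as response:
        return json.load(response)
```

The returned URLs could then be handed to a downloader or a triple store loader. This only works if the dataset entry really lists the individual files, which is the point of the question above.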
Use Case 2:
Creation of a Linguistic Linked Data Cloud. John McCrae and Christian
Chiarcos implemented a Python program, which already produces a custom
cloud:
http://lists.okfn.org/pipermail/open-linguistics/attachments/20130921/529aa37d/attachment-0001.png
https://github.com/jmccrae/llod-cloud.py
We need a tool to keep the metadata for these two use cases. I really
hope that CKAN or the datahub can do this. In total, I think we are
talking about 20,000 datasets (over the next five years).
Well, depending on how you count them: maybe 5,000 datasets (also
potentially including the LRE Map) and 200k URLs to individual files.
All the best,
Sebastian
--
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Events:
* NLP & DBpedia 2013 (http://nlp-dbpedia2013.blogs.aksw.org)
Come to Germany as a PhD student: http://bis.informatik.uni-leipzig.de/csf
Projects: http://nlp2rdf.org , http://linguistics.okfn.org ,
http://dbpedia.org/Wiktionary , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org