[datahub-discuss] Using Datahub for scientific data and metadata

Peter Murray-Rust pm286 at cam.ac.uk
Fri Dec 6 11:10:58 UTC 2013

[repost to list]

I have become excited about the possibility of using the Datahub for
repositing Open scientific information [1] and have started to prototype
my application.

Simply, I am going to extract facts from the scientific literature and
store them in Datahub. Some facts will be name-value (String) pairs (e.g.
species), others will be structured as XML blobs (e.g. molecules).
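
As a rough illustration of the two kinds of fact, here is a minimal Java sketch. The class name PaperFacts, its methods, the paper identifier and the sample values are all hypothetical, made up for illustration; none of this is part of any existing Datahub or CKAN API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical container for the facts extracted from one paper.
// Not a Datahub/CKAN class; all names here are illustrative only.
public class PaperFacts {
    private final String paperId;
    // Simple name-value (String) facts, e.g. species.
    private final Map<String, String> tags = new LinkedHashMap<>();
    // Structured facts kept as serialized XML, e.g. molecules.
    private final Map<String, String> xmlBlobs = new LinkedHashMap<>();

    public PaperFacts(String paperId) {
        this.paperId = paperId;
    }

    public void addTag(String name, String value) {
        tags.put(name, value);
    }

    public void addXml(String name, String xml) {
        xmlBlobs.put(name, xml);
    }

    public String getPaperId() {
        return paperId;
    }

    public Map<String, String> getTags() {
        return tags;
    }

    public Map<String, String> getXmlBlobs() {
        return xmlBlobs;
    }

    public static void main(String[] args) {
        PaperFacts facts = new PaperFacts("example-paper-1"); // made-up identifier
        facts.addTag("species", "Panthera leo");               // illustrative value
        facts.addXml("molecule", "<molecule id=\"m1\"/>");     // illustrative XML
        System.out.println(facts.getTags());
        System.out.println(facts.getXmlBlobs());
    }
}
```

A plain map keeps only one value per name, so a paper that mentions several species would need a list-valued variant of this sketch.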

I intend to search each publisher daily (using cron) and index about 150
papers into metadata and XML. I don't think the initial byte sizes will
cripple the Datahub but I will back off if it does. In the first instance
the daily trawl of 150 papers will generate about 100 KB of XML and 1000
metadata tags (name-value strings) per day.
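
To put those figures in perspective, here is a back-of-envelope estimate in Java. The daily numbers are the ones quoted above; the class name DailyLoadEstimate is hypothetical, and the sketch assumes the 150 papers are the total per day across all publishers and that the rate stays constant over a 365-day year.

```java
// Back-of-envelope load estimate. The daily figures come from the text above;
// the class name and the constant-rate assumption are illustrative only.
public class DailyLoadEstimate {
    static final int PAPERS_PER_DAY = 150;
    static final int XML_KB_PER_DAY = 100;
    static final int TAGS_PER_DAY = 1000;
    static final int DAYS_PER_YEAR = 365;

    // Average XML payload per paper, in kilobytes.
    static double xmlKbPerPaper() {
        return (double) XML_KB_PER_DAY / PAPERS_PER_DAY;
    }

    // Total XML per year, in kilobytes, assuming a constant daily rate.
    static long xmlKbPerYear() {
        return (long) XML_KB_PER_DAY * DAYS_PER_YEAR;
    }

    // Total metadata tags per year, assuming a constant daily rate.
    static long tagsPerYear() {
        return (long) TAGS_PER_DAY * DAYS_PER_YEAR;
    }

    public static void main(String[] args) {
        System.out.printf("XML per paper: %.2f KB%n", xmlKbPerPaper());
        System.out.printf("XML per year:  %d KB%n", xmlKbPerYear());
        System.out.printf("Tags per year: %d%n", tagsPerYear());
    }
}
```

On these assumptions a year of trawling comes to 36,500 KB (roughly 36.5 MB) of XML and 365,000 tags.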

I use Java and have the following questions:
* Is the Java API still current for CKAN/Datahub (I think Ross J wrote it,
and I have copied him in)?
* Are there any known issues with what I propose (uploading 150 * 0.5 Kbyte
files/day on an automatic basis)?

Mark Wainwright and I had an initial problem where resetting the metadata
caused the data to be deleted - slightly embarrassing, since a reporter from
Nature was looking into the repository and couldn't find anything in it. Can
the repo be reset if anything goes wrong?

Hope this makes sense and thanks

If you are interested in background, read http://blogs.ch.cam.ac.uk/pmr,
https://vimeo.com/78353557 (5 mins) and

[1] Rufus has suggested this for the last 10 years...and reality and vision
have coalesced.

Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge