[datahub-discuss] Using Datahub for scientific data and metadata

Peter Murray-Rust pm286 at cam.ac.uk
Fri Dec 6 11:10:58 UTC 2013


[repost to list]

I have become excited about the possibility of using the Datahub for
repositing Open scientific information [1] and have started to prototype
my application.

Simply, I am going to extract facts from the scientific literature and
store them in Datahub. Some facts will be name-value (String) pairs (e.g.
species), others will be structured as XML blobs (e.g. molecules).
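
A minimal sketch of how the name-value facts might be packaged, assuming
they are sent as the "extras" list (key/value entries) of a CKAN Action API
package_create request. The class FactPayload, the dataset name and the
example fact are hypothetical illustrations, not part of any CKAN client
library, and the field layout should be checked against the CKAN version
the Datahub runs. XML blobs would more likely be uploaded as separate
resources, which this sketch does not cover.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds a JSON request body by hand (no external JSON library).
final class FactPayload {

    // Escape a string so it can sit inside a JSON string literal.
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    // Assumed layout: {"name": ..., "extras": [{"key": ..., "value": ...}, ...]}
    static String toJson(String datasetName, Map<String, String> facts) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"name\":\"").append(jsonEscape(datasetName)).append("\",\"extras\":[");
        boolean first = true;
        for (Map.Entry<String, String> e : facts.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            sb.append("{\"key\":\"").append(jsonEscape(e.getKey()))
              .append("\",\"value\":\"").append(jsonEscape(e.getValue())).append("\"}");
        }
        sb.append("]}");
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> facts = new LinkedHashMap<>();
        facts.put("species", "Panthera leo"); // illustrative fact, not real extracted data
        System.out.println(toJson("paper-0001", facts));
    }
}
```

The body would then be POSTed with an API key in the Authorization header;
the HTTP step is left out so the sketch runs without network access.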

I intend to search each publisher daily (using cron) and index about 150
papers into metadata and XML. I don't think the initial byte sizes will
cripple the Datahub, but I will back off if they do. In the first instance
the daily trawl of 150 papers will generate about 100 KB of XML and 1000
metadata tags (name-value strings) per day.
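
One way an automated job could "back off" is to lengthen the pause between
upload attempts whenever the server appears to struggle. The sketch below
shows a capped exponential delay; the class UploadBackoff and the base and
cap values are hypothetical choices for illustration, not Datahub
recommendations.

```java
// Hypothetical helper: computes how long to wait before retrying an upload.
final class UploadBackoff {

    // Delay before retry number `attempt` (0-based): baseMillis * 2^attempt,
    // never exceeding capMillis.
    static long delayMillis(int attempt, long baseMillis, long capMillis) {
        if (attempt < 0 || baseMillis <= 0 || capMillis <= 0) {
            throw new IllegalArgumentException("arguments must be positive (attempt >= 0)");
        }
        // Compare against the shifted cap first so the multiplication cannot overflow.
        if (attempt >= 30 || baseMillis > (capMillis >> attempt)) {
            return capMillis;
        }
        return baseMillis << attempt;
    }

    public static void main(String[] args) {
        // Assumed values: start at 1 second, never wait longer than 60 seconds.
        for (int attempt = 0; attempt < 8; attempt++) {
            System.out.println("attempt " + attempt + ": wait "
                    + delayMillis(attempt, 1000L, 60000L) + " ms");
        }
    }
}
```

The cron job would sleep for the returned interval after a failed or slow
upload and reset the attempt counter after a success.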

I use Java and have the following questions:
* Is the Java API still current for CKAN/Datahub? (I think Ross J wrote it
and have copied him.)
* Are there any known issues in what I propose (uploading 150 * 0.5 KB
files per day on an automatic basis)?

Mark Wainwright and I had an initial problem where resetting the metadata
caused the data to be deleted - slightly embarrassing, since a reporter from
Nature was looking into the repository and couldn't find anything in it. Can
the repo be reset if anything goes wrong?

Hope this makes sense and thanks


P
If you are interested in background, read http://blogs.ch.cam.ac.uk/pmr,
https://vimeo.com/78353557 (5 mins) and
http://www.slideshare.net/petermurrayrust/the-content-mine-presented-at-uksg (slides).

[1] Rufus has suggested this for the last 10 years...and reality and vision
have coalesced.


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069