[open-science] Chemistry Data Packages

Peter Murray-Rust pm286 at cam.ac.uk
Thu Apr 15 06:53:38 UTC 2010

This is an exciting and important question. It's of great relevance to
schools and society in general (interest and concerns about chemicals).

I have copied in the Blue Obelisk [the Open Data/Source/Standards movement
in chemistry] as I think this is a mixture of technical issues (see below)
and culture. Note that the Blue Obelisk has created enough Open packages and
experience to solve the technical aspects of the problem - it's a question
of what we want to do and the (possibly non-trivial) aspects of distributing
software and services.

On Wed, Apr 14, 2010 at 7:00 PM, Jean-Claude Bradley <
jeanclaude.bradley at gmail.com> wrote:

> I spoke with Jonathan this week about the chemistry data packages on CKAN
> http://tinyurl.com/y5vwzrt
> Are there any plans for being able to do structure (or even substructure)
> searches across all the packages?

I'm very sympathetic to this. There are a variety of possible approaches and
levels of precision. Chemical search is (overly) dominated by the aims of
the pharmaceutical industry, which has developed schemes for searching large
collections (millions or larger) with sophisticated algorithms. This
requires running chemical graph isomorphism software, which must run either
on the server (i.e. each data set must be indexed for each new entry and the
software must be run for each search) or on the client (few people do this
and it requires downloading the data for each search). This is probably out
of scope and probably irrelevant to most users of CKAN. [There could be a
possibility of providing a metasearcher for Open chemistry but it would
require funding or cost recovery.]
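
To make "graph isomorphism" concrete, here is a minimal sketch of a
substructure match written as a naive backtracking search. It assumes
molecules are given as heavy-atom graphs (a list of element symbols plus a set
of bonded index pairs) and it ignores bond orders, hydrogens, aromaticity and
stereochemistry. The name has_substructure and the data layout are
illustrative assumptions, not the API of any existing toolkit; real packages
use far more refined algorithms and pre-computed indexes.

```python
from typing import Dict, List, Set, Tuple

# A molecule is (element symbol per atom, set of bonded atom-index pairs).
Molecule = Tuple[List[str], Set[Tuple[int, int]]]


def adjacency(mol: Molecule) -> Dict[int, Set[int]]:
    """Build a neighbour table from the bond list."""
    atoms, bonds = mol
    adj: Dict[int, Set[int]] = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def has_substructure(query: Molecule, target: Molecule) -> bool:
    """Return True if every atom and bond of query can be mapped into target."""
    q_atoms, t_atoms = query[0], target[0]
    q_adj, t_adj = adjacency(query), adjacency(target)

    def extend(mapping: Dict[int, int]) -> bool:
        if len(mapping) == len(q_atoms):
            return True
        q = len(mapping)  # place query atoms in index order
        used = set(mapping.values())
        for t in range(len(t_atoms)):
            if t in used or t_atoms[t] != q_atoms[q]:
                continue
            # Every already-placed neighbour of q must be bonded to t.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                if extend(mapping):
                    return True
                del mapping[q]
        return False

    return extend({})


# Carboxyl fragment C(O)O searched in acetic acid and in ethanol.
carboxyl: Molecule = (["C", "O", "O"], {(0, 1), (0, 2)})
acetic_acid: Molecule = (["C", "C", "O", "O"], {(0, 1), (1, 2), (1, 3)})
ethanol: Molecule = (["C", "C", "O"], {(0, 1), (1, 2)})

print(has_substructure(carboxyl, acetic_acid))
print(has_substructure(carboxyl, ethanol))
```

Even this toy version has to try many atom assignments for each molecule,
which is why running it over every entry for every query needs either a
server with indexes or a client holding all the data.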

At the other end are constitutional chemical formulae and names. For example
"sodium sulfate" could be Na2SO4 or Na2O4S and searching for the string will
often fail. We have done quite a lot in representing formulae in Chemical
Markup Language and in RDF and we intend that SPARQL endpoints will be
coming on stream through several projects. Similarly text-mining tools (such
as our Open OSCAR and OPSIN) can provide strings which can be matched in
SPARQL endpoints.
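
The Na2SO4 / Na2O4S problem can be illustrated with a small sketch that
reduces a formula string to element counts and then writes it back in one
fixed order. The ordering used here is the Hill convention (C, then H, then
the rest alphabetically; purely alphabetical when there is no carbon), which
is my assumption for the example rather than a statement of what CML does.
The function names are hypothetical, and the parser deliberately handles only
flat formulae with no brackets, charges or hydrate dots.

```python
import re
from collections import Counter

_FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def element_counts(formula: str) -> Counter:
    """Parse a flat formula such as 'Na2SO4' into element counts."""
    if not _FORMULA.fullmatch(formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts: Counter = Counter()
    for symbol, number in _TOKEN.findall(formula):
        counts[symbol] += int(number) if number else 1
    return counts


def hill_formula(counts: Counter) -> str:
    """Write element counts as a string in Hill order."""
    symbols = sorted(counts)
    if "C" in counts:
        rest = [s for s in symbols if s not in ("C", "H")]
        symbols = ["C"] + (["H"] if "H" in counts else []) + rest
    return "".join(s + (str(counts[s]) if counts[s] > 1 else "") for s in symbols)


# Both spellings of sodium sulfate reduce to the same counts and string.
print(element_counts("Na2SO4") == element_counts("Na2O4S"))
print(hill_formula(element_counts("Na2SO4")))
```

Matching on the counts (or on the normalised string) rather than on the raw
text is what makes the two spellings equivalent.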

For chemical identity there is the IUPAC InChI which canonicalises the
formula if you know it exactly (sometimes you do, sometimes you don't). The
InChI for aspirin should be a unique string wherever it is found and this
allows all examples anywhere on the web to be found. I'd certainly suggest
that we inchify all molecules in CKAN collections (it only needs to be done
once). However it becomes more difficult for multicomponent systems (e.g.
copper sulfate, where there may be 0 or 5 water molecules and the InChIs for
these are different).
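
As a rough sketch of why "inchifying" makes identity search cheap: once each
molecule carries its InChI, finding every occurrence is an exact string
lookup. The package names below are hypothetical, and the two InChI strings
are quoted for illustration only; in practice they should be generated by the
InChI software rather than typed in by hand.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Illustrative strings; regenerate with the InChI software before relying on them.
ASPIRIN = "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"


def build_inchi_index(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each InChI to the list of packages that contain it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for package, inchi in records:
        index[inchi.strip()].append(package)
    return dict(index)


# Hypothetical (package name, InChI) pairs harvested once from the packages.
records = [
    ("solubility-data", ASPIRIN),
    ("solubility-data", ETHANOL),
    ("melting-points", ASPIRIN),
]

index = build_inchi_index(records)
print(index.get(ASPIRIN, []))
```

Because the index only changes when a package changes, it can be built once
and served statically. It does not help with the hydrate problem above, since
the anhydrous and hydrated forms have different InChIs and so land under
different keys.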

I think Wikipedia should be an excellent touchstone. See, for example,
http://en.wikipedia.org/wiki/Copper_sulfate. A chemical search system running
over Wikipedia would be an excellent example of how to do this. It would be
fairly easy to index the thousands (not millions) of chemical entries in WP
and provide a chemical search endpoint (e.g. using the Open Babel software).
Again this needs hosting somewhere. Craig James from eMolecules has taken
steps in this direction and may be able to comment.

So in short I think it's a great idea for CKAN. There will need to be a
variety of services (no one type hacks all):
* index all molecules with InChI (for chemical identity). Static and easily
managed through SPARQL
* add CMLFormulae to all molecules. These are normalised and allow for
element counts and matching (e.g. all substances with 1 Cu, 1 S, 4 O and >0
water will find blue copper sulfate). SPARQL should be able to manage this.
* host a substructure search for CKAN. This requires maintenance and server
resources. However it's ideal for the cloud if we have snapshots (and many
of our datasets will be "static"). Index them every so often (weeks, months,
whatever we can afford) and update the cloud. Needs a host and maybe a
* host a name2structure service using OPSIN/OSCAR nameResolver. This can
work out structures from names (2-acetoxy-benzoic acid is the formal name
for aspirin) and there are many LOD sources of synonyms (ChEBI, Pubchem,
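
The element-count query from the second bullet (1 Cu, 1 S, 4 O and more than
zero water) might look like the sketch below. It assumes each substance has
already been reduced to element counts for the salt plus a separate count of
water molecules; the record layout, the names and the matches function are all
hypothetical, and a SPARQL endpoint would express the same filter
declaratively.

```python
from typing import Dict, List, TypedDict


class Substance(TypedDict):
    name: str
    elements: Dict[str, int]  # element counts excluding water of crystallisation
    water: int                # number of water molecules per formula unit


def matches(substance: Substance, wanted: Dict[str, int], min_water: int = 0) -> bool:
    """True if each wanted element has exactly the wanted count and there is enough water."""
    counts = substance["elements"]
    return (
        all(counts.get(el, 0) == n for el, n in wanted.items())
        and substance["water"] >= min_water
    )


# Hypothetical records standing in for entries in a CKAN package.
substances: List[Substance] = [
    {"name": "copper sulfate (anhydrous)", "elements": {"Cu": 1, "S": 1, "O": 4}, "water": 0},
    {"name": "copper sulfate pentahydrate", "elements": {"Cu": 1, "S": 1, "O": 4}, "water": 5},
    {"name": "sodium sulfate", "elements": {"Na": 2, "S": 1, "O": 4}, "water": 0},
]

query = {"Cu": 1, "S": 1, "O": 4}
hits = [s["name"] for s in substances if matches(s, query, min_water=1)]
print(hits)
```

Keeping the water count separate from the salt's own element counts is one way
to let a single query cover or exclude the hydrated forms.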

I suspect that combining forces with Wikipedia should be a good way of
maximising the facilities and impact.

Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge