[open-science] Why are we leaving it to Google?

Peter Murray-Rust pm286 at cam.ac.uk
Tue Sep 11 17:44:48 UTC 2018


The following is a bit ad hoc, but let's list the reasons that an Open Data
Search is *better*. Let's assume that there are hundreds of millions of
data sets which could come into this. And assume that we have the power to
index any of them and create additional metadata and derived data. Let's also
assume that Google or any of the major search engines is not intricately
involved with the details of the data, and isn't interested in
per-discipline searches. So it will index using the given metadata, the
associated fulltext, and maybe some name-value pairs. But not chemical
formulae (which I will take as an example and perhaps add others).

My belief is that a given domain understands its data well and that in many
instances has sophisticated tools for indexing data. I see this in
astronomy, particle physics, chemistry, crystallography, molecular and cell
biology, spectroscopy, GIS, clinical trials, etc. These disciplines are
massively enhanced by using domain-specific tools. For example, 50 million
chemical structures have been published and can all, in principle, be
indexed. I cannot ever see Google doing this while humans are in control.
So scientific domains should take control of their searching - I don't know
whether they will. That will also mean they take control of their
ontologies. Again molecular biology and crystallography are the ones I know
best, but they have done massive work in creating metadata resources such
as CIF (crystallography) and Gene Ontology. CIF data occurs in
supplemental info (probably the commonest form of "dataset") and IMO this
would be a great place to start. (I am developing software that reads
chemical and crystallographic suppinfo.) Similarly, if there are micrographs
or sequence alignments or phylogenetic trees or ... these can be searched -
initially at caption level and later at image level itself.
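
To make the chemical-formula point concrete, here is a minimal Python sketch.
It is not taken from any existing tool: the names hill_key and build_index and
the dataset ids are hypothetical, and it assumes simple formulae with no
brackets, charges or hydrates. It normalises each formula to Hill order
(carbon, then hydrogen, then the rest alphabetically; fully alphabetical if
there is no carbon), so that different spellings of the same composition share
one index key, which a plain text index would not give you.

```python
import re
from collections import Counter, defaultdict

# One element symbol followed by an optional count, e.g. "C2", "H", "Cl".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def hill_key(formula: str) -> str:
    """Return the Hill-order form of a simple (bracket-free) formula."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts = Counter()
    for symbol, digits in _TOKEN.findall(formula):
        counts[symbol] += int(digits) if digits else 1
    if "C" in counts:
        # Carbon first, hydrogen second, everything else alphabetically.
        rest = sorted(s for s in counts if s not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        # No carbon: strictly alphabetical, hydrogen included.
        order = sorted(counts)
    return "".join(f"{s}{counts[s] if counts[s] != 1 else ''}" for s in order)


def build_index(records):
    """Map each Hill key to the ids of the datasets that mention it.

    `records` is an iterable of (dataset_id, formula) pairs; both are
    illustrative placeholders, not a real data model.
    """
    index = defaultdict(list)
    for dataset_id, formula in records:
        index[hill_key(formula)].append(dataset_id)
    return dict(index)


if __name__ == "__main__":
    demo = [("ds-1", "C2H6O"), ("ds-2", "CH3CH2OH"), ("ds-3", "NaCl")]
    index = build_index(demo)
    # A query is normalised the same way before lookup.
    print(index.get(hill_key("H6C2O"), []))
```

A formula key only captures composition, not structure (ethanol and dimethyl
ether share C2H6O), so a real chemical index would need structure-aware
identifiers as well; the point here is only that the domain knows how to
normalise its own data.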

** So using the domain-specific data means our metadata is better **

Then there is the choice of source. What does Google search cover? We don't
know. Does it cover the annual reports of research institutions (e.g.
CGIAR)?  UNESCO reports? Judgments from the European Court? If we leave it
to Google, Google decides **what** is important. It must be up to us to
decide that. The world is split into knowledge haves and have-nots based on
the Megatechs' corporate philosophy - we have to support the whole world.
Search engines shape world politics.

** We are in charge of what to index **

Then the delivery. Google has not, and never will have, an API that is Open
and credible. The other search engines are probably worse. So we must be
in charge of what is delivered and how.

** We are in charge of the API **

Then re-use and cleaning and mixing. The data we get from a search engine
will doubtless be reused by a machine. It will need validating and cleaning
and annotating. This improvement can be available to future users. As data
goes through our system it gets better. Format conversions, validation,
mixing with other sources ...

** Our system makes the data itself better **

That's probably enough for now - please add. We probably won't win on the
ethical front, but we can certainly win on the quality and usability.


P.


-- 
Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069