[open-science] Why are we leaving it to Google?

Peter Kraker pkraker at openknowledgemaps.org
Thu Sep 13 18:28:24 UTC 2018


Thanks, Peter. That's a great roundup of advantages. Here is what I
would add:

Discovery is one of the most under-served processes of the research
life-cycle. Innovation in traditional academic literature search engines
such as Google Scholar is almost non-existent - they all work as they
did 15 years ago. However, the amount of scientific knowledge has
doubled since then, and search engines can no longer cope with the
amount of resources. Try, for example, to find relevant papers on "academic
literature search engines" on Google Scholar:
https://scholar.google.at/scholar?q=academic+literature+search+engines

By leaving dataset search to a proprietary service, we risk that the
same thing will happen to datasets. An open index, on the other hand,
will enable a multitude of services building on top of each other,
similar to the many services that build upon the PubMed corpus. This
will not only improve researchers' capability to discover relevant
datasets from the get-go, but also ensure that innovation will happen in an
ever-evolving ecosystem of apps and services.

** Our system improves discovery itself **

Best,
Peter

On 11/09/2018 19:44, Peter Murray-Rust wrote:
> The following is a bit ad hoc, but let's list the reasons that an Open
> Data Search is *better*. Let's assume that there are hundreds of
> millions of data sets which could come into this. And assume that we
> have the power to index any of them and create additional metadata and derived
> data. Let's also assume that Google or any of the major search engines
> is not intricately involved with the details of the data, and isn't
> interested in per-discipline searches. So it will index using the
> given metadata, the associated fulltext, and maybe some name-value
> pairs. But not chemical formulae (which I will take as an example and
> perhaps add others).
>
> My belief is that a given domain understands its data well and that in
> many instances has sophisticated tools for indexing data. I see this
> in astronomy, particle physics, chemistry, crystallography, molecular
> and cell biology, spectroscopy, GIS, clinical trials, etc. These
> disciplines are massively enhanced by using domain-specific tools. For
> example 50 million chemical structures have been published and can
> all, in principle, be indexed. I cannot ever see Google doing this
> while humans are in control. So scientific domains should take control
> of their searching - I don't know whether they will. That will also
> mean they take control of their ontologies. Again molecular biology
> and crystallography are the ones I know best, but they have done
> massive work in creating metadata resources such as CIF
> (crystallography) and Gene Ontology. CIF data occurs in supplemental
> info (probably the commonest form of "dataset") and IMO this would be
> a great place to start. (I am developing software that reads chemical
> and crystallographic suppinfo). Similarly if there are micrographs or
> sequence alignments or phylogenetic trees or ... these can be searched
> - initially at caption level and later at image level itself.
>
> ** So using the domain-specific data means our metadata is better **
>
> Then there is the choice of source. What does Google search cover? We
> don't know. Does it cover the annual reports of research institutions
> (e.g. CGIAR)? UNESCO reports? Judgments from the European Court? If
> we leave it to Google, Google decides **what** is important. It must
> be up to us to decide that. The world is split into knowledge haves
> and have-nots based on the Megatechs' corporate philosophy - we have to
> support the whole world. Search engines shape world politics.
>
> ** We are in charge of what to index **
>
> Then the delivery. Google has not, and never will have, an API that is
> Open and credible. The other search engines are probably worse. So
> we must be in charge of what is delivered and how.
>
> ** We are in charge of the API **
>
> Then re-use and cleaning and mixing. The data we get from a search
> engine will doubtless be reused by a machine. It will need validating
> and cleaning and annotating. This improvement can be available to
> future users. As data goes through our system it gets better. Format
> conversions, validation, mixing with other sources ...
>
> ** Our system makes the data itself better **
>
> That's probably enough for now - please add. We probably won't win on
> the ethical front, but we can certainly win on the quality and usability.
>
>
> P.
>
>
> -- 
> Peter Murray-Rust
> Reader Emeritus in Molecular Informatics
> Unilever Centre, Dept. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069