[open-science] Examples of scientific progress from text mining. Was: Re: text-mining licence exemplar

Miller, Andrew (ELS-OXF) andrew.miller at elsevier.com
Thu Jun 7 10:30:56 UTC 2012

Thanks for the reply Peter. 

I was wondering specifically about PLoS/BMC as a text-mining proof of concept, since that corpus is surely now sizeable enough to demonstrate usefulness and additional discovery.

I hear much about people being able to carry around the PLoS corpus on a USB stick in their pocket, but less about its usefulness in relation to discovery.

Thanks again and best wishes


From: Peter Murray-Rust [mailto:pm286 at cam.ac.uk] 
Sent: Wednesday, June 06, 2012 06:18 PM
To: open-science at lists.okfn.org <open-science at lists.okfn.org> 
Subject: Re: [open-science] Examples of scientific progress from text mining. Was: Re: text-mining licence exemplar 

On Wed, Jun 6, 2012 at 5:29 PM, Robert Muetzelfeldt <r.muetzelfeldt at ed.ac.uk> wrote:

	On 06/06/12 15:29, Miller, Andrew (ELS-OXF) wrote:

		On text-mining theme more generally: could someone please point me to
		reliable examples of scientific progress made via text-mining of
		open-access STM corpus?

I would distinguish at least:
* the legal ability to mine a collection of documents
* the legal ability to redistribute those documents (a) in original form (b) annotated

Modern high-quality text-mining requires the open availability of a corpus so that the techniques can be validated by the community. In general this requires one or more of:
* access to CC-BY or CC0 material
* permission from a "rights-owner" or their representative to (a) mine and (b) redistribute the results.

There are very few cases where these apply:
* public domain material such as patents. We have extracted 500,000 reactions out of patents to very high quality.
* CC-BY publishers. In practice this means BMC or PLoS. We have also mined the abstracts and full text of Atmospheric Chemistry and Physics, since it is Open Access CC-BY.

I gave an invited lecture at LBM2011 last year and it was generally agreed that the lack of material allowed by the publishers was a major barrier to modern text-mining. There are cases where the researchers have carried out text-mining without redistributing the annotated material because they are not allowed to.

So my guess is fragmented studies in bioscience and a few other disciplines.

So there are relatively few useful examples because of the legal and contractual restrictions. That does not mean there is no demand. Elsevier has granted only 20 permissions in five years. That is nowhere near enough to be useful. I have no idea whether any of these allow open republication of the results. If they don't, I wouldn't call it useful science.


Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge

Elsevier Limited. Registered Office: The Boulevard, Langford Lane, Kidlington, Oxford, OX5 1GB, United Kingdom, Registration No. 1982084 (England and Wales).
