[open-science] Examples of scientific progress from text mining. Was: Re: text-mining licence exemplar

Peter Murray-Rust pm286 at cam.ac.uk
Thu Jun 7 11:31:00 UTC 2012


On Thu, Jun 7, 2012 at 11:30 AM, Miller, Andrew (ELS-OXF) <
andrew.miller at elsevier.com> wrote:

> Thanks for the reply Peter.
>
> I was wondering specifically in relation to PLoS/BMC as text-mining proof
> of concept as surely that corpus is now sizeable enough to demonstrate
> usefulness and additional discovery.
>
> I hear much about people being able to carry around the PLoS corpus on a
> usb stick in their pocket, less to do with usefulness in relation to
> discovery.
>

Firstly the science has to be published in PLoS or BMC. There is virtually
no chemistry in either. So it's limited to those fields (though I have
started discussions on elementary particles with people who have some OA
material).

Secondly the scientific result has to be completely contained within that
corpus, else it cannot be published. Thus if Ross and I study phylogenetics
we have to believe that we can get enough out of BMC and PLoS alone.
Suppose we came up with a result and the referees said "There's a study in
Molecular Phylogenetics which doesn't agree with your findings".
Paper rejected. Or we have to measure every paper in MolPhy with a ruler
because we are forbidden - today - to mine it.

So the studies are limited to:
(A) technology. Ross and I are developing general methods
(B) proof of concept - where the open corpus holds enough data to draw some
conclusion. But that's very weak
(C) metadata and resources. I can probably get a list of hominids from BMC
without violating Elsevier's rules, because they are high interest and
there aren't huge numbers. That list might be of interest.

So what else did you expect? If publishers forbid mining of 90% of the
literature, it's impossible to do good mining-based science.

Just don't use this as evidence that no-one wants content-mining, as the
#commTollPub industry repeatedly does.

P.






>
> Thanks again and best wishes
>
> Andrew
>
>  *From*: Peter Murray-Rust [mailto:pm286 at cam.ac.uk]
> *Sent*: Wednesday, June 06, 2012 06:18 PM
> *To*: open-science at lists.okfn.org <open-science at lists.okfn.org>
> *Subject*: Re: [open-science] Examples of scientific progress from text
> mining. Was: Re: text-mining licence exemplar
>
>
>
> On Wed, Jun 6, 2012 at 5:29 PM, Robert Muetzelfeldt <
> r.muetzelfeldt at ed.ac.uk> wrote:
>
>> On 06/06/12 15:29, Miller, Andrew (ELS-OXF) wrote:
>>
>>> Hi
>>>
>>> On text-mining theme more generally: could someone please point me to
>>> reliable examples of scientific progress made via text-mining of
>>> open-access STM corpus?
>>>
>>> Andrew
>>>
>>>
> I would distinguish at least:
> * the legal ability to mine a collection of documents
> * the legal ability to redistribute those documents (a) in original form
> (b) annotated
>
> Modern high-quality text-mining requires the open availability of a corpus
> so that the techniques can be validated by the community. In general this
> requires one or more of:
> * access to CC-BY or CC0 material
> * permission from a "rights-owner" or their representative to (a) mine and
> (b) redistribute the results.
>
> There are very few cases where these apply:
> * public domain material such as patents. We have extracted 500,000
> reactions out of patents to very high quality
> * CC-BY publishers. In practice this means BMC or PLoS. We have also mined
> the abstracts and full-text of Atmospheric Chemistry and Physics since it
> is Open Access CC-BY
>
> I gave an invited lecture at LBM2011 last year and it was generally agreed
> that the lack of material allowed by the publishers was a major barrier to
> modern text-mining. There are cases where the researchers have carried out
> text-mining without redistributing the annotated material because they are
> not allowed to.
>
> So my guess is fragmented studies in bioscience and a few other
> disciplines.
>
> So there are relatively few useful examples because of the legal and
> contractual restrictions. That does not mean there is no demand. Elsevier
> has granted only 20 permissions in five years. That is nowhere near enough
> to be useful. I have no idea whether any of these allow open republication
> of the results. If they don't I wouldn't call it useful science.
>
>
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
>
>
>
>


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069