[open-science] Examples of scientific progress from text mining. Was: Re: text-mining licence exemplar

Maximilian Haeussler maximilianh at gmail.com
Thu Jun 7 11:56:49 UTC 2012

BMC and PLoS cover 1.2% of the post-2000 PubMed content. BMC is mostly
very small journals. I wouldn't call this a "sizable" corpus. For my
application it was nice as a little demo, but not more. (See
http://www.ncbi.nlm.nih.gov/pubmed/21325301). Who cares if you write
software that runs on 1.2% of all recently published papers?

Nevertheless, there are quite a few other papers that start to explore
running on this limited fulltext corpus. You could use the only
comprehensive fulltext literature search engine to find them:

People have mined cell lines, species names, locations, genes and
mutations in fulltext.

These are only the first steps. It takes a while to get text mining
algorithms to run on 100x longer text strings and import these from
weird formats (like PMC's or Elsevier's Consyn XML).
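
To make the import step concrete, here is a minimal sketch of flattening
a fulltext XML article into paragraphs and running a dictionary lookup
over them. Everything in it is an assumption for illustration: the sample
document is a heavily simplified JATS-like layout (not real PMC or Consyn
output), the function names are hypothetical, and the two-entry term
dictionary stands in for the large curated lexicons a real miner needs.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified article in a JATS-like layout.
# Real publisher XML is far messier (namespaces, entities, tables, ...).
SAMPLE_XML = """<article>
  <front><article-meta><title-group>
    <article-title>A toy example</article-title>
  </title-group></article-meta></front>
  <body>
    <sec><title>Results</title>
      <p>We sequenced <italic>Ciona intestinalis</italic> embryos.</p>
      <p>HeLa cells were used as a control.</p>
    </sec>
  </body>
</article>"""


def body_paragraphs(xml_text):
    """Return the plain text of every <p> under <body>, inline markup flattened."""
    root = ET.fromstring(xml_text)
    body = root.find("body")
    if body is None:
        return []
    paragraphs = []
    for p in body.iter("p"):
        # itertext() walks the text inside nested tags such as <italic>.
        text = "".join(p.itertext())
        paragraphs.append(re.sub(r"\s+", " ", text).strip())
    return paragraphs


# Toy dictionary (assumption): term -> entity type.
TERMS = {"Ciona intestinalis": "species", "HeLa": "cell line"}


def tag_terms(paragraphs, terms=TERMS):
    """Return (paragraph index, term, type) for each dictionary term found."""
    hits = []
    for i, para in enumerate(paragraphs):
        for term, kind in terms.items():
            if re.search(r"\b" + re.escape(term) + r"\b", para):
                hits.append((i, term, kind))
    return hits


if __name__ == "__main__":
    for hit in tag_terms(body_paragraphs(SAMPLE_XML)):
        print(hit)
```

Exact dictionary matching like this is only the simplest possible
approach; the hard part in practice is the format handling and the
ambiguity of names in long fulltext, which this sketch does not attempt.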

With internet searching/mining, it's tricky to find a "proof of
concept" before the data is available. Your question is a little bit
like asking for a good "proof of concept" for Google before the WWW
was searchable. I at least would have never imagined stuff like Google
Scholar, Google Maps and Google Images back in 1994.

I don't think that academics can come up with any usable text mining
website for the average user. We'll have to wait for startups to jump
on this, I guess. Or the established text-mining companies.

I imagine that Roche, Pfizer etc. have had concrete success with text
mining from closed-access content. Does anyone have a good link to a
big pharma textmining success story? (They do the PDF harvesting
against the terms of service of the publishers, though, so they might
not want to talk too openly about it.)

Maximilian Haeussler, max at soe.ucsc.edu
mob +1 831 295 0653 office: +1 831 459 5232

On Thu, Jun 7, 2012 at 4:31 AM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:
> On Thu, Jun 7, 2012 at 11:30 AM, Miller, Andrew (ELS-OXF)
> <andrew.miller at elsevier.com> wrote:
>> Thanks for the reply Peter.
>> I was wondering specifically in relation to PLoS/BMC as text-mining proof
>> of concept as surely that corpus is now sizeable enough to demonstrate
>> usefulness and additional discovery.
>> I hear much about people being able to carry around the PLoS corpus on a
>> usb stick in their pocket, less to do with usefulness in relation to
>> discovery.
> Firstly the science has to be published in PLoS or BMC. There is virtually
> no chemistry in either. So it's limited to those fields (though I have
> started discussions on elementary particles with people who have some OA
> material).
> Secondly the scientific result has to be completely contained within that
> corpus, else it cannot be published. Thus if Ross and I study phylogenetics
> we have to believe that we can get enough out of BMC and PLoS alone. Suppose
> we came up with a result and the referees said "There's a study in Molecular
> Phylogenetics which doesn't agree with your findings".
> Paper rejected. Or we have to measure every paper in MolPhy with a ruler
> because we are forbidden - today - to mine it.
> So the studies are limited to:
> (A) technology. Ross and I are developing general methods
> (B) proof of concept - where there is enough to show that there is enough
> data to draw some conclusion. But that's very weak
> (C) metadata and resources. I can probably get a list of hominids from BMC
> without violating Elsevier's rules, because they are high interest and there
> aren't huge numbers. That list might be of interest.
> So what else did you expect? If publishers forbid mining of 90% of the
> literature, it's impossible to do good mining-based science.
> Just don't use this as evidence that no-one wants content-mining as the
> #commTollPub industry repeatedly does.
> P.
>> Thanks again and best wishes
>> Andrew
>> From: Peter Murray-Rust [mailto:pm286 at cam.ac.uk]
>> Sent: Wednesday, June 06, 2012 06:18 PM
>> To: open-science at lists.okfn.org <open-science at lists.okfn.org>
>> Subject: Re: [open-science] Examples of scientific progress from text
>> mining. Was: Re: text-mining licence exemplar
>> On Wed, Jun 6, 2012 at 5:29 PM, Robert Muetzelfeldt
>> <r.muetzelfeldt at ed.ac.uk> wrote:
>>> On 06/06/12 15:29, Miller, Andrew (ELS-OXF) wrote:
>>>> Hi
>>>> On text-mining theme more generally: could someone please point me to
>>>> reliable examples of scientific progress made via text-mining of
>>>> open-access STM corpus?
>>>> Andrew
>> I would distinguish at least:
>> * the legal ability to mine a collection of documents
>> * the legal ability to redistribute those documents (a) in original form
>> (b) annotated
>> Modern high-quality textmining requires the open availability of a corpus
>> so that the techniques can be validated by the community. In general this
>> requires one or more of:
>> * access to CC-BY or CC0 material
>> * permission from a "rights-owner" or their representative to (a) mine and
>> (b) redistribute the results.
>> There are very few cases where these apply:
>> * public domain material such as patents. We have extracted 500,000
>> reactions out of patents to very high quality
>> * CC-BY publishers. In practice this means BMC or PLoS. We have also mined
>> the abstracts and full-text of Atmospheric Chemistry and Physics since it is
>> Open Access CC-BY
>> I gave an invited lecture at LBM2011 last year and it was generally agreed
>> that the lack of material allowed by the publishers was a major barrier to
>> modern text-mining. There are cases where the researchers have carried out
>> textmining without redistributing the annotated material because they are
>> not allowed to.
>> So my guess is fragmented studies in bioscience and a few other
>> disciplines.
>> So there are relatively few useful examples because of the legal and
>> contractual restrictions. That does not mean there is no demand. Elsevier
>> has granted only 20 permissions in five years. That is nowhere near enough
>> to be useful. I have no idea whether any of these allow open republication
>> of the results. If they don't I wouldn't call it useful science.
>> --
>> Peter Murray-Rust
>> Reader in Molecular Informatics
>> Unilever Centre, Dep. Of Chemistry
>> University of Cambridge
>> CB2 1EW, UK
>> +44-1223-763069
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069