[open-science] Examples of scientific progress from text mining. Was: Re: text-mining licence exemplar

Peter Murray-Rust pm286 at cam.ac.uk
Wed Jun 6 17:18:18 UTC 2012


On Wed, Jun 6, 2012 at 5:29 PM, Robert Muetzelfeldt <r.muetzelfeldt at ed.ac.uk> wrote:

> On 06/06/12 15:29, Miller, Andrew (ELS-OXF) wrote:
>
>> Hi
>>
>> On text-mining theme more generally: could someone please point me to
>> reliable examples of scientific progress made via text-mining of
>> open-access STM corpus?
>>
>> Andrew
>>
>>
I would distinguish at least:
* the legal ability to mine a collection of documents
* the legal ability to redistribute those documents (a) in original form
(b) annotated

Modern high-quality text-mining requires the open availability of a corpus
so that the techniques can be validated by the community. In general this
requires one or more of:
* access to CC-BY or CC0 material
* permission from a "rights-owner" or their representative to (a) mine and
(b) redistribute the results.

There are very few cases where these apply:
* public domain material such as patents. We have extracted 500,000
reactions from patents to very high quality
* CC-BY publishers. In practice this means BMC or PLoS. We have also mined
the abstracts and full text of Atmospheric Chemistry and Physics since it
is Open Access CC-BY

I gave an invited lecture at LBM2011 last year and it was generally agreed
that the lack of material allowed by the publishers was a major barrier to
modern text-mining. There are cases where the researchers have carried out
text-mining without redistributing the annotated material because they are
not allowed to.

So my guess is that the examples are fragmented studies in bioscience and a
few other disciplines.

So there are relatively few useful examples because of the legal and
contractual restrictions. That does not mean there is no demand. Elsevier
has granted only 20 permissions in five years. That is nowhere near enough
to be useful. I have no idea whether any of these allow open republication
of the results. If they don't, I wouldn't call it useful science.



-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069