[open-science] Examples of scientific progress from text mining. Was: Re: text-mining licence exemplar

Peter Murray-Rust pm286 at cam.ac.uk
Thu Jun 7 12:32:32 UTC 2012

On Thu, Jun 7, 2012 at 12:56 PM, Maximilian Haeussler <maximilianh at gmail.com> wrote:

> BMC and PloS cover 1.2% of the post-2000 PubMed content. BMC is mostly
> very small journals. I wouldn't call this a "sizable" corpus. For my
> application it was nice as a little demo, but not more. (See
> http://www.ncbi.nlm.nih.gov/pubmed/21325301). Who cares if you write
> software that runs on 1.2% of all recently published papers?

Exactly right. I think it's slightly different with Ross's application.

> Nevertheless, there are quite a few other papers that start to explore
> running on this limited fulltext corpus. You could use the only
> comprehensive fulltext literature search engine to find them:
> http://scholar.google.com/scholar?q=pubmedcentral+text+mining&btnG=&hl=en&as_sdt=0%2C5&as_ylo=2011
> People have mined cell lines, species names, locations, genes and
> mutations in fulltext.

Entity extraction is almost certainly the most common first step. "What
species does this paper mention?" Our own OSCAR4 will tell you what
chemicals are mentioned.
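
A minimal sketch of what dictionary-based entity extraction looks like, in
Python. This is not how OSCAR4 works internally; the lexicon, the entity
types and the function name are all hypothetical, chosen only to illustrate
the "what does this paper mention?" step.

```python
import re

# Hypothetical toy lexicon. A real tool would use curated dictionaries
# or a trained recogniser rather than a handful of hard-coded terms.
LEXICON = {
    "species": ["penguin", "polar bear"],
    "chemical": ["ethanol", "benzene"],
}

def extract_entities(text, lexicon=LEXICON):
    """Return (offset, entity type, surface form) for every lexicon hit."""
    hits = []
    for entity_type, terms in lexicon.items():
        for term in terms:
            # Whole-word match, case-insensitive, tolerating a plural "s".
            pattern = re.compile(r"\b" + re.escape(term) + r"s?\b", re.IGNORECASE)
            for match in pattern.finditer(text):
                hits.append((match.start(), entity_type, match.group(0)))
    return sorted(hits)

sample = "Penguins were fixed in ethanol; no benzene was used."
entities = extract_entities(sample)
for offset, entity_type, surface in entities:
    print(offset, entity_type, surface)
```

Note that "benzene" is reported even though the sentence says none was used.
Extraction only tells you what is mentioned, not what is asserted.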

Co-occurrence may be next: which species co-occur with which geomarkers in
the text? But that does not mean that you can easily draw conclusions.
Negation is a serious problem. "Penguins do not occur in Greenland" still
puts penguins and Greenland in the same sentence.
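
A rough sketch of sentence-level co-occurrence counting with a crude
negation check, in Python. The term lists, the cue words and the sentence
splitter are assumptions made for illustration. Real negation detection
needs to know the scope of the negation, which this does not attempt.

```python
import re
from collections import Counter

# Hypothetical term lists, for illustration only.
SPECIES = ["penguin", "polar bear"]
PLACES = ["greenland", "antarctica"]

# Very crude cue list: any cue anywhere in the sentence marks it as negated.
NEGATION_CUES = re.compile(r"\b(?:not|no|never|absent)\b", re.IGNORECASE)

def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def cooccurrences(text):
    """Count (species, place) pairs per sentence, kept apart by negation."""
    asserted, negated = Counter(), Counter()
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        target = negated if NEGATION_CUES.search(sentence) else asserted
        for species in SPECIES:
            if species not in lowered:
                continue
            for place in PLACES:
                if place in lowered:
                    target[(species, place)] += 1
    return asserted, negated

text = "Penguins do not occur in Greenland. Penguins breed in Antarctica."
asserted, negated = cooccurrences(text)
print("asserted:", dict(asserted))
print("negated:", dict(negated))
```

Without the negation check both pairs would land in the same bucket, and a
naive reader of the counts would conclude that penguins occur in Greenland.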

> These are only the first steps. It takes a while to get text mining
> algorithms to run on 100x longer text strings and import these from
> weird formats (like PMC's or Elsevier's Consyn XML).

That's the other thing. Every source has its own approach. If everyone used
NLM-DTD that could help. It's no use Elsevier offering me their particular
version of text - they are only one of 100 publishers.
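
To make the point concrete, here is a small Python sketch that pulls
paragraph text out of an NLM-style article. The sample document is made up,
and only a handful of element names (article, body, sec, p) are used. The
same function would serve every publisher that delivers this structure,
whereas each proprietary format needs its own importer.

```python
import xml.etree.ElementTree as ET

# Invented sample in the general shape of an NLM DTD article.
NLM_SAMPLE = """<article>
  <front><article-meta><title-group>
    <article-title>Example article</article-title>
  </title-group></article-meta></front>
  <body>
    <sec><title>Results</title>
      <p>Penguins do not occur in <italic>Greenland</italic>.</p>
    </sec>
  </body>
</article>"""

def paragraphs_from_nlm(xml_text):
    """Return the plain text of each <p> under <body>, markup stripped."""
    root = ET.fromstring(xml_text)
    body = root.find("body")
    if body is None:
        return []
    # itertext() flattens inline markup such as <italic>;
    # split/join collapses runs of whitespace.
    return [" ".join("".join(p.itertext()).split()) for p in body.iter("p")]

paragraphs = paragraphs_from_nlm(NLM_SAMPLE)
print(paragraphs)
```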

> With internet searching/mining, it's tricky to find a "proof of
> concept" before the data is available. Your question is a little bit
> like asking for a good "proof of concept" for Google before the WWW
> was searchable. I at least would have never imagined stuff like Google
> Scholar, Google Maps and Google Images back in 1994.

Exactly. I have been trying for 20 years to get people interested in
semantic chemistry. I have developed all the tools. But with the universal
refusal of publishers to allow content mining there isn't much point, is
there?

> I don't think that academics can come up with any usable text mining
> website for the average user. We'll have to wait for startups to jump
> on this, I guess. Or the established text-mining companies.

Yes - but they will have their own agendas - and their own subdisciplines.
That means we have to consume what they give us. Elsevier tells us we can
use their APIs - whatever those are - but they aren't developed in response
to need or innovation.

> I imagine that Roche, Pfizer etc have had concrete success with text
> mining from closed-access content. Does anyone have a good link to a
> big pharma textmining success story? (They do the PDF harvesting
> against the terms of service of the publishers, though, so they might
> not want to talk too openly about it)

I suspect that publishers like Elsevier forbid them to release details of
the contracts.

Note that according to your colleague AW, Elsevier have only allowed 20
institutions to text-mine over the last 5 years. That's about 1% of the
research-active universities. How likely are we to get success stories from

Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge