[Open-access] [open-science-dev] Fwd: [open-science] fw: Python NLTK/data mining/machine learning project of public research data, anyone interested?

Peter Murray-Rust pm286 at cam.ac.uk
Sat Aug 18 21:38:36 UTC 2012


Thanks, Nathan.
No one is going to put pressure on you to do anything - but the
opportunity for jointly coalescing projects is there.

On Sat, Aug 18, 2012 at 10:14 PM, Nathan Rice <
nathan.alexander.rice at gmail.com> wrote:

>
> Wow, I'm surprised this has made its way around as much as it has.  I
> suppose if a project makes a jaded bioinformatics guy like me excited,
> it shouldn't surprise me that others would find it interesting too.
>

There is huge potential in automation, which is why it's exciting. I
realise I didn't answer your original question - see later.

> I'm still a little bit intimidated by the amount of work that will be
> involved in getting a really solid, fully automated pipeline.


Then take things at a pace that can be managed. No-one has to be a hero by
themselves.


> I'm
> trying to take it a step at a time.  I've almost finished a curated
> list of plants, experimental molecules and compounds, and I'm
> fine-tuning the PubMed search code to improve the initial
> signal-to-noise ratio.
>

It may be that when you expose those you will find overlap with other
people.

>
> I'm still not sure exactly how I want to go about selecting articles
> to use as the training data set for article filtration.  A manually
> curated list would probably work best, but given the number of
> features that are available, I expect that the training set would need
> to be at least 1,000 articles in size to get decent results.  This
> might just be one of those cases where I need to bite the bullet, put
> a large pot of coffee on, and get to work.
>

To do content mining properly requires a considerable annotated corpus.
Generally it's split 3 ways - training, testing and validation. But such a
corpus is very valuable. Unfortunately copyright normally means it can't be
redistributed (I've had this fight with publishers). However that will
change as they realise that alienating the world won't work as they aren't
very competent totalitarians.
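
As a rough illustration of the three-way split, here is a minimal sketch in Python. The 80/10/10 proportions, the fixed seed and the function name `three_way_split` are my assumptions for the example, not anything agreed in this thread; the only point is that each annotated article ends up in exactly one of the three sets.

```python
import random


def three_way_split(doc_ids, train=0.8, test=0.1, seed=0):
    """Shuffle document ids and cut them into training, testing and
    validation lists. The proportions are illustrative assumptions;
    whatever is left after train and test becomes validation."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split repeatable
    n_train = int(len(ids) * train)
    n_test = int(len(ids) * test)
    training = ids[:n_train]
    testing = ids[n_train:n_train + n_test]
    validation = ids[n_train + n_test:]
    return training, testing, validation


if __name__ == "__main__":
    # Hypothetical corpus of 1,000 annotated articles, identified by number.
    training, testing, validation = three_way_split(range(1000))
    print(len(training), len(testing), len(validation))
```

Keeping the seed fixed means the same articles stay in the same set between runs, which matters if the corpus itself cannot be redistributed and only the list of identifiers is shared.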


> >   - PDF hacking (I have done a lot of this but we need more. Open font
> > info. Postscript reconstruction
>
> I have played with this a bit. One frustrating issue is that many
> PDF analysis tools will randomly insert spaces due to font kerning, and
> will order text based on vertical position on the page, rather than
> preserving column order.  If there is a PDF text extraction tool that
> doesn't do these I would love to know.
>

I work with PDFBox and pull this out character by character. I throw away
all sequential information and only use coordinates and font-size.  This
works pretty well for me. I can see some excessive kernings and ligatures
may defeat it but at present I suspect I get less than 1 spurious space per
1000 chars. And remember we also have vocabularies to help tune this.
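
To make the coordinate-only idea concrete, here is a minimal sketch in Python. It is not PDFBox code and not the actual implementation described above: the `Glyph` record, the `rebuild_lines` function and both threshold ratios are assumptions made up for the illustration. It takes characters in any order, groups them into lines by vertical position, sorts each line left to right, and inserts a space only when the horizontal gap is large relative to the font size. It does not attempt column detection.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One character with its position (hypothetical record layout)."""
    char: str
    x: float          # left edge of the character
    y: float          # baseline; assumed to increase down the page
    width: float
    font_size: float


def rebuild_lines(glyphs, space_ratio=0.25, line_ratio=0.5):
    """Rebuild text lines from positions alone, ignoring input order.

    space_ratio and line_ratio are assumed tuning values: a gap wider
    than space_ratio * font_size becomes a space, and a baseline shift
    larger than line_ratio * font_size starts a new line.
    """
    lines = []
    for g in sorted(glyphs, key=lambda g: (g.y, g.x)):
        if lines and abs(g.y - lines[-1][0].y) <= line_ratio * g.font_size:
            lines[-1].append(g)
        else:
            lines.append([g])

    texts = []
    for line in lines:
        line.sort(key=lambda g: g.x)
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            if gap > space_ratio * min(prev.font_size, cur.font_size):
                text += " "
            text += cur.char
        texts.append(text)
    return texts


if __name__ == "__main__":
    # Characters deliberately supplied out of order, with a small
    # kerning gap between "a" and "b" that should not become a space.
    page = [
        Glyph("d", 19.0, 100.0, 5.0, 10.0),
        Glyph("f", 5.0, 112.0, 5.0, 10.0),
        Glyph("a", 0.0, 100.0, 5.0, 10.0),
        Glyph("c", 14.0, 100.0, 5.0, 10.0),
        Glyph("e", 0.0, 112.0, 5.0, 10.0),
        Glyph("b", 5.2, 100.0, 5.0, 10.0),
    ]
    print(rebuild_lines(page))
```

A vocabulary check could be layered on top of this, for example by dropping a doubtful space when the joined token is a known word and the two halves are not.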

>
>
> >   - shallow natural language processing and NLP resources (e.g. vocabs,
> > character sets)
> >   - classification techniques (e.g. Lucene/Solr) for text and diagrams
> >
> > I think if we harness all these we will have a large step change in the
> > automation of extraction of scientific information from "the literature".
> >
> > And one-by-one the publishers will come to us because they will need us.
>
> It is really a shame that metadata isn't more standardized for journal
> articles.


We have been addressing this in the Open biblio project(s). BibJSON acts as
an unofficial normalization of article metadata. If you mean
domain-specific metadata then we have to do this ourselves - and I am
confident we can - it will be better than keywords (I have little faith in
them).


> PubMed MeSH terms and chemical lists are OK but there is so
> much more that could be annotated for the article.
>

I am very interested in generic classifiers at this level.



> > Timescale - about 1 year to have something major to report - about 5
> > years to change the way scientific information is managed.
>
> Scientific articles seem like the perfect place for semantic metadata.
>  In particular, clinical trial articles should have a nice, standard
> set of metadata artifacts for computer analysis, since they are so
> cookie cutter.
>

I looked at this 2-3 years ago - for clinical trials on nutrition. IIRC the
abstracts were very useful metadata - they were structured and used
standard-ish terms. I think they could be NLP'ed quite well.

>
>
> I have actually already invested some of my (unfortunately scant)
> resources into having people go through mined PubMed articles and
> create metadata annotations.  Unfortunately, without a lot of machine
> learning input set filtration, this is going to cost at least
> $10,000-20,000 USD to finish for my purposes, and more every time the
> list is updated.  It would be much better to get really solid
> algorithms together so that nobody has to incur costs of this
> magnitude :)
>

Ultimately humans have to validate the metadata. You need a measure of
inter-annotator agreement. In chemistry we found that the maximum agreement
between expert human chemists was 93% for whether a phrase was a chemical
or not. Machines by definition cannot do better than this.
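
For anyone who has not measured this before, here is a minimal sketch in Python of two common ways to score agreement between a pair of annotators. Raw agreement is the kind of percentage quoted above; Cohen's kappa, which corrects for agreement expected by chance, is an addition for the example and not a figure from our chemistry work. The function names and the toy labels are assumptions for illustration.

```python
from collections import Counter


def observed_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement corrected for chance, for two annotators."""
    n = len(labels_a)
    p_observed = observed_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators pick the same label independently.
    p_chance = sum(
        counts_a[label] * counts_b[label]
        for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


if __name__ == "__main__":
    # Toy data: 1 = "this phrase is a chemical", 0 = "it is not".
    annotator_1 = [1, 1, 0, 0]
    annotator_2 = [1, 0, 0, 0]
    print(observed_agreement(annotator_1, annotator_2))
    print(cohens_kappa(annotator_1, annotator_2))
```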

It's tempting to develop crowdsourcing for annotation but it's important
that the crowd is part of the project, not just passive slaves.

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069