[Open-access] [open-science-dev] Fwd: [open-science] fw: Python NLTK/data mining/machine learning project of public research data, anyone interested?

Laurent Romary laurent.romary at inria.fr
Mon Aug 20 07:21:39 UTC 2012


Dear all,
Catching up on this thread to respond to two points:
1. Metadata formats for scholarly papers: we have worked on this extensively in the context of the EU PEER project (http://www.peerproject.eu/), where we mapped heterogeneous metadata from a wide spectrum of commercial STM publishers onto a single format based on the TEI (Text Encoding Initiative). The format is documented at http://hal.inria.fr/hal-00659856, and we gave a short communication (with slides) about it at the 2010 TEI conference: http://hal.inria.fr/inria-00537302. Using the TEI (an open standard) as a reference places the work on scholarly documents within a wider corpus of encoded texts (see http://www.tei-c.org/Activities/Projects/ for a partial overview of the TEI community).
2. We (Patrice Lopez, CCed) developed an open source library of metadata and full-text extraction tools that creates a TEI-based representation of a PDF given as input. I think this could be particularly interesting for your work, and an opportunity to join forces. See https://sourceforge.net/projects/grobid/
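
To give a rough idea of what consuming such a TEI-based representation might look like from Python, here is a minimal sketch. It only assumes the standard TEI namespace and a few common header elements (titleStmt, analytic, author, persName); the sample document and the function name extract_metadata are hypothetical, and the actual output of GROBID or the PEER format will be richer and may be laid out differently.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace; every element in a TEI document lives in it.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, hand-written TEI header for illustration only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>An example article title</title>
      </titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author><persName><forename>Ada</forename><surname>Example</surname></persName></author>
            <author><persName><forename>Bob</forename><surname>Sample</surname></persName></author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def extract_metadata(tei_xml):
    """Return the title and author names found in a TEI header string."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    authors = []
    for pers in root.iterfind(".//tei:analytic/tei:author/tei:persName", TEI_NS):
        forename = pers.findtext("tei:forename", default="", namespaces=TEI_NS)
        surname = pers.findtext("tei:surname", default="", namespaces=TEI_NS)
        authors.append(f"{forename} {surname}".strip())
    return {"title": title.strip(), "authors": authors}


if __name__ == "__main__":
    print(extract_metadata(SAMPLE_TEI))
```

Because the publisher metadata is mapped onto one schema, a single function of this kind could in principle serve every source, which is the main attraction of a common target format.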
Cheers,
Laurent

On 18 Aug 2012, at 23:38, Peter Murray-Rust wrote:

> Thanks, Nathan.
> No one is going to put pressure on you to do anything - but the opportunity for jointly coalescing projects is there.
> 
> On Sat, Aug 18, 2012 at 10:14 PM, Nathan Rice <nathan.alexander.rice at gmail.com> wrote:
> 
> Wow, I'm surprised this has made its way around as much as it has.  I
> suppose if a project makes a jaded bioinformatics guy like me excited,
> it shouldn't surprise me that others would find it interesting too.
> 
> There is huge potential in automation which is why it's exciting. I realise I didn't answer your original question - see later. 
> 
> I'm still a little bit intimidated by the amount of work that will be
> involved in getting a really solid, fully automated pipeline.  
> 
> Then take things at a pace that can be managed. No-one has to be a hero by themselves.
>  
> I'm
> trying to take it a step at a time.  I've almost finished a curated
> list of plants, experimental molecules and compounds, and I'm
> fine-tuning the PubMed search code to improve the initial
> signal-to-noise ratio.
> 
> It may be that when you expose those you will find overlap with other people. 
> 
> I'm still not sure exactly how I want to go about selecting articles
> to use as the training data set for article filtration.  A manually
> curated list would probably work best, but given the number of
> features that are available, I expect that the training set would need
> to be at least 1,000 articles in size to get decent results.  This
> might just be one of those cases where I need to bite the bullet, put
> a large pot of coffee on, and get to work.
> 
> To do content mining properly requires a considerable annotated corpus. Generally it's split 3 ways - training, testing and validation. But such a corpus is very valuable. Unfortunately copyright normally means it can't be redistributed (I've had this fight with publishers). However that will change as they realise that alienating the world won't work as they aren't very competent totalitarians.
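
The three-way split mentioned above can be sketched in a few lines of Python. This is only an illustration: the 60/20/20 proportions, the fixed seed and the name split_corpus are assumptions, not something proposed in the thread, and a real annotated corpus may need stratification by journal or topic.

```python
import random


def split_corpus(doc_ids, train=0.6, test=0.2, seed=0):
    """Shuffle document ids and split them into train / test / validation.

    The validation share is whatever remains after train and test.
    """
    if not (0 < train < 1 and 0 < test < 1 and train + test < 1):
        raise ValueError("train and test must be fractions summing to less than 1")
    ids = list(doc_ids)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train)
    n_test = round(len(ids) * test)
    return {
        "train": ids[:n_train],
        "test": ids[n_train:n_train + n_test],
        "validation": ids[n_train + n_test:],
    }


if __name__ == "__main__":
    parts = split_corpus(range(1000))
    print({name: len(ids) for name, ids in parts.items()})
```

With the 1,000 articles suggested earlier, this leaves 600 for training and 200 each for testing and validation.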
>  
> >   - PDF hacking (I have done a lot of this but we need more): open font info,
> >     PostScript reconstruction
> 
> I have played with this a bit. One issue that is frustrating is that many
> PDF analysis tools will randomly insert spaces due to font kerning, and
> will order text based on vertical position on the page, rather than
> preserving column order.  If there is a PDF text extraction tool that
> doesn't do these I would love to know.
> 
> I work with PDFBox and pull this out character by character. I throw away all sequential information and only use coordinates and font-size.  This works pretty well for me. I can see some excessive kernings and ligatures may defeat it but at present I suspect I get less than 1 spurious space per 1000 chars. And remember we also have vocabularies to help tune this.
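
The coordinate-only idea described above can be illustrated in Python. This is a sketch under stated assumptions, not the PDFBox-based code mentioned in the thread: the Glyph record, the thresholds line_tol and space_ratio, and the convention that y grows down the page are all hypothetical. It also handles a single column only; multi-column pages would first need the glyphs partitioned by x range.

```python
from collections import namedtuple

# Hypothetical per-character record: position of the left edge, baseline,
# advance width and font size, as a PDF library might report them.
Glyph = namedtuple("Glyph", "char x y width font_size")


def glyphs_to_lines(glyphs, line_tol=0.5, space_ratio=0.25):
    """Rebuild text lines from unordered glyphs using coordinates only."""
    lines = []  # each entry is a list of glyphs sharing a baseline
    for g in sorted(glyphs, key=lambda g: (g.y, g.x)):
        # Same line if the baseline is within a fraction of the font size.
        if lines and abs(g.y - lines[-1][0].y) <= line_tol * g.font_size:
            lines[-1].append(g)
        else:
            lines.append([g])

    texts = []
    for line in lines:
        line.sort(key=lambda g: g.x)
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            # Only a gap that is large relative to the font size is a space;
            # small kerning gaps are ignored.
            if gap > space_ratio * cur.font_size:
                text += " "
            text += cur.char
        texts.append(text)
    return texts


if __name__ == "__main__":
    page = [
        Glyph("d", 18, 100, 5, 10), Glyph("a", 0, 100, 5, 10),
        Glyph("f", 6, 112, 5, 10), Glyph("c", 13, 100, 5, 10),
        Glyph("b", 5, 100, 5, 10), Glyph("e", 0, 112, 5, 10),
    ]
    print(glyphs_to_lines(page))
```

Scaling the space threshold by font size is what keeps a one-unit kerning gap (between "e" and "f" here) from being read as a word break, while the three-unit gap before "c" is.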
> 
> 
> >   - shallow natural language processing and NLP resources (e.g. vocabs,
> > character sets)
> >   - classification techniques (e.g. Lucene/Solr) for text and diagrams
> >
> > I think if we harness all these we will have a large step change in the
> > automation of extraction of scientific information from "the literature".
> >
> > And one-by-one the publishers will come to us because they will need us.
> 
> It is really a shame that metadata isn't more standardized for journal
> articles.
> 
> We have been addressing this in the Open biblio project(s). BibJSON acts as an unofficial normalization of article metadata. If you mean domain-specific metadata then we have to do this ourselves - and I am confident we can - it will be better than keywords (I have little faith in them) 
>  
>  PubMed MeSH terms and chemical lists are OK but there is so
> much more that could be annotated for the article.
> 
> I am very interested in generic classifiers at this level.
> 
>  
> > Timescale - about 1 year to have something major to report - about 5 years
> > to change the way scientific information is managed.
> 
> Scientific articles seem like the perfect place for semantic metadata.
>  In particular, clinical trial articles should have a nice, standard
> set of metadata artifacts for computer analysis, since they are so
> cookie cutter.
> 
> I looked at this 2-3 years ago - for clinical trials on nutrition. IIRC the abstracts were very useful metadata - they were structured and used standard-ish terms. I think they could be NLP'ed quite well.
> 
> 
> I have actually already invested some of my (unfortunately scant)
> resources into having people go through mined PubMed articles and
> create metadata annotations.  Unfortunately, without a lot of machine
> learning input set filtration, this is going to cost at least
> $10,000-20,000 USD to finish for my purposes, and more every time the
> list is updated.  It would be much better to get really solid
> algorithms together so that nobody has to incur costs of this
> magnitude :)
> 
> Ultimately humans have to validate the metadata. You need an inter-annotator agreement. In chemistry we found that the maximum agreement between expert human chemists was 93% for whether a phrase was a chemical or not. Machines by definition cannot do better than this. 
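
For concreteness, here is one way inter-annotator agreement might be computed in Python. The raw percentage is presumably the kind of figure quoted above (an assumption on my part); Cohen's kappa is a standard chance-corrected measure for two annotators that is not mentioned in the thread and is included only for comparison. The labels and function names are illustrative.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    p_expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if p_expected == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    annotator_1 = ["chemical", "chemical", "other", "other"]
    annotator_2 = ["chemical", "other", "other", "other"]
    print(percent_agreement(annotator_1, annotator_2))
    print(cohens_kappa(annotator_1, annotator_2))
```

On this toy example the annotators agree on 3 of 4 phrases (0.75), but kappa is only 0.5, because half of that agreement would be expected by chance given how often each label was used.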
> 
> It's tempting to develop crowdsourcing for annotation but it's important that the crowd is part of the project, not just passive slaves.
> 
> 
> -- 
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069

Laurent Romary
INRIA & HUB-IDSL
laurent.romary at inria.fr


