[open-science-dev] [Open-access] Fwd: [open-science] fw: Python NLTK/data mining/machine learning project of public research data, anyone interested?

Peter Murray-Rust pm286 at cam.ac.uk
Mon Aug 20 11:50:01 UTC 2012


On Mon, Aug 20, 2012 at 8:21 AM, Laurent Romary <laurent.romary at inria.fr> wrote:

> Dear all,
> Catching up this thread to answer on two issues:
> 1. Metadata formats for scholarly papers: this is something we have been
> working on extensively in the context of the EU PEER project (
> http://www.peerproject.eu/), where we mapped heterogeneous metadata coming
> from quite a wide spectrum of commercial STM publishers onto a single TEI
> (Text Encoding Initiative) based format.  The format is documented at
> http://hal.inria.fr/hal-00659856 and we gave a short communication (with
> slides) about it at the 2010 TEI conference:
> http://hal.inria.fr/inria-00537302. Using the TEI (an open standard) as a
> reference brings the work on scholarly documents into a wider corpus of
> encoded texts (see http://www.tei-c.org/Activities/Projects/ just to get
> a partial overview of the TEI community).
>

Sounds good - especially since we can use the PEER project as an effective
authority.  Mark, etc. - how does this map onto BibJSON? TEI used to be
SGML and very complex - I hope this is a subset that can be used as XML
without the need for an SGML parser.
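
For concreteness, a minimal sketch of what such a mapping could look like
in Python (the TEI element paths and BibJSON field names below are
illustrative assumptions, not the PEER schema verbatim):

    import xml.etree.ElementTree as ET

    TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

    def tei_biblstruct_to_bibjson(xml_string):
        """Map a minimal TEI biblStruct to a BibJSON-style dict.
        Element paths here are illustrative, not the PEER schema."""
        bibl = ET.fromstring(xml_string)
        return {
            "type": "article",
            "title": bibl.findtext(".//tei:analytic/tei:title",
                                   namespaces=TEI_NS),
            "author": [{"name": a.text} for a in
                       bibl.findall(".//tei:analytic/tei:author/tei:persName",
                                    TEI_NS)],
            "journal": {"name": bibl.findtext(".//tei:monogr/tei:title",
                                              namespaces=TEI_NS)},
            "year": bibl.findtext(".//tei:monogr/tei:imprint/tei:date",
                                  namespaces=TEI_NS),
        }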

> 2. We (Patrice Lopez, CCed) developed an open-source library of metadata
> and full-text extraction tools which creates a TEI-based
> representation of a PDF given as input. This could be particularly
> interesting for your work, I think, and an opportunity to join forces
> there.  See https://sourceforge.net/projects/grobid/
>

Java - great for me! I'll check it out later but this seems very
valuable. This is exactly what we need to interface with PubCrawler (unless
you also have a crawler). We can then download metadata from all the
publishers - we already have scrapers for several. That gives us the
metadata framework for managing and using Open bibliography.
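
A rough sketch of driving GROBID from Python, assuming a GROBID service
running locally (the port and endpoint follow the GROBID service
documentation, but treat them as assumptions for whichever version you
deploy):

    import requests  # third-party HTTP client

    def pdf_to_tei(pdf_path, grobid_url="http://localhost:8070"):
        """Send one PDF to a running GROBID service; return TEI XML."""
        with open(pdf_path, "rb") as pdf:
            response = requests.post(
                grobid_url + "/api/processFulltextDocument",
                files={"input": pdf},
            )
        response.raise_for_status()
        return response.text  # TEI representation of the full text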




> Cheers,
> Laurent
>
> On 18 August 2012, at 23:38, Peter Murray-Rust wrote:
>
> Thanks, Nathan.
> No one is going to put pressure on you to do anything - but the
> opportunity for projects to coalesce is there.
>
> On Sat, Aug 18, 2012 at 10:14 PM, Nathan Rice <
> nathan.alexander.rice at gmail.com> wrote:
>
>>
>> Wow, I'm surprised this has made its way around as much as it has.  I
>> suppose if a project makes a jaded bioinformatics guy like me excited,
>> it shouldn't surprise me that others would find it interesting too.
>>
> There is huge potential in automation, which is why it's exciting. I
> realise I didn't answer your original question - see later.
>
>> I'm still a little bit intimidated by the amount of work that will be
>> involved in getting a really solid, fully automated pipeline.
>
>
> Then take things at a pace that can be managed. No one has to be a hero by
> themselves.
>
>
>> I'm
>> trying to take it a step at a time.  I've almost finished a curated
>> list of plants, experimental molecules and compounds, and I'm
>> fine-tuning the PubMed search code to improve the initial
>> signal-to-noise ratio.
>>
>
> It may be that when you expose those you will find overlap with other
> people.
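
For the PubMed search step, a minimal sketch using Biopython's Entrez
module (the query string and email address are placeholders, not Nathan's
actual search):

    from Bio import Entrez  # Biopython

    Entrez.email = "you@example.org"  # NCBI requires a contact address

    def search_pubmed(query, retmax=1000):
        """Return the PubMed IDs matching a query."""
        handle = Entrez.esearch(db="pubmed", term=query, retmax=retmax)
        result = Entrez.read(handle)
        handle.close()
        return result["IdList"]

    # illustrative query combining a plant name with a publication type
    pmids = search_pubmed('"Panax ginseng"[Title/Abstract] AND clinical trial[pt]')
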
>
>>
>> I'm still not sure exactly how I want to go about selecting articles
>> to use as the training data set for article filtration.  A manually
>> curated list would probably work best, but given the number of
>> features that are available, I expect that the training set would need
>> to be at least 1,000 articles in size to get decent results.  This
>> might just be one of those cases where I need to bite the bullet, put
>> a large pot of coffee on, and get to work.
>>
> To do content mining properly requires a considerable annotated corpus.
> Generally it's split 3 ways - training, testing and validation. But such a
> corpus is very valuable. Unfortunately copyright normally means it can't be
> redistributed (I've had this fight with publishers). However, that will
> change as they realise that alienating the world won't work, as they aren't
> very competent totalitarians.
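
The three-way split itself is mechanical; a sketch with NLTK (the
features and labels below are toy placeholders for whatever a real
annotated corpus provides):

    import random
    from nltk.classify import NaiveBayesClassifier
    from nltk.classify.util import accuracy

    # toy corpus of (feature_dict, label) pairs; real features would come
    # from the article text (token presence, MeSH terms, journal, etc.)
    corpus = [
        ({"contains(ginseng)": True, "contains(trial)": True}, "relevant"),
        ({"contains(ginseng)": False, "contains(trial)": False}, "irrelevant"),
    ] * 50

    random.Random(0).shuffle(corpus)  # fixed seed for reproducibility
    n = len(corpus)
    train = corpus[: int(0.6 * n)]
    test = corpus[int(0.6 * n): int(0.8 * n)]
    validate = corpus[int(0.8 * n):]

    classifier = NaiveBayesClassifier.train(train)
    print(accuracy(classifier, test))  # tune on test; report validate once
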
>
>
>> >   - PDF hacking (I have done a lot of this but we need more: open font
>> > info, Postscript reconstruction)
>>
>> I have played with this a bit; one frustrating issue is that many
>> PDF analysis tools will randomly insert spaces due to font kerning, and
>> will order text based on vertical position on the page, rather than
>> preserving column order.  If there is a PDF text extraction tool that
>> doesn't do these things, I would love to know.
>>
>
> I work with PDFBox and pull this out character by character. I throw away
> all sequential information and only use coordinates and font size.  This
> works pretty well for me. I can see that excessive kerning and ligatures
> may defeat it, but at present I suspect I get fewer than 1 spurious space
> per 1000 characters. And remember we also have vocabularies to help tune this.
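
For anyone who wants the same idea in Python rather than Java, a rough
analogue using pdfminer.six (a different library from my PDFBox code, so
treat this as an untested sketch of the approach, not my pipeline):

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    def chars_with_coordinates(pdf_path):
        """Yield (char, x, y, font_size) tuples, discarding the tool's
        own reading order so columns can be rebuilt from coordinates."""
        for page in extract_pages(pdf_path):
            for element in page:
                if isinstance(element, LTTextContainer):
                    for line in element:
                        if isinstance(line, LTTextLine):
                            for obj in line:
                                if isinstance(obj, LTChar):
                                    yield (obj.get_text(), obj.x0,
                                           obj.y0, obj.size)
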
>
>>
>>
>> >   - shallow natural language processing and NLP resources (e.g. vocabs,
>> > character sets)
>> >   - classification techniques (e.g. Lucene/Solr) for text and diagrams
>> >
>> > I think if we harness all these we will have a large step change in the
>> > automation of extraction of scientific information from "the
>> > literature".
>> >
>> > And one by one the publishers will come to us because they will need us.
>>
>> It is really a shame that metadata isn't more standardized for journal
>> articles.
>
>
> We have been addressing this in the Open biblio project(s). BibJSON acts
> as an unofficial normalization of article metadata. If you mean
> domain-specific metadata, then we have to do this ourselves - and I am
> confident we can - it will be better than keywords (I have little faith in
> them).
>
>
>> PubMed MeSH terms and chemical lists are OK, but there is so
>> much more that could be annotated for the article.
>>
>> I am very interested in generic classifiers at this level.
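
The existing MeSH annotations at least give a baseline to classify
against. A sketch of pulling them with Biopython (the PMID and email
address are illustrative):

    from Bio import Entrez, Medline

    Entrez.email = "you@example.org"  # NCBI requires a contact address

    def mesh_terms(pmid):
        """Fetch the MeSH headings (MH field) of one PubMed record."""
        handle = Entrez.efetch(db="pubmed", id=pmid,
                               rettype="medline", retmode="text")
        record = next(iter(Medline.parse(handle)))
        handle.close()
        return record.get("MH", [])

    print(mesh_terms("12345678"))  # illustrative PMID
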
>
>
>
>> > Timescale - about 1 year to have something major to report - about 5
>> > years
>> > to change the way scientific information is managed.
>>
>> Scientific articles seem like the perfect place for semantic metadata.
>>  In particular, clinical trial articles should have a nice, standard
>> set of metadata artifacts for computer analysis, since they are so
>> cookie-cutter.
>>
>
> I looked at this 2-3 years ago - for clinical trials on nutrition. IIRC
> the abstracts contained very useful metadata - they were structured and used
> standard-ish terms. I think they could be NLP'ed quite well.
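
Because structured abstracts label their sections, even a crude regex
recovers usable fields - a sketch (the headings below are just the common
ones, not an exhaustive list):

    import re

    # common structured-abstract headings; real abstracts vary
    SECTION = re.compile(
        r"^(BACKGROUND|OBJECTIVES?|METHODS|RESULTS|CONCLUSIONS?):\s*",
        re.MULTILINE,
    )

    def split_structured_abstract(text):
        """Return {heading: section_text} for a labelled abstract."""
        parts = SECTION.split(text)
        # parts = [preamble, heading1, body1, heading2, body2, ...]
        return dict(zip(parts[1::2], (t.strip() for t in parts[2::2])))
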
>
>>
>>
>> I have actually already invested some of my (unfortunately scant)
>> resources into having people go through mined PubMed articles and
>> create metadata annotations.  Unfortunately, without a lot of machine
>> learning input set filtration, this is going to cost at least
>> $10,000-20,000 USD to finish for my purposes, and more every time the
>> list is updated.  It would be much better to get really solid
>> algorithms together so that nobody has to incur costs of this
>> magnitude :)
>>
>
> Ultimately humans have to validate the metadata. You need a measure of
> inter-annotator agreement. In chemistry we found that the maximum agreement
> between expert human chemists was 93% for whether a phrase was a chemical
> or not. Machines by definition cannot do better than this.
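
For two annotators the arithmetic is short enough to show directly - raw
percent agreement plus Cohen's kappa, which corrects for chance agreement
(the labels below are made up):

    from collections import Counter

    def percent_agreement(a, b):
        """Fraction of items two annotators label identically."""
        return sum(x == y for x, y in zip(a, b)) / len(a)

    def cohens_kappa(a, b):
        """Chance-corrected agreement between two annotators."""
        po = percent_agreement(a, b)
        ca, cb, n = Counter(a), Counter(b), len(a)
        pe = sum(ca[k] * cb[k] for k in ca) / (n * n)  # chance agreement
        return (po - pe) / (1 - pe)

    # e.g. two chemists labelling phrases as chemical / not chemical
    chemist_1 = ["chem", "chem", "not", "chem", "not"]
    chemist_2 = ["chem", "not", "not", "chem", "not"]
    print(percent_agreement(chemist_1, chemist_2))  # 0.8
    print(cohens_kappa(chemist_1, chemist_2))       # ~0.615
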
>
> It's tempting to develop crowdsourcing for annotation but it's important
> that the crowd is part of the project, not just passive slaves.
>
>>
>>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dept. of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
>
>
> Laurent Romary
> INRIA & HUB-IDSL
> laurent.romary at inria.fr
>
>
>
>


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069