[open-science-dev] [Open-access] Fwd: [open-science] fw: Python NLTK/data mining/machine learning project of public research data, anyone interested?

Mon Aug 20 06:17:19 UTC 2012

Hi,

I think that Peter has raised several good points about human validation
and how the people doing it should not be treated like "monkeys" working
for the project :-)

At CCC we like to say that instead of crowdsourcing a project, what we
want is "crowdcrafting": all the people participating in the project build
it because it is important to them.

In any case, you may need to involve people in the later stages of
your project to validate some data, so if you want, you could use
PyBossa to do this. If you need more info about the project, let me know.

Cheers,

Daniel

On Sat, Aug 18, 2012 at 11:38 PM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:

> Thanks, Nathan.
> No one is going to put pressure on you to do anything - but the
> opportunity for jointly coalescing projects is there.
>
> On Sat, Aug 18, 2012 at 10:14 PM, Nathan Rice <
> nathan.alexander.rice at gmail.com> wrote:
>
>>
>> Wow, I'm surprised this has made its way around as much as it has.  I
>> suppose if a project makes a jaded bioinformatics guy like me excited,
>> it shouldn't surprise me that others would find it interesting too.
>>
> There is huge potential in automation, which is why it's exciting. I
> realise I didn't answer your original question - see later.
>
>> I'm still a little bit intimidated by the amount of work that will be
>> involved in getting a really solid, fully automated pipeline.
>
>
> Then take things at a pace that can be managed. No-one has to be a hero by
> themselves.
>
>
>> I'm
>> trying to take it a step at a time.  I've almost finished a curated
>> list of plants, experimental molecules and compounds, and I'm
>> fine-tuning the pubmed search code to improve the initial
>> signal-to-noise ratio.
>>
>
> It may be that when you expose those you will find overlap with other
> people.
>
>>
>> I'm still not sure exactly how I want to go about selecting articles
>> to use as the training data set for article filtration.  A manually
>> curated list would probably work best, but given the number of
>> features that are available, I expect that the training set would need
>> to be at least 1,000 articles in size to get decent results.  This
>> might just be one of those cases where I need to bite the bullet, put
>> a large pot of coffee on, and get to work.
>>
> To do content mining properly requires a considerable annotated corpus.
> Generally it's split 3 ways - training, testing and validation. But such a
> corpus is very valuable. Unfortunately copyright normally means it can't be
> redistributed (I've had this fight with publishers). However that will
> change as they realise that alienating the world won't work as they aren't
> very competent totalitarians.
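
A minimal sketch of the three-way split mentioned above, assuming the
annotated corpus can be represented as a list of document ids. The function
name split_corpus, the 60/20/20 proportions and the fixed seed are
illustrative assumptions, not something specified in this thread:

```python
import random

def split_corpus(doc_ids, train=0.6, test=0.2, seed=0):
    """Shuffle document ids and split them into training, testing and validation sets.

    The proportions are assumptions for illustration; whatever is left after
    the training and testing shares goes to validation.
    """
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)       # reproducible shuffle
    n_train = round(len(ids) * train)
    n_test = round(len(ids) * test)
    return (ids[:n_train],
            ids[n_train:n_train + n_test],
            ids[n_train + n_test:])

# Example with 1,000 placeholder article ids.
train_ids, test_ids, validation_ids = split_corpus(range(1000))
print(len(train_ids), len(test_ids), len(validation_ids))
```

Keeping the split keyed on ids rather than on article text means the
partition itself can be shared even when the articles cannot be.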
>
>
>> >   - PDF hacking (I have done a lot of this but we need more). Open font info.
>> > Postscript reconstruction
>>
>> I have played with this a bit. One frustrating issue is that many
>> PDF analysis tools will randomly insert spaces due to font kerning, and
>> will order text based on vertical position on the page, rather than
>> preserving column order.  If there is a PDF text extraction tool that
>> doesn't do these I would love to know.
>>
>
> I work with PDFBox and pull this out character by character. I throw away
> all sequential information and only use coordinates and font-size.  This
> works pretty well for me. I can see some excessive kernings and ligatures
> may defeat it but at present I suspect I get less than 1 spurious space per
> 1000 chars. And remember we also have vocabularies to help tune this.
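
The coordinate-only approach described above could look roughly like this
in Python. This is a sketch under assumptions, not the PDFBox-based code
itself: the Glyph record, the gap_ratio and line_tol thresholds and the
sample coordinates are all hypothetical, and a real extractor would supply
the positions. It handles a single column only.

```python
from collections import namedtuple

# Hypothetical glyph record: character, left x, baseline y, advance width, font size.
Glyph = namedtuple("Glyph", "char x y width size")

def glyphs_to_lines(glyphs, gap_ratio=0.25, line_tol=0.5):
    """Rebuild text lines from unordered glyphs using only position and font size."""
    lines = []
    # PDF y coordinates grow upward, so the largest y is the top of the page.
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - g.y) <= line_tol * g.size:
            lines[-1].append(g)       # same baseline, within tolerance
        else:
            lines.append([g])         # start a new line
    text = []
    for line in lines:
        line.sort(key=lambda g: g.x)  # left to right within the line
        chars = [line[0].char]
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            # Only a gap that is large relative to the font size counts as a space,
            # so small kerning adjustments do not produce spurious spaces.
            if gap > gap_ratio * cur.size:
                chars.append(" ")
            chars.append(cur.char)
        text.append("".join(chars))
    return text

# Glyphs deliberately given out of order: sequence information is ignored.
sample = [
    Glyph("d", 20.0, 700, 5, 10), Glyph("a", 0.0, 700, 5, 10),
    Glyph("f", 5.0, 688, 5, 10), Glyph("c", 15.0, 700, 5, 10),
    Glyph("b", 5.5, 700, 5, 10), Glyph("e", 0.0, 688, 5, 10),
]
print(glyphs_to_lines(sample))  # ['ab cd', 'ef']
```

The 0.5 unit kerning gap between "a" and "b" is ignored, while the 4.5 unit
gap before "c" becomes a space. A vocabulary lookup could then be used to
tune the threshold, as suggested above.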
>
>>
>>
>> >   - shallow natural language processing and NLP resources (e.g. vocabs,
>> > character sets)
>> >   - classification techniques (e.g. Lucene/Solr) for text and diagrams
>> >
>> > I think if we harness all these we will have a large step change in the
>> > automation of extraction of scientific information from "the literature".
>> >
>> > And one-by-one the publishers will come to us because they will need us.
>>
>> It is really a shame that metadata isn't more standardized for journal
>> articles.
>
>
> We have been addressing this in the Open biblio project(s). BibJSON acts
> as an unofficial normalization of article metadata. If you mean
> domain-specific metadata then we have to do this ourselves - and I am
> confident we can - it will be better than keywords (I have little faith in
> them)
>
>
>>  Pubmed MeSH terms and chemical lists are OK but there is so
>> much more that could be annotated for the article.
>>
> I am very interested in generic classifiers at this level.
>
>
>
>> > Timescale - about 1 year to have something major to report - about 5 years
>> > to change the way scientific information is managed.
>>
>> Scientific articles seem like the perfect place for semantic metadata.
>>  In particular, clinical trial articles should have a nice, standard
>> set of metadata artifacts for computer analysis, since they are so
>> cookie cutter.
>>
>
> I looked at this 2-3 years ago - for clinical trials on nutrition. IIRC
> the abstracts were very useful metadata - they were structured and used
> standard-ish terms. I think they could be NLP'ed quite well.
>
>>
>>
>> I have actually already invested some of my (unfortunately scant)
>> resources into having people go through mined pubmed articles and
>> create metadata annotations.  Unfortunately, without a lot of machine
>> learning input set filtration, this is going to cost at least
>> $10,000-20,000 USD to finish for my purposes, and more every time the
>> list is updated.  It would be much better to get really solid
>> algorithms together so that nobody has to incur costs of this
>> magnitude :)
>>
>
> Ultimately humans have to validate the metadata. You need to measure
> inter-annotator agreement. In chemistry we found that the maximum agreement
> between expert human chemists was 93% for whether a phrase was a chemical
> or not. Machines by definition cannot do better than this.
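
As a rough illustration of how inter-annotator agreement can be measured,
the sketch below computes raw percentage agreement (the kind of figure
quoted above) for two annotators labelling the same phrases. It also
computes Cohen's kappa, a common chance-corrected measure; kappa is not
mentioned in this thread and is included here as an assumption about what
one might want to report. The labels are made-up sample data.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two annotators gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Probability that both pick the same label if each labelled at random
    # according to their own label frequencies.
    p_expected = sum(count_a[label] * count_b[label]
                     for label in set(a) | set(b)) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Made-up labels: 1 = "phrase is a chemical", 0 = "not a chemical".
annotator_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]

print(percent_agreement(annotator_1, annotator_2))  # 0.8
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # 0.6
```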
>
> It's tempting to develop crowdsourcing for annotation but it's important
> that the crowd is part of the project, not just passive slaves.
>
>>
>>
>> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
>

-- 
··························································································································································
http://github.com/teleyinex
http://www.flickr.com/photos/teleyinex
··························································································································································
Please do NOT use proprietary file formats such as DOC and XLS for
exchanging documents; use PDF, HTML, RTF, TXT, CSV or any other
format that does not require a program from a specific vendor
to process the information it contains.
··························································································································································