[open-science-dev] [Open-access] Fwd: [open-science] fw: Python NLTK/data mining/machine learning project of public research data, anyone interested?

Nathan Rice nathan.alexander.rice at gmail.com
Sat Aug 18 21:14:03 UTC 2012


>> Thanks Daniel!
>>
>> The original email was from Nathan Rice who I've cc'd into the thread, if
>> you could reply to him.
>>
>> Nathan - Tom forwarded your email to the Open Knowledge Foundation lists,
>> we have some people interested in text and data mining on board so hopefully
>> they'll be able to offer some advice/assistance! We run periodic hack days,
>> mostly in the UK, but there's one coming up in Helsinki on 18 September if
>> you'd be interested in doing a remote demo; at the last one there was some
>> work on content-mining phylogenetic trees:
>> http://rossmounce.co.uk/2012/07/17/content-mining-for-phylogenetic-data/
>>

Wow, I'm surprised this has made its way around as much as it has.  I
suppose if a project makes a jaded bioinformatics guy like me excited,
it shouldn't surprise me that others would find it interesting too.

> This is really exciting. Be aware, Nathan, that if you come up with a good
> idea (which you have) you are likely to end up making it work!

I'm still a little intimidated by the amount of work that will be
involved in getting a really solid, fully automated pipeline, so I'm
trying to take it one step at a time.  I've almost finished a curated
list of plants, experimental molecules and compounds, and I'm
fine-tuning the PubMed search code to improve the initial
signal-to-noise ratio.
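
For the curious, the search step goes through NCBI's Entrez E-utilities;
here is a stripped-down sketch of that kind of query using Biopython's
Entrez module (the query term and email address are placeholders, not
what I actually use):

    from Bio import Entrez

    # NCBI asks for a contact address on every Entrez request
    Entrez.email = "you@example.org"  # placeholder

    # Example query; the real term list is built from the curated
    # plant/compound names
    handle = Entrez.esearch(db="pubmed",
                            term="Curcuma longa AND clinical trial[pt]",
                            retmax=500)
    record = Entrez.read(handle)
    handle.close()

    pmids = record["IdList"]
    print(len(pmids), "candidate articles")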

I'm still not sure exactly how I want to select the articles that will
make up the training set for article filtering.  A manually curated
list would probably work best, but given the number of available
features, I expect the training set would need to be at least 1,000
articles to get decent results.  This might just be one of those cases
where I need to bite the bullet, put on a large pot of coffee, and get
to work.
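
Once a labelled set exists, the classification step itself (the bigram
feature vector + SVM + ROC cutoff plan described in the quoted post
further down) would look roughly like this in scikit-learn.  The
labelled data here is a toy placeholder, and the real feature vector
would also fold in MeSH headings and journal:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.metrics import roc_curve

    # Toy placeholder data; the real set would be ~1,000 hand-labelled
    # abstracts, with 1 = applicable and 0 = not applicable
    abstracts = [
        "randomized trial of curcumin extract in knee osteoarthritis",
        "curcumin supplementation reduced pain scores versus placebo",
        "double blind study of ginger extract for nausea in pregnancy",
        "review of traditional culinary uses of turmeric",
        "market analysis of the herbal supplement industry",
        "survey of consumer attitudes toward herbal products",
    ]
    labels = [1, 1, 1, 0, 0, 0]

    # Unigram + bigram counts over the abstract text
    vectorizer = CountVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(abstracts)

    clf = LinearSVC()
    clf.fit(X, labels)

    # Decision-function scores feed the ROC curve; in practice the curve
    # would be computed on held-out articles and used to pick a cutoff
    scores = clf.decision_function(X)
    fpr, tpr, thresholds = roc_curve(labels, scores)
    print(list(zip(fpr, tpr, thresholds)))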

> I am copying the three lists because this is still very general but we
> should condense it later to more specialist topics.
>
> There is a real need for coordinating content-mining. There are a number of
> threads:
> * tools to systematically retrieve "documents", such as PubCrawler which
> crawls publishers' sites. At some stage we shall press the button and
> collect a large amount of metadata. This distributes very nicely with one
> scraper per publisher (about 100 in science according to Ross Mounce). We
> have scrapers for some of the major ones but will need this for others.
> First one required is BMC (which is ultra-legal as it's CC-BY)
> * protocols for content-mining - legal and technical
> * collation of experiences and groups in content-mining. This is really
> important. I've been hacking in this area for 3 months and I am not
> connected with other efforts nor they with me (apart from about 3 bio-
> ones). We need to bring content-miners together. Since a major barrier is
> publisher FUD let's give each other confidence. Also the OKF's Datahub is a
> great place to put Open results.
> * collation of technologies. There are at least:
>   - scrapers
>   - PDF hacking (I have done a lot of this but we need more: open font
> info, PostScript reconstruction)

I have played with this a bit.  One frustrating issue is that many PDF
analysis tools will randomly insert spaces due to font kerning, and will
order text by vertical position on the page rather than preserving
column order.  If there is a PDF text extraction tool that avoids both
of these problems, I would love to know about it.
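
One knob worth trying here is pdfminer's layout analysis parameters; a
rough sketch with pdfminer.six (the file name and parameter values are
placeholders, and the right settings probably vary by journal):

    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    # word_margin controls when a space is inserted between adjacent
    # glyphs; raising it can cut down on spurious mid-word spaces.
    # boxes_flow controls how detected text boxes are ordered: values
    # nearer -1.0 weight horizontal position more heavily, which tends
    # to keep two-column text in column order instead of interleaving.
    laparams = LAParams(word_margin=0.2, boxes_flow=-0.5)
    text = extract_text("article.pdf", laparams=laparams)
    print(text[:2000])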

>   - bit-map hacking for diagram analysis (I am excited by the quality of
> modern scientific diagrams and think there can be a lot of automation)

Although this is slightly outside the scope of what I'm doing, I do
agree this is very interesting.

>   - shallow natural language processing and NLP resources (e.g. vocabs,
> character sets)
>   - classification techniques (e.g. Lucene/Solr) for text and diagrams
>
> I think if we harness all these we will have a large step change in the
> automation of extraction of scientific information from "the literature".
>
> And one-by-one the publishers will come to us because they will need us.

It is really a shame that metadata isn't more standardized for journal
articles.  PubMed MeSH terms and chemical lists are OK, but there is so
much more about each article that could be annotated.
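
For what it's worth, pulling out the metadata that does exist is easy
enough; a rough sketch using Biopython's Medline parser (the PMIDs and
email address are placeholders):

    from Bio import Entrez, Medline

    Entrez.email = "you@example.org"  # placeholder contact address

    # Placeholder PMIDs; in practice these come from the search step
    pmids = ["12345678", "23456789"]

    handle = Entrez.efetch(db="pubmed", id=",".join(pmids),
                           rettype="medline", retmode="text")
    for rec in Medline.parse(handle):
        print(rec.get("PMID"), rec.get("TI"))
        print("  MeSH:", "; ".join(rec.get("MH", [])))
    handle.close()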

> Timescale - about 1 year to have something major to report - about 5 years
> to change the way scientific information is managed.

Scientific articles seem like the perfect place for semantic metadata.
In particular, clinical trial articles should have a nice, standard set
of metadata artifacts for computer analysis, since they are so
cookie-cutter.
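
Just to make that concrete, the kind of per-trial record I'm imagining
looks something like the following; the field names are a strawman
rather than any existing standard, and the values are invented:

    # Strawman per-trial record; field names and values are made up for
    # illustration, not taken from any existing metadata standard.
    trial = {
        "pmid": "00000000",                       # placeholder identifier
        "intervention": "Curcuma longa extract",  # treatment studied
        "condition": "knee osteoarthritis",       # condition treated
        "design": "randomized, double-blind, placebo-controlled",
        "sample_size": 120,
        "primary_outcome": "pain score at 8 weeks",
        "effect_direction": "improvement",        # relative to control
        "p_value": 0.03,
    }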

>> Jenny
>>
>> On Fri, Aug 17, 2012 at 1:39 PM, Daniel Lombraña González
>> <teleyinex at gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I think this project could be interesting for PyBossa in the sense that
>>> some data-mining and validation could be done by humans :-) I can give Tom
>>> more details if PyBossa is helpful :-)

I have actually already invested some of my (unfortunately scant)
resources into having people go through mined PubMed articles and
create metadata annotations.  Unfortunately, without a lot of
machine-learning filtering of the input set, this is going to cost at
least $10,000-20,000 USD to finish for my purposes, and more every time
the list is updated.  It would be much better to get really solid
algorithms together so that nobody has to incur costs of this
magnitude :)


>>> Cheers,
>>>
>>> Daniel
>>>
>>> On Fri, Aug 17, 2012 at 2:09 PM, Jenny Molloy <jcmcoppice12 at gmail.com>
>>> wrote:
>>>>
>>>> Hi All
>>>>
>>>> Apologies for cross-posting but this came out on open-science and I
>>>> thought it might be of interest to some of you as well
>>>>
>>>> Jenny
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Tom Roche <Tom_Roche at pobox.com>
>>>> Date: Fri, Aug 17, 2012 at 11:15 AM
>>>> Subject: [open-science] fw: Python NLTK/data mining/machine learning
>>>> project of public research data, anyone interested?
>>>> To: open-science at lists.okfn.org
>>>>
>>>>
>>>>
>>>> Dunno if the following is OT for this group, but thought this thread
>>>> from the local PUG might be of interest. (Note I don't know the
>>>> author personally; reply to him, not me.)
>>>>
>>>> http://mail.python.org/pipermail/trizpug/2012-August/001919.html
>>>> > Nathan Rice nathan.alexander.rice at gmail.com
>>>> > Thu Aug 16 20:31:00 CEST 2012
>>>>
>>>> > Hi All,
>>>>
>>>> > Normally, my projects are pretty boring, and I prefer to endure the
>>>> > suffering in solitary silence. As luck would have it though, I
>>>> > actually have an interesting project on my plate currently, and I
>>>> > think it is cool enough that I wanted to give other people the
>>>> > opportunity to stick their noses in and provide input or play with
>>>> > some code.
>>>>
>>>> > I am currently involved in compiling a database of medical data
>>>> > (published clinical or pre-clinical trials) surrounding ethno- and
>>>> > alternative-medicinal treatments, for semi-automated meta-analysis
>>>> > and treatment guidance. In order for this to work, a lot of
>>>> > technical challenges have to be overcome:
>>>>
>>>> > My initial tally from PubMed puts the number of articles at over
>>>> > 70,000; based on visual inspection, many of these are not actually
>>>> > applicable, but there are limited filtering options via the Entrez
>>>> > web API. Machine learning techniques would probably be very helpful
>>>> > at scoring articles for applicability, and ignoring ones that are
>>>> > clearly inapplicable.
>>>>
>>>> > In order to perform meta-analysis and treatment guidance, the
>>>> > article needs to be mined for treatment, condition, circumstances of
>>>> > treatment and condition, and whether it was successful or not (with
>>>> > some p value and sample size). Most of this is not available as
>>>> > standard metadata for the studies, and must be mined from the text.
>>>>
>>>> > In addition, not all studies are equal. Methodological errors, lack
>>>> > of reproducibility, and so forth can all render a study meaningless.
>>>> > Thus, studies must have a scoring mechanism so you can avoid
>>>> > tainting meta-analyses with biased data. This scoring mechanism will
>>>> > probably include the impact factor of the journal, the g/h-index of
>>>> > the authors, the number of (non self) citations, etc.
>>>>
>>>> > As you can see, each of these is meaty, and all of them need to be
>>>> > taken care of to get good results :) If anyone is interested in
>>>> > getting some serious natural language processing/data mining/machine
>>>> > learning practice, I'd love to involve you. There's no reason I
>>>> > should have all the fun!
>>>>
>>>> http://mail.python.org/pipermail/trizpug/2012-August/001920.html
>>>> > I'm still in the planning stages for most of the stuff; I have the
>>>> > PubMed extraction code pretty well nailed, and I have a solid
>>>> > outline for the article disqualification (create a feature vector
>>>> > out of topic and abstract bigrams, MeSH subject headings and
>>>> > journal, use an SVM discriminator and manually generate an ROC curve
>>>> > to determine the cutoff score) but I'm still very up in the air
>>>> > regarding NL extraction of things like sample size, significance,
>>>> > etc. If you'd like to learn more I would of course be happy to go
>>>> > over my thoughts on the matter and we can play around with some
>>>> > code.
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> ··········································································
>>> http://github.com/teleyinex
>>> http://www.flickr.com/photos/teleyinex
>>>
>>> ··········································································
>>> Please do NOT use proprietary file formats such as DOC and XLS for
>>> exchanging documents; use PDF, HTML, RTF, TXT, CSV or any other format
>>> that does not require a particular vendor's program to read the
>>> information they contain.
>>>
>>> ··········································································
>>
>>

Thanks again all,

Nathan
>
>
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069


