[open-science] fw: Python NLTK/data mining/machine learning project of public research data, anyone interested?

Tom Roche Tom_Roche at pobox.com
Fri Aug 17 10:15:47 UTC 2012

Dunno if the following is OT for this group, but thought this thread
from the local PUG might be of interest. (Note I don't know the
author personally; reply to him, not me.)

> Nathan Rice nathan.alexander.rice at gmail.com
> Thu Aug 16 20:31:00 CEST 2012

> Hi All,

> Normally, my projects are pretty boring, and I prefer to endure the
> suffering in solitary silence. As luck would have it though, I
> actually have an interesting project on my plate currently, and I
> think it is cool enough that I wanted to give other people the
> opportunity to stick their noses in and provide input or play with
> some code.

> I am currently involved in compiling a database of medical data
> (published clinical or pre-clinical trials) surrounding ethno- and
> alternative-medicinal treatments, for semi-automated meta-analysis
> and treatment guidance. In order for this to work, a lot of
> technical challenges have to be overcome:

> My initial tally from PubMed puts the number of articles at over
> 70,000; based on visual inspection, many of these are not actually
> applicable, but there are limited filtering options via the Entrez
> web API. Machine learning techniques would probably be very helpful
> at scoring articles for applicability, and ignoring ones that are
> clearly inapplicable.

> In order to perform meta-analysis and treatment guidance, the
> article needs to be mined for treatment, condition, circumstances of
> treatment and condition, and whether it was successful or not (with
> some p value and sample size). Most of this is not available as
> standard metadata for the studies, and must be mined from the text.
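
For anyone wondering what the text mining might look like at its simplest, here is a rough regex sketch for pulling sample sizes and p values out of an abstract. The patterns and the extract_stats name are my own assumptions, not the author's code, and real abstracts will vary far more than this handles.

```python
import re

# Hypothetical patterns (assumptions for illustration only).
# Matches forms like "n = 120" or "N=1,204".
SAMPLE_SIZE = re.compile(r"\b[nN]\s*=\s*(\d[\d,]*)")
# Matches forms like "p < 0.05" or "P=.003".
P_VALUE = re.compile(r"\b[pP]\s*(<=|>=|<|>|=)\s*(0?\.\d+)")


def extract_stats(text):
    """Return sample sizes and (comparator, p value) pairs found in text."""
    sizes = [int(m.group(1).replace(",", ""))
             for m in SAMPLE_SIZE.finditer(text)]
    p_values = [(m.group(1), float(m.group(2)))
                for m in P_VALUE.finditer(text)]
    return {"sample_sizes": sizes, "p_values": p_values}


abstract = ("Patients (n = 120) received the extract; pain scores "
            "improved versus placebo (P < 0.05).")
print(extract_stats(abstract))
```

This says nothing about which treatment or condition a number belongs to, which is presumably the hard part.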

> In addition, not all studies are equal. Methodological errors, lack
> of reproducibility, and so forth can all render a study meaningless.
> Thus, studies must have a scoring mechanism so you can avoid
> tainting meta-analyses with biased data. This scoring mechanism will
> probably include the impact factor of the journal, the g/h-index of
> the authors, the number of (non self) citations, etc.
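
As a reference point for the author-level metrics mentioned above, here is a minimal sketch of the usual h-index and g-index definitions, computed from a list of per-paper citation counts. The function names are mine, and how these would be weighted into an overall study score is left open, since the post does not say.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def g_index(citations):
    """Largest g such that the top g papers total at least g*g citations.

    This variant caps g at the number of papers.
    """
    ranked = sorted(citations, reverse=True)
    total = 0
    g = 0
    for rank, count in enumerate(ranked, start=1):
        total += count
        if total >= rank * rank:
            g = rank
    return g


# Citation counts would need self-citations removed beforehand
# to match the "non self" requirement in the post.
counts = [10, 8, 5, 4, 3]
print(h_index(counts), g_index(counts))
```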

> As you can see, each of these is meaty, and all of them need to be
> taken care of to get good results :) If anyone is interested in
> getting some serious natural language processing/data mining/machine
> learning practice, I'd love to involve you. There's no reason I
> should have all the fun!

> I'm still in the planning stages for most of the stuff; I have the
> PubMed extraction code pretty well nailed, and I have a solid
> outline for the article disqualification (create a feature vector
> out of topic and abstract bigrams, MeSH subject headings and
> journal, use an SVM discriminator and manually generate a ROC curve
> to determine the cutoff score) but I'm still very up in the air
> regarding NL extraction of things like sample size, significance,
> etc. If you'd like to learn more I would of course be happy to go
> over my thoughts on the matter and we can play around with some
> code.
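
To make the disqualification outline a bit more concrete, here is a small stdlib-only sketch of two of its pieces: counting word bigrams for a feature vector, and turning classifier scores plus hand-assigned labels into ROC points from which a cutoff can be picked by eye. The SVM itself is not shown; the scores below are made-up numbers standing in for its output, and both function names are assumptions of mine.

```python
from collections import Counter


def bigrams(text):
    """Count adjacent word pairs in lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def roc_points(scores, labels):
    """Return (false positive rate, true positive rate, threshold) triples.

    labels are 1 for applicable articles and 0 for inapplicable ones;
    an article is predicted applicable when its score >= threshold.
    Assumes both classes are present in labels.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives, threshold))
    return points


print(bigrams("Herbal extract reduces pain"))

# Illustrative scores only, not real classifier output.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
for fpr, tpr, threshold in roc_points(scores, labels):
    print(fpr, tpr, threshold)
```

Each printed row is one candidate cutoff; a lower threshold keeps more applicable articles at the cost of letting more inapplicable ones through.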
