[open-science-dev] Fwd: [open-science] fw: Python NLTK/data mining/machine learning project of public research data, anyone interested?

Daniel Lombraña González teleyinex at gmail.com
Fri Aug 17 12:39:50 UTC 2012


Hi,

I think this project could be interesting for PyBossa, in the sense that
some of the data mining and validation could be done by humans :-) I can
give Tom more details if PyBossa looks helpful :-)

Cheers,

Daniel

On Fri, Aug 17, 2012 at 2:09 PM, Jenny Molloy <jcmcoppice12 at gmail.com> wrote:

> Hi All
>
> Apologies for cross-posting but this came out on open-science and I
> thought it might be of interest to some of you as well
>
> Jenny
>
> ---------- Forwarded message ----------
> From: Tom Roche <Tom_Roche at pobox.com>
> Date: Fri, Aug 17, 2012 at 11:15 AM
> Subject: [open-science] fw: Python NLTK/data mining/machine learning
> project of public research data, anyone interested?
> To: open-science at lists.okfn.org
>
>
>
> Dunno if the following is OT for this group, but thought this thread
> from the local PUG might be of interest. (Note I don't know the
> author personally; reply to him, not me.)
>
> http://mail.python.org/pipermail/trizpug/2012-August/001919.html
> > Nathan Rice nathan.alexander.rice at gmail.com
> > Thu Aug 16 20:31:00 CEST 2012
>
> > Hi All,
>
> > Normally, my projects are pretty boring, and I prefer to endure the
> > suffering in solitary silence. As luck would have it though, I
> > actually have an interesting project on my plate currently, and I
> > think it is cool enough that I wanted to give other people the
> > opportunity to stick their noses in and provide input or play with
> > some code.
>
> > I am currently involved in compiling a database of medical data
> > (published clinical or pre-clinical trials) surrounding ethno- and
> > alternative-medicinal treatments, for semi-automated meta-analysis
> > and treatment guidance. For this to work, a number of technical
> > challenges have to be overcome:
>
> > My initial tally from PubMed puts the number of articles at over
> > 70,000; based on visual inspection, many of these are not actually
> > applicable, but there are limited filtering options via the Entrez
> > web API. Machine learning techniques would probably be very helpful
> > at scoring articles for applicability, and ignoring ones that are
> > clearly inapplicable.
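[Editorial illustration: before any machine learning, the "clearly inapplicable" articles could be dropped with a crude keyword pre-filter. The keyword list and record schema below are invented for illustration, not part of Nathan's proposal.]

```python
# Crude pre-filter: keep only records whose title/abstract mention at least
# one topic keyword. An ML classifier would then score the survivors.
APPLICABLE_TERMS = {"clinical trial", "placebo", "randomized", "efficacy"}  # placeholder list

def is_plausibly_applicable(record):
    """record: dict with 'title' and 'abstract' strings (hypothetical schema)."""
    text = (record.get("title", "") + " " + record.get("abstract", "")).lower()
    return any(term in text for term in APPLICABLE_TERMS)

def prefilter(records):
    return [r for r in records if is_plausibly_applicable(r)]
```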
>
> > In order to perform meta-analysis and treatment guidance, the
> > article needs to be mined for treatment, condition, circumstances of
> > treatment and condition, and whether it was successful or not (with
> > some p value and sample size). Most of this is not available as
> > standard metadata for the studies, and must be mined from the text.
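[Editorial illustration: two of the easier targets, sample size and p-value, can be pulled out with regular expressions as a first pass. The patterns below are assumptions about how these values are typically written; treatment and condition extraction would need real NLP (e.g. NLTK chunking or a trained NER model), not regexes.]

```python
import re

# Hypothetical patterns for "n = 48" style sample sizes and "p < 0.05"
# style significance reports in abstract text.
SAMPLE_RE = re.compile(r"\bn\s*=\s*(\d+)", re.IGNORECASE)
PVALUE_RE = re.compile(r"\bp\s*[<=]\s*(0?\.\d+)", re.IGNORECASE)

def extract_stats(text):
    """Return the sample sizes and p-values found in a block of text."""
    return {
        "sample_sizes": [int(m) for m in SAMPLE_RE.findall(text)],
        "p_values": [float(m) for m in PVALUE_RE.findall(text)],
    }
```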
>
> > In addition, not all studies are equal. Methodological errors, lack
> > of reproducibility, and so forth can all render a study meaningless.
> > Thus, studies must have a scoring mechanism so you can avoid
> > tainting meta-analyses with biased data. This scoring mechanism will
> > probably include the impact factor of the journal, the g/h-index of
> > the authors, the number of (non-self) citations, etc.
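[Editorial illustration: such a scoring mechanism could start as a simple weighted sum of the signals listed above. The weights and caps below are arbitrary placeholders; a real scheme would have to be calibrated against expert judgement.]

```python
# Toy study-quality score combining journal impact factor, author h-index,
# and non-self citation count. All weights are invented for illustration.
def study_score(impact_factor, author_h_index, citations, self_citations):
    non_self = max(citations - self_citations, 0)
    return (0.4 * min(impact_factor / 10.0, 1.0)     # cap journal influence
            + 0.3 * min(author_h_index / 50.0, 1.0)  # cap author influence
            + 0.3 * min(non_self / 100.0, 1.0))      # cap citation influence
```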
>
> > As you can see, each of these is meaty, and all of them need to be
> > taken care of to get good results :) If anyone is interested in
> > getting some serious natural language processing/data mining/machine
> > learning practice, I'd love to involve you. There's no reason I
> > should have all the fun!
>
> http://mail.python.org/pipermail/trizpug/2012-August/001920.html
> > I'm still in the planning stages for most of the stuff; I have the
> > PubMed extraction code pretty well nailed, and I have a solid
> > outline for the article disqualification (create a feature vector
> > out of topic and abstract bigrams, MeSH subject headings and
> > journal, use an SVM discriminator and manually generate a ROC curve
> > to determine the cutoff score) but I'm still very up in the air
> > regarding NL extraction of things like sample size, significance,
> > etc. If you'd like to learn more I would of course be happy to go
> > over my thoughts on the matter and we can play around with some
> > code.
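[Editorial illustration: the outline above maps fairly directly onto scikit-learn. This sketch uses toy data, appends MeSH headings and journal as extra tokens so one bag-of-n-grams vectoriser covers all three feature sources, and uses LinearSVC; all of those choices are assumptions, not Nathan's actual design.]

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_curve
from sklearn.svm import LinearSVC

# Toy corpus: abstract text with MeSH headings and journal appended as
# synthetic tokens (MESH_*, JRNL_*), so one vectoriser covers everything.
docs = [
    "randomized trial of herbal extract MESH_Phytotherapy JRNL_PlantaMedica",
    "double blind placebo study MESH_ClinicalTrial JRNL_Lancet",
    "review of folklore remedies MESH_History JRNL_Ethnobotany",
    "survey of traditional beliefs MESH_History JRNL_Folklore",
]
labels = [1, 1, 0, 0]  # 1 = applicable, 0 = not

vec = CountVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
X = vec.fit_transform(docs)

clf = LinearSVC()
clf.fit(X, labels)

# The SVM decision scores feed the ROC curve; a cutoff score would be
# picked from (fpr, tpr, thresholds) rather than using the default of 0.
scores = clf.decision_function(X)
fpr, tpr, thresholds = roc_curve(labels, scores)
```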
>
> _______________________________________________
> open-science mailing list
> open-science at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/open-science
>
>
> _______________________________________________
> open-science-dev mailing list
> open-science-dev at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/open-science-dev
>
>


-- 
··························································································································································
http://github.com/teleyinex
http://www.flickr.com/photos/teleyinex
··························································································································································
Please do NOT use proprietary file formats such as DOC and XLS for
exchanging documents; use PDF, HTML, RTF, TXT, CSV or any other format
that does not force the use of a specific vendor's program to read
the information contained in the file.
··························································································································································
