[open-science-dev] Fwd: [open-science] fw: Python NLTK/data mining/machine learning project of public research data, anyone interested?

Sat Aug 18 10:38:56 UTC 2012

Thanks Daniel!

The original email was from Nathan Rice, whom I've cc'd into the thread, so
please reply to him.

Nathan - Tom forwarded your email to the Open Knowledge Foundation lists.
We have some people on board who are interested in text and data mining, so
hopefully they'll be able to offer some advice/assistance! We run periodic
hack days, mostly in the UK, but there is one coming up in Helsinki on
18 September if you'd be interested in doing a remote demo. At the last one
there was some work on content mining phylogenetic trees:
http://rossmounce.co.uk/2012/07/17/content-mining-for-phylogenetic-data/

Jenny

On Fri, Aug 17, 2012 at 1:39 PM, Daniel Lombraña González <
teleyinex at gmail.com> wrote:

> Hi,
>
> I think this project could be interesting for PyBossa in the sense that
> some data-mining and validation could be done by humans :-) I can give Tom
> more details if PyBossa is helpful :-)
>
> Cheers,
>
> Daniel
>
> On Fri, Aug 17, 2012 at 2:09 PM, Jenny Molloy <jcmcoppice12 at gmail.com> wrote:
>
>> Hi All
>>
>> Apologies for cross-posting but this came out on open-science and I
>> thought it might be of interest to some of you as well
>>
>> Jenny
>>
>> ---------- Forwarded message ----------
>> From: Tom Roche <Tom_Roche at pobox.com>
>> Date: Fri, Aug 17, 2012 at 11:15 AM
>> Subject: [open-science] fw: Python NLTK/data mining/machine learning
>> project of public research data, anyone interested?
>> To: open-science at lists.okfn.org
>>
>>
>>
>> Dunno if the following is OT for this group, but thought this thread
>> from the local PUG might be of interest. (Note I don't know the
>> author personally; reply to him, not me.)
>>
>> http://mail.python.org/pipermail/trizpug/2012-August/001919.html
>> > Nathan Rice nathan.alexander.rice at gmail.com
>> > Thu Aug 16 20:31:00 CEST 2012
>>
>> > Hi All,
>>
>> > Normally, my projects are pretty boring, and I prefer to endure the
>> > suffering in solitary silence. As luck would have it though, I
>> > actually have an interesting project on my plate currently, and I
>> > think it is cool enough that I wanted to give other people the
>> > opportunity to stick their noses in and provide input or play with
>> > some code.
>>
>> > I am currently involved in compiling a database of medical data
>> > (published clinical or pre-clinical trials) surrounding ethno- and
>> > alternative- medicinal treatments, for semi-automated meta analysis
>> > and treatment guidance. In order for this to work, a lot of
>> > technical challenges have to be overcome:
>>
>> > My initial tally from PubMed puts the number of articles at over
>> > 70,000; based on visual inspection, many of these are not actually
>> > applicable, but there are limited filtering options via the Entrez
>> > web API. Machine learning techniques would probably be very helpful
>> > at scoring articles for applicability, and ignoring ones that are
>> > clearly inapplicable.
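
For readers unfamiliar with the Entrez web API mentioned above, here is a minimal sketch of how a paged PubMed search request could be built against the NCBI E-utilities ESearch endpoint. The search term is purely hypothetical (the thread does not give the actual query), and the helper name build_esearch_url is an illustration, not Nathan's extraction code. No network request is made; the sketch only constructs the URL.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retstart=0, retmax=100):
    """Return an ESearch URL for one page of PubMed IDs matching `term`."""
    params = {
        "db": "pubmed",
        "term": term,
        "retstart": retstart,  # offset of the first record to return
        "retmax": retmax,      # number of records per page
        "retmode": "json",
    }
    return ESEARCH_URL + "?" + urlencode(params)


# Hypothetical query; the real search terms are not stated in the thread.
EXAMPLE_TERM = '"herbal medicine"[MeSH Terms] AND clinical trial[Publication Type]'
url = build_esearch_url(EXAMPLE_TERM, retstart=0, retmax=50)
print(url)
```

Anything finer-grained than what the query syntax supports would have to happen after download, which is where the scoring step described above comes in.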
>>
>> > In order to perform meta-analysis and treatment guidance, the
>> > article needs to be mined for treatment, condition, circumstances of
>> > treatment and condition, and whether it was successful or not (with
>> > some p value and sample size). Most of this is not available as
>> > standard metadata for the studies, and must be mined from the text.
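
As a rough illustration of the text-mining step above, the following sketch pulls reported p-values and sample sizes out of abstract text with regular expressions. The patterns and the function name extract_stats are assumptions made for this example; real abstracts report these figures in many more ways, so this is a starting point rather than a working extractor.

```python
import re

# Matches e.g. "p < 0.05", "P=0.001", "p <= .01".
P_VALUE_RE = re.compile(r"\bp\s*(<=|>=|<|>|=|≤|≥)\s*(0?\.\d+)", re.IGNORECASE)
# Matches e.g. "n = 120", "N=45".
SAMPLE_N_RE = re.compile(r"\bn\s*=\s*(\d[\d,]*)", re.IGNORECASE)
# Matches e.g. "120 patients", "1,204 participants".
SAMPLE_WORD_RE = re.compile(
    r"\b(\d[\d,]*)\s+(?:patients|participants|subjects|volunteers)\b",
    re.IGNORECASE,
)


def _to_int(digits):
    """Convert a matched number such as '1,204' to an int."""
    return int(digits.replace(",", ""))


def extract_stats(text):
    """Return the p-values and sample sizes mentioned in `text`."""
    p_values = [(op, float(val)) for op, val in P_VALUE_RE.findall(text)]
    sizes = [_to_int(m) for m in SAMPLE_N_RE.findall(text)]
    sizes += [_to_int(m) for m in SAMPLE_WORD_RE.findall(text)]
    return {"p_values": p_values, "sample_sizes": sizes}


# Made-up sentence for illustration only.
sample = ("A total of 120 patients were randomized (n = 60 per arm). "
          "Symptoms improved significantly (p < 0.05).")
print(extract_stats(sample))
```

Deciding which of several extracted numbers is the study's actual sample size, or which p-value belongs to the primary outcome, is not addressed here and would need more context than a regular expression can see.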
>>
>> > In addition, not all studies are equal. Methodological errors, lack
>> > of reproducibility, and so forth can all render a study meaningless.
>> > Thus, studies must have a scoring mechanism so you can avoid
>> > tainting meta-analyses with biased data. This scoring mechanism will
>> > probably include the impact factor of the journal, the g/h-index of
>> > the authors, the number of (non self) citations, etc.
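
The h-index and g-index mentioned above have standard definitions that are easy to compute from a list of per-paper citation counts. The sketch below shows both. How these would be weighted against impact factor and citation counts in the final study score is not specified in the thread, so no combined score is attempted, and the citation counts used are invented.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def g_index(citations):
    """Largest g such that the top g papers total at least g*g citations.

    This variant caps g at the number of papers supplied.
    """
    ranked = sorted(citations, reverse=True)
    total = 0
    g = 0
    for rank, count in enumerate(ranked, start=1):
        total += count
        if total >= rank * rank:
            g = rank
    return g


# Invented citation counts for one author, for illustration only.
author_citations = [10, 8, 5, 4, 3]
print(h_index(author_citations), g_index(author_citations))
```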
>>
>> > As you can see, each of these is meaty, and all of them need to be
>> > taken care of to get good results :) If anyone is interested in
>> > getting some serious natural language processing/data mining/machine
>> > learning practice, I'd love to involve you. There's no reason I
>> > should have all the fun!
>>
>> http://mail.python.org/pipermail/trizpug/2012-August/001920.html
>> > I'm still in the planning stages for most of the stuff; I have the
>> > pubmed extraction code pretty well nailed, and I have a solid
>> > outline for the article disqualification (create a feature vector
>> > out of topic and abstract bigrams, MeSH subject headings and
>> > journal, use an SVM discriminator and manually generate a ROC curve
>> > to determine the cutoff score) but I'm still very up in the air
>> > regarding NL extraction of things like sample size, significance,
>> > etc. If you'd like to learn more I would of course be happy to go
>> > over my thoughts on the matter and we can play around with some
>> > code.
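
To make the disqualification outline above a little more concrete, here is a small sketch of two of its pieces: counting bigrams in a title or abstract, and building a ROC curve by hand from classifier scores to pick a cutoff. The SVM training itself is left out; the scores below are invented stand-ins for SVM decision values. Choosing the cutoff by maximizing TPR minus FPR (Youden's J) is an assumption of this example, since the thread does not say how the cutoff would be read off the curve.

```python
from collections import Counter


def bigram_counts(text):
    """Count adjacent lowercase token pairs in `text`."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def roc_points(scores, labels):
    """Return (threshold, fpr, tpr) tuples, one per distinct score.

    An article is predicted applicable when its score >= threshold.
    labels: 1 = applicable, 0 = not applicable.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one example of each class")
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def pick_cutoff(points):
    """Choose the threshold that maximizes TPR - FPR (an assumed criterion)."""
    return max(points, key=lambda p: p[2] - p[1])[0]


# Invented scores and hand-assigned labels, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 1, 0, 0]
cutoff = pick_cutoff(roc_points(scores, labels))
print(bigram_counts("Ginger extract reduces nausea"))
print(cutoff)
```

In the full pipeline the bigram counts would be combined with MeSH headings and journal into one feature vector per article, and the labels would come from a manually reviewed sample.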
>>
>
>
> --
>
> ··························································································································································
> http://github.com/teleyinex
> http://www.flickr.com/photos/teleyinex
>
> ··························································································································································
> Please do NOT use proprietary file formats such as DOC and XLS for
> exchanging documents; use PDF, HTML, RTF, TXT, CSV or any other format
> that does not require a program from a particular vendor to
> process the information it contains.
>
> ··························································································································································
>