[Open-access] [open-science-dev] Fwd: [open-science] fw: Python NLTK/data mining/machine learning project of public research data, anyone interested?

Peter Murray-Rust pm286 at cam.ac.uk
Sat Aug 18 15:50:18 UTC 2012


On Sat, Aug 18, 2012 at 11:38 AM, Jenny Molloy <jcmcoppice12 at gmail.com> wrote:

> Thanks Daniel!
>
> The original email was from Nathan Rice who I've cc'd into the thread, if
> you could reply to him.
>
> Nathan - Tom forwarded your email to the Open Knowledge Foundation lists,
> we have some people interested in text and data mining on board so
> hopefully they'll be able to offer some advice/assistance! We run periodic
> hack days, in the UK mostly, but one coming up in Helsinki on 18 September
> if you'd be interested to do a remote demo, at the last one there was some
> work on content mining phylogenetic trees
> http://rossmounce.co.uk/2012/07/17/content-mining-for-phylogenetic-data/
>
>
This is really exciting. Be aware, Nathan, that if you come up with a good
idea (which you have) you are likely to end up making it work!

I am copying the three lists because this is still very general but we
should condense it later to more specialist topics.

There is a real need for coordinating content-mining. There are a number
of threads:
* tools to systematically retrieve "documents", such as PubCrawler, which
crawls publishers' sites. At some stage we shall press the button and
collect a large amount of metadata. This distributes very nicely, with one
scraper per publisher (about 100 in science, according to Ross Mounce). We
have scrapers for some of the major publishers but will need more for the
others. The first one required is BMC (which is ultra-legal, as it's CC-BY).
* protocols for content-mining - legal and technical
* collation of experiences and groups in content-mining. This is really
important. I've been hacking in this area for 3 months and I am not
connected with other efforts, nor they with me (apart from about 3
bio-ones). We need to bring content-miners together. Since a major barrier
is publisher FUD, let's give each other confidence. Also, the OKF's Datahub
is a great place to put Open results.
* collation of technologies. There are at least:
  - scrapers
  - PDF hacking (I have done a lot of this, but we need more: open font
info, Postscript reconstruction)
  - bit-map hacking for diagram analysis (I am excited by the quality of
modern scientific diagrams and think there can be a lot of automation)
  - shallow natural language processing and NLP resources (e.g. vocabs,
character sets)
  - classification techniques (e.g. Lucene/Solr) for text and diagrams
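To make the scraper idea concrete, here is a minimal sketch (my own illustrative example, not the PubCrawler code): many open-access publishers, BMC included, embed Highwire-style "citation_*" meta tags in article pages, so a per-publisher scraper can start by just collecting those.

```python
# Minimal per-publisher metadata scraper sketch (hypothetical example).
# It collects <meta name="citation_*" content="..."> tags, which many
# publishers embed in article landing pages.
from html.parser import HTMLParser

class CitationMetaParser(HTMLParser):
    """Collects <meta name="citation_*" content="..."> tags into a dict."""
    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name", "")
        if name.startswith("citation_"):
            self.metadata[name] = attrs.get("content", "")

# A made-up page fragment standing in for a fetched article page.
sample_page = """
<html><head>
<meta name="citation_title" content="An example article">
<meta name="citation_journal_title" content="BMC Bioinformatics">
<meta name="citation_doi" content="10.1186/xxxx">
</head><body>...</body></html>
"""

parser = CitationMetaParser()
parser.feed(sample_page)
print(parser.metadata["citation_title"])   # An example article
```

In practice each publisher needs its own fetching logic and politeness (rate limits, robots.txt), but the extraction side can often be this small.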

I think if we harness all of these we will achieve a step change in the
automated extraction of scientific information from "the literature".

And one-by-one the publishers will come to us because they will need us.

Timescale - about 1 year to have something major to report - about 5 years
to change the way scientific information is managed.

> Jenny
>
> On Fri, Aug 17, 2012 at 1:39 PM, Daniel Lombraña González <
> teleyinex at gmail.com> wrote:
>
>> Hi,
>>
>> I think this project could be interesting for PyBossa in the sense that
>> some data-mining and validation could be done by humans :-) I can give Tom
>> more details if PyBossa is helpful :-)
>>
>> Cheers,
>>
>> Daniel
>>
>> On Fri, Aug 17, 2012 at 2:09 PM, Jenny Molloy <jcmcoppice12 at gmail.com> wrote:
>>
>>> Hi All
>>>
>>> Apologies for cross-posting but this came out on open-science and I
>>> thought it might be of interest to some of you as well
>>>
>>> Jenny
>>>
>>> ---------- Forwarded message ----------
>>> From: Tom Roche <Tom_Roche at pobox.com>
>>> Date: Fri, Aug 17, 2012 at 11:15 AM
>>> Subject: [open-science] fw: Python NLTK/data mining/machine learning
>>> project of public research data, anyone interested?
>>> To: open-science at lists.okfn.org
>>>
>>>
>>>
>>> Dunno if the following is OT for this group, but thought this thread
>>> from the local PUG might be of interest. (Note I don't know the
>>> author personally; reply to him, not me.)
>>>
>>> http://mail.python.org/pipermail/trizpug/2012-August/001919.html
>>> > Nathan Rice nathan.alexander.rice at gmail.com
>>> > Thu Aug 16 20:31:00 CEST 2012
>>>
>>> > Hi All,
>>>
>>> > Normally, my projects are pretty boring, and I prefer to endure the
>>> > suffering in solitary silence. As luck would have it though, I
>>> > actually have an interesting project on my plate currently, and I
>>> > think it is cool enough that I wanted to give other people the
>>> > opportunity to stick their noses in and provide input or play with
>>> > some code.
>>>
>>> > I am currently involved in compiling a database of medical data
>>> > (published clinical or pre-clinical trials) surrounding ethno- and
>>> > alternative- medicinal treatments, for semi-automated meta analysis
>>> > and treatment guidance. In order for this to work, a lot of
>>> > technical challenges have to be overcome:
>>>
>>> > My initial tally from PubMed puts the number of articles at over
>>> > 70,000; based on visual inspection, many of these are not actually
>>> > applicable, but there are limited filtering options via the Entrez
>>> > web API. Machine learning techniques would probably be very helpful
>>> > at scoring articles for applicability, and ignoring ones that are
>>> > clearly inapplicable.
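For concreteness, a retrieval like the one Nathan describes goes through the NCBI Entrez E-utilities; a sketch of building an esearch query (the search term and field tags below are my own illustrative assumptions, not his actual query — in practice Biopython's Bio.Entrez wraps this same endpoint):

```python
# Sketch of constructing a PubMed esearch request against the NCBI
# Entrez E-utilities. The query term here is a placeholder assumption.
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(term, retmax=100):
    """Return an E-utilities esearch URL for a PubMed query."""
    params = {"db": "pubmed", "term": term, "retmax": retmax,
              "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)

url = esearch_url('"plant extracts"[MeSH] AND clinical trial[pt]')
print(url)
```

The limited server-side filtering is exactly why the applicability scoring has to happen client-side, on the returned records.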
>>>
>>> > In order to perform meta-analysis and treatment guidance, the
>>> > article needs to be mined for treatment, condition, circumstances of
>>> > treatment and condition, and whether it was successful or not (with
>>> > some p value and sample size). Most of this is not available as
>>> > standard metadata for the studies, and must be mined from the text.
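Some of the quantitative bits (sample size, significance) can be pulled out with simple patterns before any deeper NLP; a rough sketch, with the regexes and example sentence being my own illustrative assumptions:

```python
# Rough sketch of pattern-based extraction of sample size and p-value
# from abstract text. Real abstracts need far more robust NLP; this
# only handles the simplest "n = 120" / "p < 0.05" phrasings.
import re

P_VALUE = re.compile(r"[pP]\s*[<=]\s*(0?\.\d+)")
SAMPLE_SIZE = re.compile(r"\b[nN]\s*=\s*(\d+)")

def extract_stats(text):
    """Return (sample_size, p_value); None for anything not found."""
    n = SAMPLE_SIZE.search(text)
    p = P_VALUE.search(text)
    return (int(n.group(1)) if n else None,
            float(p.group(1)) if p else None)

abstract = ("A randomised trial (n = 120) found the treatment reduced "
            "symptom scores relative to placebo (p < 0.05).")
print(extract_stats(abstract))  # (120, 0.05)
```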
>>>
>>> > In addition, not all studies are equal. Methodological errors, lack
>>> > of reproducibility, and so forth can all render a study meaningless.
>>> > Thus, studies must have a scoring mechanism so you can avoid
>>> > tainting meta-analyses with biased data. This scoring mechanism will
>>> > probably include the impact factor of the journal, the g/h-index of
>>> > the authors, the number of (non-self) citations, etc.
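One way such a composite score could look (the weights and the 0-1 normalisations below are placeholder assumptions on my part; choosing them well is the hard and contested part):

```python
# Sketch of a composite study-quality score combining the crude signals
# mentioned above. Weights and saturation points are placeholders.
def quality_score(impact_factor, author_h_index, non_self_citations,
                  w_if=0.4, w_h=0.3, w_cite=0.3):
    """Combine crude quality signals into a single 0-1 score."""
    # Saturating normalisations so no single signal dominates.
    f_if = min(impact_factor / 10.0, 1.0)
    f_h = min(author_h_index / 50.0, 1.0)
    f_cite = min(non_self_citations / 100.0, 1.0)
    return w_if * f_if + w_h * f_h + w_cite * f_cite

print(round(quality_score(impact_factor=5.0, author_h_index=25,
                          non_self_citations=40), 3))  # 0.47
```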
>>>
>>> > As you can see, each of these is meaty, and all of them need to be
>>> > taken care of to get good results :) If anyone is interested in
>>> > getting some serious natural language processing/data mining/machine
>>> > learning practice, I'd love to involve you. There's no reason I
>>> > should have all the fun!
>>>
>>> http://mail.python.org/pipermail/trizpug/2012-August/001920.html
>>> > I'm still in the planning stages for most of the stuff; I have the
>>> > PubMed extraction code pretty well nailed, and I have a solid
>>> > outline for the article disqualification (create a feature vector
>>> > out of topic and abstract bigrams, MeSH subject headings and
>>> > journal, use an SVM discriminator and manually generate a ROC curve
>>> > to determine the cutoff score), but I'm still very up in the air
>>> > regarding NL extraction of things like sample size, significance,
>>> > etc. If you'd like to learn more I would of course be happy to go
>>> > over my thoughts on the matter and we can play around with some
>>> > code.
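The "manually generate a ROC curve to determine the cutoff" step can be sketched in a few lines: given classifier scores and true labels (the toy data below is purely illustrative), sweep each score as a threshold and pick the one maximising TPR - FPR (Youden's J statistic).

```python
# Sketch of manually building a ROC curve from scores and labels, then
# choosing a cutoff. Toy data only; real scores would come from an SVM.
def roc_points(scores, labels):
    """Yield (threshold, tpr, fpr) for each candidate cutoff."""
    pos = sum(labels)
    neg = len(labels) - pos
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        yield t, tp / pos, fp / neg

def best_cutoff(scores, labels):
    """Threshold maximising Youden's J = TPR - FPR."""
    return max(roc_points(scores, labels), key=lambda p: p[1] - p[2])[0]

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]   # toy decision scores
labels = [1,   1,   1,   0,   1,   0]     # 1 = applicable article
print(best_cutoff(scores, labels))        # 0.7
```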
>>>
>>> _______________________________________________
>>> open-science mailing list
>>> open-science at lists.okfn.org
>>> http://lists.okfn.org/mailman/listinfo/open-science
>>>
>>>
>>>
>>
>>
>> --
>> http://github.com/teleyinex
>> http://www.flickr.com/photos/teleyinex
>>
>> Please do NOT use proprietary file formats such as DOC and XLS for
>> exchanging documents; use PDF, HTML, RTF, TXT, CSV or any other format
>> that does not force the recipient to use a particular vendor's
>> program in order to read the information they contain.
>>
>
>
> _______________________________________________
> open-access mailing list
> open-access at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/open-access
>
>


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069