[open-science] Content Mining Workshop

Peter Murray-Rust pm286 at cam.ac.uk
Sat Nov 3 22:21:47 UTC 2012


More thoughts.

No one else is coordinating resources for this, but I feel there is a
large hidden collection of people who want to do content-mining. They need
a community, and we are well placed to provide one. The objective drivers are:
* introductions to what is technically possible
* overview of protocols
* lists of existing software (crawlers, parsers, semantics, machine
learning...)
* lists of existing online resources (mining sites, name servers,
geotaggers, etc.)
* how to link results

There are some out-of-the-box solutions, but many require research and
customisation for each source / domain. It may be necessary to train
Natural Language Processing and Machine Learning tools. This can benefit
strongly from a community - there are many opportunities for annotation, and
we should be exploring annotation tools (including the OKF's).

Also this overlaps with the "lefthand end" of the School of Data and we
should be exchanging information on a regular basis.

For myself, I am currently developing a PDF2SVG toolkit (under AMI2). This
has the goal of completely transforming all PDF to a simpler and more
uniform representation. I am blogging about this semi-regularly at
http://blogs.ch.cam.ac.uk/pmr. Again, there is great scope for involvement
in alpha testing, and this could be a useful way for people to get into the
PDF area.

P.


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069