[open-science] Content Mining Workshop

Sun Nov 4 02:03:20 UTC 2012

It would be nice to have a list of research questions/projects that involve or could benefit from content mining. A heavy-duty example is below, but I am sure there are several research topics that could benefit from commonly available tools and techniques, and perhaps from crowd/community participation as well.

Here is an example of a content mining project some of my colleagues are working on:

http://earthcube.ning.com/group/dark-geodata-concept-award/forum/topics/geodeepdive-bringing-dark-data-to-light

More info on the project at http://hazy.cs.wisc.edu/hazy/geodeepdive/

On Nov 3, 2012, at 3:21 PM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:

> More thoughts.
> 
> No one else is doing this (coordinating resources), but I feel there is a
> large hidden collection of people who want to do content-mining. They need
> a community and we are well placed to do this. The objective drivers are:
> * introductions to what is technically possible
> * overview of protocols
> * lists of existing software (crawlers, parsers, semantics, machine
> learning...)
> * lists of existing online resources (mining sites, name servers,
> geotaggers, etc.)
> * how to link results
> 
> There are some out-of-the-box solutions but many require research and
> customisation for each source / domain. It may be necessary to train
> Natural Language Processors and Machine Learning. This can strongly benefit
> from community - there are many opportunities for annotation and we should
> be exploring annotation tools (including the OKF's).
> 
> Also this overlaps with the "lefthand end" of the School of Data and we
> should be exchanging information on a regular basis.
> 
> For myself I am currently developing a PDF2SVG toolkit (under AMI2). This
> has the goal of completely transforming all PDF to a simpler and more
> uniform representation. I am blogging this semi-regularly at
> http://blogs.ch.cam.ac.uk/pmr. Again there is great scope for involvement
> to alpha test and this could be a useful way of people getting into the PDF
> area.
> 
> P.
> 

--
Puneet Kishor
Science and Data Policy at Creative Commons