[open-science] Content Mining Workshop

Sun Nov 4 21:15:43 UTC 2012

I had a request today asking if there was anywhere simple that people could
start reading about content-mining. I think the answer is no. I don't even
think there are good community sites for specialised text-mining and
data-mining. So this makes it even more important to create some.

The important thing to realise is that many aspects are not deeply
technical. A lot of the work is collating fragmented knowledge. Here's one
task where I would welcome input from anyone.

So one of our first tasks in building AMI2 (intelligent reader of STM
literature) is to find a recent paper from each publisher that is "Open
Access" - i.e. labelled in some way as Open. Anyone can help in this.
That's probably about 100 papers. This will give us a corpus that we can
develop and test on, and in that way we can develop a per-publisher
technology. Then we summarize the information. It would be useful to have
something like:

Publisher name:
Provides HTML?
HTML figure format (PNG, JPG, etc...)
HTML table format (PNG, HTML, XL, CSV ...)
HTML References
HTML Bibliographic Metadata

BIBTEX metadata

PDF?
PDF fonts used (this will come out of AMI)
PDF non-Unicode fonts (this will come out of AMI)
PDF has Vector Graphics (also out of AMI)

How are math equations provided (GIF, TeX, MathematicalPI, ...)

Supplemental data?
Supp outside firewall?
Supp data formats (CIF, XL, Word, etc)

Licence and/or restrictions for Open publications

Crawling URLs and strategy (i.e. how do we find papers).
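
To make the list above easier to collate, it could be kept as one
machine-readable record per publisher. What follows is only a minimal
Python sketch, not part of AMI2: the class name PublisherProfile, the field
names, the helper unanswered() and the example publisher are all
illustrative assumptions. None (or an empty list) means "nobody has checked
this yet".

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional
import json


@dataclass
class PublisherProfile:
    """One record per publisher, mirroring the questions in the list above.

    All names here are hypothetical. None or [] means "not yet investigated".
    """
    publisher_name: str
    provides_html: Optional[bool] = None
    html_figure_formats: List[str] = field(default_factory=list)  # e.g. PNG, JPG
    html_table_formats: List[str] = field(default_factory=list)   # e.g. PNG, HTML, CSV
    html_references: Optional[bool] = None
    html_bibliographic_metadata: Optional[bool] = None
    bibtex_metadata: Optional[bool] = None
    provides_pdf: Optional[bool] = None
    pdf_fonts: List[str] = field(default_factory=list)
    pdf_non_unicode_fonts: List[str] = field(default_factory=list)
    pdf_has_vector_graphics: Optional[bool] = None
    math_formats: List[str] = field(default_factory=list)         # e.g. GIF, TeX
    supplemental_data: Optional[bool] = None
    supp_outside_firewall: Optional[bool] = None
    supp_data_formats: List[str] = field(default_factory=list)    # e.g. CIF, Word
    licence: Optional[str] = None
    crawl_urls: List[str] = field(default_factory=list)
    crawl_strategy: Optional[str] = None


def unanswered(profile: PublisherProfile) -> List[str]:
    """Return the names of the fields that nobody has filled in yet."""
    return [name for name, value in asdict(profile).items()
            if value is None or value == []]


# A made-up, partially completed record (not a real publisher).
example = PublisherProfile(
    publisher_name="Example Publisher (hypothetical)",
    provides_html=True,
    html_figure_formats=["PNG"],
    provides_pdf=True,
)

# Serialise for sharing, and list what still needs exploring.
print(json.dumps(asdict(example), indent=2))
print("Still to find out:", ", ".join(unanswered(example)))
```

Keeping unknown answers explicit makes it easy to see which publishers
still need someone to do the exploration.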

If we summarize this it makes it much easier for anyone wanting to find
information in a lot of journals. For example if we are looking for
geological terms we may want to crawl Nature, Science, Amer. Mineral, etc.
It's very useful if people have already done the exploration and listed it.

P.

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069