[open-science] Content Mining Workshop

Jenny Molloy jcmcoppice12 at gmail.com
Tue Nov 6 13:32:35 UTC 2012


Thanks for your responses, everyone. I think it's clear we've identified a
need.
I'm in the process of planning the next Open Science Hackday and I think it
would be good to focus on content mining to try and get some of these
resources off the ground, particularly as there are non-technical aspects
to which many people can contribute. So in terms of next steps:

1) Organise a content-mining-themed Open Science Hackday, to include a sprint
on content for community sites [early Dec]

Topics for inclusion:
* introductions to what is technically possible
* overview of protocols
* lists of existing software (crawlers, parsers, semantics, machine
learning...)
* lists of existing online resources (mining sites, name servers,
geotaggers, etc.)
* how to link results
* a list of research questions/projects that involve or could benefit from
content mining

2) Crowdsource AMI2 development corpus - please contribute!
(Peter - I've copied your fields into a googledoc with the publishers from
Ross' open access spreadsheet)

https://docs.google.com/spreadsheet/ccc?key=0AtV3tIqIu0UZdFI5U3hRZVl0S05ZM0hNS3NYYTdFYnc
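
For anyone who would rather work with the survey programmatically, here is a
minimal sketch of how one row of it could be held as a record. It only mirrors
the fields Peter lists in the quoted message below; the class name, the field
names and the unfilled_fields helper are all hypothetical and are not part of
AMI2 or of the spreadsheet itself.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class PublisherProfile:
    """One survey entry per publisher. All names here are illustrative."""
    publisher_name: str
    # None means "nobody has checked this yet"; True/False is a recorded answer.
    provides_html: Optional[bool] = None
    html_figure_formats: List[str] = field(default_factory=list)   # e.g. PNG, JPG
    html_table_formats: List[str] = field(default_factory=list)    # e.g. PNG, HTML, CSV
    html_references: Optional[bool] = None
    html_bibliographic_metadata: Optional[bool] = None
    bibtex_metadata: Optional[bool] = None
    provides_pdf: Optional[bool] = None
    pdf_fonts: List[str] = field(default_factory=list)
    pdf_non_unicode_fonts: List[str] = field(default_factory=list)
    pdf_has_vector_graphics: Optional[bool] = None
    math_formats: List[str] = field(default_factory=list)          # e.g. GIF, TeX
    supplemental_data: Optional[bool] = None
    supp_outside_firewall: Optional[bool] = None
    supp_data_formats: List[str] = field(default_factory=list)     # e.g. CIF, Word
    licence: str = ""
    crawl_urls: List[str] = field(default_factory=list)
    crawl_strategy: str = ""


def unfilled_fields(profile: PublisherProfile) -> List[str]:
    """Return the names of fields that are still empty for this publisher."""
    # A recorded False is a real answer, so only None, "" and [] count as empty.
    return [name for name, value in asdict(profile).items()
            if value is None or value == "" or value == []]


if __name__ == "__main__":
    entry = PublisherProfile("Example Publisher", provides_html=True,
                             html_figure_formats=["PNG"])
    print(unfilled_fields(entry))
```

Keeping "not yet checked" (None) separate from "checked, answer is no" (False)
makes it easy to see which cells still need a volunteer.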

3) Organise content mining workshop and make content available online
[planning from Jan, run Mar/Apr]

I hope this sounds good. I'll let you know as soon as I've sorted a date
for the hackday, and I'll start setting up some documents for people to
contribute to remotely or in their own time.

Jenny



On Sun, Nov 4, 2012 at 9:15 PM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:

> I had a request today asking if there was anywhere simple that people
> could start reading about content-mining. I think the answer is no. I don't
> even think there are good community sites for specialised text-mining and
> data-mining. So this makes it even more important to create some.
>
> The important thing to realise is that many aspects are not deeply
> technical. A lot requires us to collate fragmented knowledge. Here's one
> where I would welcome input from anyone.
>
> One of our first tasks in building AMI2 (an intelligent reader of STM
> literature) is to find a recent paper from each publisher that is "Open
> Access", i.e. labelled in some way as Open. Anyone can help with this. This
> will give us a corpus of papers that we can develop and test on. In that way
> we can develop a per-publisher technology. That's probably about 100
> publishers. Then we summarize the information. It would be useful to have
> something like:
>
> Publisher name:
> Provides HTML?
> HTML figure format (PNG, JPG, etc...)
> HTML table format (PNG, HTML, XL, CSV ...)
> HTML References
> HTML Bibliographic Metadata
>
> BIBTEX metadata
>
> PDF?
> PDF fonts used (this will come out of AMI)
> PDF non-Unicode fonts (this will come out of AMI)
> PDF has Vector Graphics (also out of AMI)
>
> How are math equations provided (GIF, TeX, MathematicalPI, ...)
>
> Supplemental data?
> Supp outside firewall?
> Supp data formats (CIF, XL, Word, etc)
>
> Licence and/or restrictions for Open publications
>
> Crawling URLs and strategy (i.e. how do we find papers).
>
> If we summarize this, it becomes much easier for anyone wanting to find
> information in a lot of journals. For example, if we are looking for
> geological terms we may want to crawl Nature, Science, Amer. Mineral, etc.
> It's very useful if people have already done the exploration and listed it.
>
> P.
>
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
>
> _______________________________________________
> open-science mailing list
> open-science at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/open-science
> Unsubscribe: http://lists.okfn.org/mailman/options/open-science
>
>