[open-science] Content Mining Workshop

Jack Park jackpark at gmail.com
Tue Nov 6 14:30:12 UTC 2012


Hi Jenny,

May I suggest that you run such an inquiry in DebateGraph [1] rather
than email? It's a structured conversation mechanism for organizing
facts, opinions, links, and so forth. People can subscribe to get
emails of changes to the conversation.

Cheers
Jack
[1] http://debategraph.org/

On Tue, Nov 6, 2012 at 5:32 AM, Jenny Molloy <jcmcoppice12 at gmail.com> wrote:
> Thanks for your responses everyone, I think it's clear we've identified a
> need.
> I'm in the process of planning the next Open Science Hackday and I think it
> would be good to focus on content mining to try and get some of these
> resources off the ground, particularly as there are non-technical aspects to
> which many people can contribute. So in terms of next steps:
>
> 1) Organise content mining themed open science hackday, to include a sprint
> on content for community sites [early Dec]
>
> Topics for inclusion:
> * introductions to what is technically possible
> * overview of protocols
> * lists of existing software (crawlers, parsers, semantics, machine
> learning...)
> * lists of existing online resources (mining sites, name servers,
> geotaggers, etc.)
> * how to link results
> * a list of research questions/projects that involve or could benefit from
> content mining
>
> 2) Crowdsource AMI2 development corpus - please contribute!
> (Peter - I've copied your fields into a googledoc with the publishers from
> Ross' open access spreadsheet)
>
> https://docs.google.com/spreadsheet/ccc?key=0AtV3tIqIu0UZdFI5U3hRZVl0S05ZM0hNS3NYYTdFYnc
>
> 3) Organise content mining workshop and make content available online
> [planning from Jan, run Mar/Apr]
>
> I hope this sounds good. I'll let you know as soon as I've sorted a date for
> the hackday, and I'll start booting up some documents for people to
> contribute to remotely or in their own time.
>
> Jenny
>
>
>
> On Sun, Nov 4, 2012 at 9:15 PM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:
>>
>> I had a request today asking if there was anywhere simple that people
>> could start reading about content-mining. I think the answer is no. I don't
>> even think there are good community sites for specialised text-mining and
>> data-mining. So this makes it even more important to create some.
>>
>> The important thing to realise is that many aspects are not deeply
>> technical. A lot requires us to collate fragmented knowledge. Here's one
>> where I would welcome input from anyone.
>>
>>  So one of our first tasks in building AMI2 (intelligent reader of STM
>> literature) is to find a recent paper from each publisher that is "Open
>> Access" - i.e. labelled in some way as Open. Anyone can help in this. This
>> will give us a corpus of papers that we can develop and test on. In that way
>> we can develop a per-publisher technology. That's probably about 100. Then
>> we summarize the information. It would be useful to have something like:
>>
>> Publisher name:
>> Provides HTML?
>> HTML figure format (PNG, JPG, etc...)
>> HTML table format (PNG, HTML, XL, CSV ...)
>> HTML References
>> HTML Bibliographic Metadata
>>
>> BIBTEX metadata
>>
>> PDF?
>> PDF fonts used (this will come out of AMI)
>> PDF non-Unicode fonts (this will come out of AMI)
>> PDF has Vector Graphics (also out of AMI)
>>
>> How are math equations provided (GIF, TeX, MathematicalPI, ...)
>>
>> Supplemental data?
>> Supp outside firewall?
>> Supp data formats (CIF, XL, Word, etc)
>>
>> Licence and/or restrictions for Open publications
>>
>> Crawling URLs and strategy (i.e. how do we find papers).
>>
>> If we summarize this it makes it much easier for anyone wanting to find
>> information in a lot of journals. For example if we are looking for
>> geological terms we may want to crawl Nature, Science, Amer. Mineral, etc.
>> It's very useful if people have already done the exploration and listed it.
>>
>> P.
>>
>>
>> --
>> Peter Murray-Rust
>> Reader in Molecular Informatics
>> Unilever Centre, Dep. Of Chemistry
>> University of Cambridge
>> CB2 1EW, UK
>> +44-1223-763069
>>
>> _______________________________________________
>> open-science mailing list
>> open-science at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/open-science
>> Unsubscribe: http://lists.okfn.org/mailman/options/open-science
>>
>
>
>
