[open-science] Content mining exercise for hands-on workshop

Peter Murray-Rust pm286 at cam.ac.uk
Mon Oct 28 14:22:28 GMT 2013

I'm free and can/will come. To avoid overexposing me, I can create the

I have had a great 2 days with Software Carpentry and feel very confident we
can do something useful. Those whose OS doesn't hack it on the day may need
to work in pairs with someone else.

Here's my idea:
* use BMC as corpus (primarily HTML) and choose bioscience where everyone
can feel comfortable (e.g. species)
* get people to preload simple tools (we'll use wget, grep, etc.). Linux
has these already. Windows will need Cygwin or, better, Enthought Canopy /
Git Bash (GitHub). I don't grok Mac but people managed it. BTW people are
thinking of having a Software Carpentry in Oxford so there could be some
useful contacts there.
* use wget to download several papers
* use grep to extract italic sections with species names in them
Then move on to advanced studies
* Tabula - I am working with these people. It's a nice tool for analyzing
PDF tables
* AMI - Ross and I will provide this. We'll do phylo trees: choose 2-3 which
work and then get people to find others
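
A minimal sketch of the wget/grep steps above, assuming the papers' HTML
marks species names with <i> or <em> tags (not checked against BMC here).
The function names extract_italics and binomials, the file urls.txt and the
directory papers are hypothetical, chosen only for illustration. The
binomial filter is crude: it keeps any italic phrase shaped like
"Genus species", so other capitalised two-word phrases will slip through.

```shell
#!/bin/sh
# Hypothetical helper: print the text of every <i>...</i> or <em>...</em>
# span found in the HTML files given as arguments (or on stdin).
extract_italics() {
    # -o: print only the matching part, one match per line
    # -h: do not prefix matches with file names
    # -i: also match upper-case tags such as <I>
    # -E: extended regular expressions, needed for (i|em)
    grep -ohiE '<(i|em)>[^<]*</(i|em)>' "$@" | sed -e 's/<[^>]*>//g'
}

# Hypothetical helper: keep only phrases shaped like "Genus species"
# and count how often each one occurs, most frequent first.
binomials() {
    grep -E '^[A-Z][a-z]+ [a-z]+$' | sort | uniq -c | sort -rn
}

# Download step (urls.txt is an assumed file with one article URL per line):
#   wget --wait=1 --directory-prefix=papers --input-file=urls.txt
# Extraction step:
#   extract_italics papers/*.html | binomials
```

The --wait=1 option pauses a second between downloads so a roomful of
people does not hammer the publisher's server at once.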

Not sure how long you want. If this is too much it won't be wasted.

Jenny? Can you talk later today about this with just me? Only 10 mins?


On Mon, Oct 28, 2013 at 12:22 PM, Jenny Molloy <jenny.molloy at okfn.org> wrote:

> Thanks Peter!
> It will be 27 Nov (2013 programme:
> http://science.okfn.org/community/local-groups/oxford-open-science/)
> We start at 19:00 so I can stretch a session for 2-2.5 hours, if things go
> well we can look at a longer event in the future.
> It would be good to have a quick call on this! I can Skype 19:30 UTC (or
> later) Wed 30 Oct or Thu 31 Oct as a starting suggestion, we need Iain on
> the call so I've booted up a pad.
> *If anyone would like to join (even if you can't make the first meeting),
> please record your interest in the project here!*
> http://pad.okfn.org/p/content-mining-workshop
> Jenny
> On Sun, Oct 27, 2013 at 1:21 PM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:
>> Be very happy to do so - any idea of when?
>> On Sun, Oct 27, 2013 at 11:59 AM, Jenny Molloy <jenny.molloy at okfn.org> wrote:
>>> Dear Peter, Tony and Ross (and list)
>>> I am running an Oxford Open Science meeting next month on data tools and
>>> Iain Emsley has offered to talk about text mining in literature and
>>> humanities.
>>> In order to make this a hands-on session it would be great to have an
>>> exercise which is possible to complete in around an hour to give people a
>>> flavour of a piece of software using a 'safe' dataset from which we know
>>> they'll get something. It doesn't necessarily matter what the content is,
>>> it doesn't need to be scholarly. Ross suggested TwitR as a good entry
>>> level in a short session but I'm considering making it a longer event or
>>> running another session at a later date so maybe a data expedition in the
>>> style of School of Data is a possibility!
>>> Would anyone be willing to help come up with an exercise?
>>> It would be great to use this as the starting point for more educational
>>> materials on content mining.
>>> Jenny
--
>> Peter Murray-Rust
>> Reader in Molecular Informatics
>> Unilever Centre, Dep. Of Chemistry
>> University of Cambridge
>> CB2 1EW, UK
+44-1223-763069

