[Open-access] [open-science-dev] Fwd: [open-science] fw: Python NLTK/data mining/machine learning project of public research data, anyone interested?

Mon Aug 20 12:55:51 UTC 2012

On Mon, Aug 20, 2012 at 1:17 PM, Laurent Romary <laurent.romary at inria.fr> wrote:

>
> On 20 August 2012 at 13:50, Peter Murray-Rust wrote:
>
> That was 20 years ago ;-). It now comes with a very flexible customization
> platform allowing you to dumb things down to your needs. [disclosure: I
> chaired the TEI Technical Council from 2008 to 2011]. Remember also that
> the TEI people were involved in setting up XML (e.g. Michael
> Sperberg-McQueen, co-editor of both the TEI and the XML Recommendation in
> the late 1990s). And it comes automatically with the Oxygen XML editor.
>
>
Indeed. I ran the XML-DEV list in 1997, so I knew many of the people involved in TEI.

>
> 2. We (Patrice Lopez, CCed) developed an open source library of metadata
>> and full-text extraction tools, which creates a TEI-based representation
>> of a PDF given as input. This could be particularly interesting for your
>> work, I think, and an opportunity to join forces there.
>> See https://sourceforge.net/projects/grobid/
>>
>
> Java - great for me! I'll check it out and look later, but this seems very
> valuable. This is exactly what we need to interface with PubCrawler (unless
> you also have a crawler). We can then download metadata from all the
> publishers - we already have scrapers for several. That gives us the
> metadata framework for managing Open Bibliography and using it.
>
>
> It depends what you mean by crawler. Can you say more about this?
>

Recursively crawls publisher->journal->issue->article
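
The recursion described above could be sketched roughly as follows. This is only an illustration, not PubCrawler's actual code: the `crawl` function, the `fetch_children` callback, and the toy in-memory site are all hypothetical names made up for this example, and a real crawler would replace `toy_fetch` with per-publisher scrapers.

```python
from typing import Callable, Iterator, List, Tuple

# The four levels named in the message above.
LEVELS = ("publisher", "journal", "issue", "article")


def crawl(
    node: str,
    fetch_children: Callable[[str, str], List[str]],
    depth: int = 0,
    path: Tuple[str, ...] = (),
) -> Iterator[Tuple[str, ...]]:
    """Yield one (publisher, journal, issue, article) tuple per article.

    fetch_children(level, node) is a hypothetical callback returning the
    identifiers one level below `node`.
    """
    path = path + (node,)
    if depth == len(LEVELS) - 1:
        # Reached the article level: emit the full path and stop descending.
        yield path
        return
    for child in fetch_children(LEVELS[depth], node):
        yield from crawl(child, fetch_children, depth + 1, path)


# Toy stand-in for a publisher site (assumption: no network access here).
TOY_SITE = {
    "ExamplePress": ["J. Example Chem."],
    "J. Example Chem.": ["vol1-issue1", "vol1-issue2"],
    "vol1-issue1": ["article-a", "article-b"],
    "vol1-issue2": ["article-c"],
}


def toy_fetch(level: str, node: str) -> List[str]:
    # A real scraper would dispatch on `level`; the toy just looks up the node.
    return TOY_SITE.get(node, [])


articles = list(crawl("ExamplePress", toy_fetch))
for publisher, journal, issue, article in articles:
    print(publisher, journal, issue, article)
```

Keeping the site-specific part behind a single callback means each publisher's scraper can be swapped in without touching the traversal itself.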

>
> In PEER, I was the one who developed the XSLT stylesheets from the various
> publishers' formats (ScholarOne, various versions of NLM, Elsevier, Nature,
> ...) to TEI. I have never managed to put this together on SF, but could zip
> it up for whoever would want to push things further.
>

This assumes that one has XML. I am working on the assumption that we have
the PDF only (and that's an advantage for getting the material out of
diagrams).

P.

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069