[open-science-dev] [Open-access] Fwd: [open-science] fw: Python NLTK/data mining/machine learning project of public research data, anyone interested?

Laurent Romary laurent.romary at inria.fr
Mon Aug 20 13:04:01 UTC 2012




On 20 August 2012 at 14:55, Peter Murray-Rust wrote:
>> 
>> 2. We (Patrice Lopez, CCed) developed an open source library of metadata and full-text extraction tools, which creates a TEI-based representation of a PDF given as input. This could be particularly interesting for your work, I think, and an opportunity to join forces there. See https://sourceforge.net/projects/grobid/
>> 
>> Java - great for me! I'll check it out and look later, but this seems very valuable. This is exactly what we need to interface with PubCrawler (unless you also have a crawler). We can then download metadata from all the publishers - we already have scrapers for several. That gives us the metadata framework for then managing Open bibliography and using it.
> 
> It depends what you mean by crawler. Can you say more about this?
> 
> Recursively crawls publisher->journal->issue->article

This we do not have.
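
For illustration only, a crawl of that shape (publisher->journal->issue->article) could be sketched roughly as below. This is not PubCrawler's code, which is not shown in this thread; the nested dictionary `SITE` and the function `crawl` are hypothetical stand-ins for fetched publisher pages and a real fetch-and-parse step.

```python
from typing import Iterator, Tuple, Union

# Hypothetical stand-in for a publisher site (an assumption for this sketch):
# publisher -> journal -> issue -> list of article identifiers.
SITE = {
    "ExamplePublisher": {
        "Journal A": {"2012-08": ["article-1.pdf", "article-2.pdf"]},
        "Journal B": {"2012-07": ["article-3.pdf"]},
    }
}


def crawl(node: Union[dict, list], path: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Recursively descend the hierarchy, yielding one tuple per article.

    Each yielded tuple is (publisher, journal, issue, article).
    """
    if isinstance(node, dict):
        # Inner level: recurse into each child, extending the path.
        for name, child in node.items():
            yield from crawl(child, path + (name,))
    else:
        # Leaf level: a list of articles belonging to the current issue.
        for article in node:
            yield path + (article,)


results = list(crawl(SITE))
for entry in results:
    print(" -> ".join(entry))
```

In a real crawler each level would be a fetched page rather than a dictionary key, but the recursion would keep the same shape.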

>  
> 
> In PEER, I was the one to develop the XSLT stylesheets from the various publishers' formats (ScholarOne, various versions of NLM, Elsevier, Nature, ...) to TEI. I have never managed to put this together on SF, but could zip it up for whoever would want to push things further.
> 
> This assumes that one has XML. I am working on the assumption that we have the PDF only (and that's an advantage for getting the material out of diagrams)

We actually worked on both scenarios in PEER. So the software on SF works directly with PDFs, and the stylesheets are there because we also got a huge amount of data directly from publishers.
Laurent

> 
> P.
>  
> 
> -- 
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069

Laurent Romary
INRIA & HUB-IDSL
laurent.romary at inria.fr


