[open-science-dev] [Open-access] Fwd: [open-science] fw: Python NLTK/data mining/machine learning project of public research data, anyone interested?
Peter Murray-Rust
pm286 at cam.ac.uk
Mon Aug 20 13:19:05 UTC 2012
On Mon, Aug 20, 2012 at 2:04 PM, Laurent Romary <laurent.romary at inria.fr> wrote:
>
>
> It depends what you mean by crawler. Can you say more about this?
>
>
> Recursively crawls publisher->journal->issue->article
>
>
> This we do not have.
>
>
Great!! Then we interface directly.
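
As a rough sketch of what "recursively crawls publisher->journal->issue->article" could look like, assuming a hypothetical in-memory hierarchy rather than real publisher pages (the names SAMPLE_SITE and crawl are made up for illustration, and a real crawler would fetch and parse pages at each level):

```python
from typing import Iterator, Tuple, Union

# Hypothetical data: publisher -> journal -> issue -> list of article IDs.
# A real crawler would discover these levels by fetching web pages.
SAMPLE_SITE = {
    "PublisherA": {
        "JournalX": {
            "2012-01": ["art1", "art2"],
            "2012-02": ["art3"],
        },
    },
    "PublisherB": {
        "JournalY": {
            "2012-01": ["art4"],
        },
    },
}


def crawl(node: Union[dict, list], path: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Depth-first walk yielding (publisher, journal, issue, article) tuples."""
    if isinstance(node, dict):
        # Intermediate level: descend into each child, extending the path.
        for name, child in node.items():
            yield from crawl(child, path + (name,))
    else:
        # Leaf level: a list of article identifiers.
        for article in node:
            yield path + (article,)


if __name__ == "__main__":
    for publisher, journal, issue, article in crawl(SAMPLE_SITE):
        print(publisher, journal, issue, article)
```

The recursion keeps the traversal independent of how many levels a given publisher actually exposes; only the leaf handling would need to change to download a PDF instead of yielding an identifier.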
>
>
>>
>> In PEER, I was the one to develop the XSLT stylesheets from the various
>> publishers' formats (ScholarOne, various versions of NLM, Elsevier, Nature,
>> ...) to TEI. I have never managed to put this together in SF, but could zip
>> this to whoever would want to push things further.
>>
> This assumes that one has XML. I am working on the assumption that we
> have the PDF only (and that's an advantage for getting the material out of
> diagrams).
>
>
> We actually worked on both scenarios in PEER. So the software on SF works
> directly with PDFs and the stylesheets are there because we also got a huge
> amount of data directly from publishers.
>
The problem with material from publishers is that it is usually a one-off
provision, and there are often legal constraints.
P.
> Laurent
>
>
> P.
>
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
>
>
> Laurent Romary
> INRIA & HUB-IDSL
> laurent.romary at inria.fr
>
>
>
>
--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069