[open-science] Text mining, PDF to text conversion, and permissions on abstracts

Maximilian Haeussler maximilianh at gmail.com
Thu Mar 8 18:06:25 UTC 2012


Two years ago, I had the impression that PDFBox was the most mature software
package in this area.
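
For the ligature and whitespace problems with pdftotext output raised in the
quoted message, one converter-independent option is to clean up the text after
extraction. The following is only a minimal sketch in Python, not something
taken from PDFBox or pdftotext: it uses Unicode NFKC normalization from the
standard library to expand ligature characters, and the function name
clean_extracted_text is made up for illustration.

```python
import re
import unicodedata


def clean_extracted_text(text: str) -> str:
    """Post-process text produced by a PDF-to-text converter (illustrative helper)."""
    # NFKC folds compatibility characters into their plain equivalents,
    # e.g. the single ligature character U+FB01 becomes the two letters "fi".
    # Ordinary Greek letters are left unchanged.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces that surround line breaks, keeping the line breaks themselves.
    text = re.sub(r" *\n *", "\n", text)
    return text.strip()


if __name__ == "__main__":
    sample = "ef\ufb01cient  \ufb02ow\t of  data \n next line"
    print(clean_extracted_text(sample))
```

Note that NFKC also rewrites other compatibility characters (for example, a
superscript two becomes a plain "2"), so this may not be suitable where such
distinctions matter.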

hope this helps
cheers
Max



2012/3/8 Finn Årup Nielsen <fn at imm.dtu.dk>

> In relation to text mining:
>
>
> What do people use for converting PDF to text? My default was/is
> 'pdftotext' but it has some issues, e.g., with ligatures, Greek characters,
> and whitespace. I have looked at pyPdf, which might be promising as it is
> easier (for me) to modify its extractText method. The A-PDF GUI program didn't
> work under Wine on my Ubuntu machine. Adobe Acrobat had the same issues as
> pdftotext, also has a problem with two-column layouts, and is not a CLI
> program. I have some
> notes here: http://neuro.imm.dtu.dk/wiki/PDF
>
>
> Following Todd Vision's "text-mining restrictions redux" email:
>
> What about abstracts from full text papers? Does anyone know how
> publishers feel about their abstracts? Can we republish them? Is that fair
> use? Are they CC-BY-NC or perhaps even CC-BY? I cannot find any explicit
> remark about that from the publishers.
>
> Joe Dunckley
> http://journalology.blogspot.com/2010/05/why-you-cant-copy-abstracts-into.html
>
> http://friendfeed.com/yokofakun/0795d1b5/abstract-of-article-is-it-in-public-domain-true
>
> http://www.sciencedirect.com/science/article/pii/S1053811909005990
>
>
> Finn Årup Nielsen
>
> _______________________________________________
> open-science mailing list
> open-science at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/open-science
>