[open-science] Text mining, PDF to text conversion, and permissions on abstracts

Thu Mar 8 09:36:44 UTC 2012

In relation to text mining:

What do people use for converting PDF to text? My default has been 
'pdftotext', but it has some issues, e.g., with ligatures, Greek 
characters, and whitespace. I have looked at pyPdf, which might be 
promising as it is easier (for me) to modify its extractText method. 
The A-PDF GUI program didn't work under Wine on my Ubuntu machine. 
Adobe Acrobat had the same issues as pdftotext; it also has trouble 
with two-column layouts, and it is not a CLI program. I have some 
notes here: http://neuro.imm.dtu.dk/wiki/PDF
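
As a rough illustration of the ligature and whitespace issues mentioned 
above, here is a minimal Python sketch that post-processes text after it 
has been extracted by some tool. The function name clean_extracted_text 
and the choice of which characters to fix are my own assumptions for 
illustration; it is not part of pdftotext or pyPdf, and it does nothing 
about two-column layouts or wrongly mapped Greek characters.

```python
import re

# Latin ligature code points (U+FB00..U+FB04) mapped to plain letters.
# Assumption: these are the ligatures that show up in the extracted text.
LIGATURES = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
})


def clean_extracted_text(text):
    """Hypothetical helper: tidy text already extracted from a PDF."""
    # Replace ligature characters with their letter sequences.
    text = text.translate(LIGATURES)
    # Collapse runs of spaces, tabs and no-break spaces within each line.
    lines = [re.sub(r"[ \t\u00a0]+", " ", line).strip()
             for line in text.splitlines()]
    text = "\n".join(lines)
    # Keep paragraph breaks but squeeze three or more newlines into two.
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```

The explicit table is used instead of a blanket Unicode normalization 
(such as NFKC) because the latter would also flatten superscripts and 
similar characters, which may or may not be wanted in a text-mining 
pipeline. Greek letters that are already correct Unicode pass through 
unchanged.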

Following Todd Vision's "text-mining restrictions redux" email:

What about abstracts from full-text papers? Does anyone know how 
publishers feel about their abstracts? Can we republish them? Is that 
fair use? Are they CC-BY-NC or perhaps even CC-BY? I cannot find any 
explicit statement about that from the publishers.

Joe Dunckley
http://journalology.blogspot.com/2010/05/why-you-cant-copy-abstracts-into.html

http://friendfeed.com/yokofakun/0795d1b5/abstract-of-article-is-it-in-public-domain-true

http://www.sciencedirect.com/science/article/pii/S1053811909005990

Finn Årup Nielsen