[open-science] Text mining, PDF to text conversion, and permissions on abstracts
Finn Årup Nielsen
fn at imm.dtu.dk
Thu Mar 8 09:36:44 UTC 2012
In relation to text mining:
What do people use for converting PDF to text? My default was, and
still is, 'pdftotext', but it has some issues, e.g., with ligatures,
Greek characters, and whitespace. I have looked at pyPdf, which might
be promising as it is easier (for me) to modify its extractText
method. The A-PDF GUI program didn't work under Wine on my Ubuntu
machine. Adobe Acrobat had the same issues as pdftotext; it also has
trouble with two-column layouts, and it is not a CLI program. I have
some notes here: http://neuro.imm.dtu.dk/wiki/PDF
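
To make the ligature and whitespace issues concrete, here is a rough
Python sketch of a clean-up pass over extracted text. The names
LIGATURES, clean_extracted_text and pdf_to_text are made up for this
illustration, the ligature table covers only the common Latin f-
ligatures, and the wrapper assumes a 'pdftotext' binary on the PATH.
It does nothing for Greek characters or two-column layouts.

```python
import re
import subprocess

# Assumption: only the common Latin f-ligatures are handled here.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def clean_extracted_text(text):
    """Expand ligatures and normalize spacing in extracted text."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    # Collapse runs of spaces, tabs and no-break spaces into one space.
    text = re.sub(r"[ \t\u00a0]+", " ", text)
    # Drop spaces on either side of a line break.
    text = re.sub(r" *\n *", "\n", text)
    return text.strip()


def pdf_to_text(path):
    """Hypothetical wrapper: run pdftotext and clean its output.

    Assumes the 'pdftotext' command is installed; '-' sends the text
    to standard output instead of a file.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", path, "-"],
        check=True,
        capture_output=True,
    )
    return clean_extracted_text(result.stdout.decode("utf-8"))
```

For example, clean_extracted_text("ef\ufb01cient   \ufb02ow") gives
"efficient flow". An explicit table was chosen over a blanket Unicode
normalization so that other characters are left untouched.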
Following Todd Vision's "text-mining restrictions redux" email:
What about abstracts from full-text papers? Does anyone know how
publishers feel about their abstracts? Can we republish them? Is that
fair use? Are they CC-BY-NC or perhaps even CC-BY? I cannot find any
explicit statement about that from the publishers.
Joe Dunckley
http://journalology.blogspot.com/2010/05/why-you-cant-copy-abstracts-into.html
http://friendfeed.com/yokofakun/0795d1b5/abstract-of-article-is-it-in-public-domain-true
http://www.sciencedirect.com/science/article/pii/S1053811909005990
Finn Årup Nielsen