[School-of-data] PDF vs ePUB vs Open Data, was: PDF Extraction Tools
pm286 at cam.ac.uk
Mon Feb 11 20:36:22 UTC 2013
On Mon, Feb 11, 2013 at 7:22 PM, M. Fioretti <mfioretti at nexaima.net> wrote:
> On Thu, Dec 20, 2012 11:54:38 AM +0100, Michael Bauer wrote:
> > Hi,
> > For the next School of Data tutorial I would like to cover Data
> > extraction from PDFs (and text) as well as OCR. Does anyone here
> > have experience with Tools that do not require coding skills to
> > extract data and text from PDFs?
Our PDF2SVG progresses well - we have largely cracked the problem of rogue
encodings. We are working to create running text (XHTML) from this
(bitbucket.org/petermr/pdf2svg). I have a prototype that will extract graphs
from PDFs as long as they are in PS/vector (we may be able to do bitmaps
> My question is: from an automatic data extraction point of view, what
> is better between ePUB and PDF? In other words, what should an Open
> Data activist, interested in that kind of data processing, accept or
> recommend as a format for ebooks from public administrations?
PDF is not semantic and in the worst case nothing can be extracted - not
even the characters. I would recommend that it is NEVER used to convey
> My feeling is that ePUB is better, but I am not sure. What do you
I am not very familiar with ePUB, but I believe that it always has to have
some semantics (e.g. HTML5), and that the words and paragraphs are always
identifiable (unlike PDF).
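
As a rough illustration of why identifiable paragraphs matter for extraction,
here is a minimal Python sketch. It assumes an ePUB is a zip archive whose
content documents are well-formed XHTML. The sample file, its path
(OEBPS/chapter1.xhtml) and the helper names build_sample_epub and
extract_paragraphs are hypothetical, made up for this example. It is not how
PDF2SVG works, and a real ePUB should be read in the order given by its
package document rather than by sorted file name.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import List

# Namespace used by XHTML content documents.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def build_sample_epub() -> bytes:
    """Build a tiny, hypothetical ePUB-like zip in memory (illustration only)."""
    chapter = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<html xmlns="http://www.w3.org/1999/xhtml">'
        "<head><title>Budget</title></head>"
        "<body><h1>Budget 2013</h1>"
        "<p>Total spending was 42 million.</p>"
        "<p>Revenue was <b>40</b> million.</p>"
        "</body></html>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("OEBPS/chapter1.xhtml", chapter)
    return buf.getvalue()


def extract_paragraphs(epub_bytes: bytes) -> List[str]:
    """Return the text of every <p> element found in the XHTML files."""
    paragraphs = []
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        # Simplification: sorted file names stand in for the real reading order.
        for name in sorted(zf.namelist()):
            if not name.endswith((".xhtml", ".html")):
                continue
            root = ET.fromstring(zf.read(name))
            for p in root.iter(XHTML_NS + "p"):
                # Join text from nested inline elements, then normalise spaces.
                text = "".join(p.itertext())
                paragraphs.append(" ".join(text.split()))
    return paragraphs


if __name__ == "__main__":
    for para in extract_paragraphs(build_sample_epub()):
        print(para)
```

The markup itself marks where each paragraph starts and ends, so no layout
guessing is needed. With PDF the same step would require reconstructing words
and lines from positioned glyphs.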
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK