[School-of-data] PDF vs ePUB vs Open Data, was: PDF Extraction Tools

Mon Feb 11 20:36:22 UTC 2013

On Mon, Feb 11, 2013 at 7:22 PM, M. Fioretti <mfioretti at nexaima.net> wrote:

> On Thu, Dec 20, 2012 11:54:38 AM +0100, Michael Bauer wrote:
> > Hi,
> >
> > For the next School of Data tutorial I would like to cover Data
> > extraction from PDFs (and text) as well as OCR.  Does anyone here
> > have experience with Tools that do not require coding skills to
> > extract data and text from PDFs?
>

Our PDF2SVG is progressing well - we have largely cracked the problem of rogue
encodings. We are working to create running text (XHTML) from this
(bitbucket.org/petermr/pdf2svg). I have a prototype that will extract graphs
from PDFs as long as they are in PS/vector (we may be able to do bitmaps
later).

>
> My question is: from an automatic data extraction point of view, what
> is better between ePUB and PDF? In other words, what should an Open
> Data activist, interested in that kind of data processing, accept or
> recommend as a format for ebooks from public administrations?
>

PDF carries no semantic structure, and in the worst case nothing can be
extracted - not even the characters. I would recommend that it NEVER be used
to convey information.

>
> My feeling is that ePUB is better, but I am not sure. What do you
> think?
>

I am not very familiar with ePUB, but I believe that it always has to carry
some semantics (e.g. HTML5), and that the words and paragraphs are always
identifiable (unlike PDF).
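
To illustrate why paragraphs are recoverable from ePUB: an ePUB file is a ZIP
archive whose content documents are (X)HTML, so the paragraph markup survives
in the file itself. The following is only a rough sketch using the Python
standard library, not a complete ePUB reader. The names `ParagraphCollector`
and `epub_paragraphs` are made up for this example, and it assumes well-formed
markup with closed <p> elements. It also simply walks the archive in stored
order instead of following the reading order declared in the package file.

```python
import zipfile
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text content of every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._depth = 0    # greater than 0 while inside a <p>
        self._chunks = []  # text pieces of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Join the pieces and normalise runs of whitespace.
                text = " ".join("".join(self._chunks).split())
                if text:
                    self.paragraphs.append(text)
                self._chunks = []

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)


def epub_paragraphs(source):
    """Return paragraph texts from the (X)HTML files inside an ePUB.

    `source` is a file path or a binary file-like object.
    Simplification: files are read in archive order, not spine order.
    """
    paragraphs = []
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                collector = ParagraphCollector()
                markup = archive.read(name).decode("utf-8", errors="replace")
                collector.feed(markup)
                collector.close()
                paragraphs.extend(collector.paragraphs)
    return paragraphs
```

Calling `epub_paragraphs("report.epub")` (a hypothetical file name) would
return a list of strings, one per paragraph. Nothing comparable is guaranteed
for PDF, where the text may be stored as positioned glyphs with no paragraph
markers at all.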


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069