[School-of-data] PDF vs ePUB vs Open Data, was: PDF Extraction Tools

M. Fioretti mfioretti at nexaima.net
Mon Feb 11 19:22:16 UTC 2013

On Thu, Dec 20, 2012 11:54:38 AM +0100, Michael Bauer wrote:
> Hi,
> For the next School of Data tutorial I would like to cover Data
> extraction from PDFs (and text) as well as OCR.  Does anyone here
> have experience with Tools that do not require coding skills to
> extract data and text from PDFs?


Looking at all the detailed discussion on this topic, I am confident
that members of this list will also have plenty of informed opinions
on a somewhat related topic.

Ebooks are rapidly gaining popularity. It is safe to assume that many
public reports, laws, minutes and other documents will also become
available in ebook formats in the next few years.

So far, most of the discussion about automatic parsing of government
data has been about stuff that was either spreadsheets or simple text

My question is: from an automatic data extraction point of view, what
is better between ePUB and PDF? In other words, what should an Open
Data activist, interested in that kind of data processing, accept or
recommend as a format for ebooks from public administrations?
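
To make the question concrete: an ePUB is a ZIP archive whose content
documents are XHTML files, so at least the text can be reached with
standard tools. Below is a minimal Python sketch of that idea, using
only the standard library. The names epub_text and make_sample_epub,
the sample file content, and the UTF-8 assumption are all hypothetical
and only for illustration; a real ePUB should be read in the order
given by its package file, which this sketch ignores.

```python
import io
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an (X)HTML document, ignoring markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-empty text, stripped of surrounding whitespace.
        if data.strip():
            self.chunks.append(data.strip())


def epub_text(epub_file):
    """Return {member name: plain text} for each XHTML/HTML file in an ePUB.

    Hypothetical helper: it relies only on the ePUB being a ZIP archive
    and assumes the content documents are UTF-8 encoded.
    """
    texts = {}
    with zipfile.ZipFile(epub_file) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _TextCollector()
                parser.feed(archive.read(name).decode("utf-8"))
                parser.close()
                texts[name] = " ".join(parser.chunks)
    return texts


def make_sample_epub():
    """Build a tiny ePUB-like archive in memory (made-up content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr(
            "OEBPS/chapter1.xhtml",
            "<html><body><h1>Minutes</h1><p>Budget approved.</p></body></html>",
        )
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for member, text in epub_text(make_sample_epub()).items():
        print(member, "->", text)
```

The same function accepts a path to an ePUB file on disk instead of the
in-memory sample. Whether comparable results are as easy to get from a
given PDF is exactly what I would like to hear about.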

My feeling is that ePUB is better, but I am not sure. What do you

