[School-of-data] PDF vs ePUB vs Open Data, was: PDF Extraction Tools

Mon Feb 11 21:00:35 UTC 2013

Hi,

If I recall correctly, EPUB is basically a set of XHTML files (and images)
zipped in an archive. Extracting data from an EPUB document is _MUCH_
simpler than from a PDF (especially considering that in many cases the PDF
is based on a scanned document). I'd like to see more EPUBs and fewer PDFs in
general, but there are certain communities where the latter is the de facto
standard (e.g., scientific publications, government reports, etc.).
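
To illustrate why the zipped-XHTML layout makes extraction simple, here is a
minimal sketch using only the Python standard library. The names epub_text
and TextCollector are made up for this example, and it assumes the content
files end in .xhtml, .html or .htm and are UTF-8 encoded. It walks the
archive in directory order, not in the reading order the EPUB declares in
its package file, so treat it as a starting point only.

```python
import zipfile
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth of <script>/<style> elements we are inside

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def epub_text(source):
    """Return {member name: text} for every (X)HTML file in an EPUB.

    `source` is a path or a binary file-like object (hypothetical helper,
    not part of any EPUB library).
    """
    texts = {}
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            # Assumption: content documents use one of these extensions.
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = TextCollector()
                raw = archive.read(name)
                parser.feed(raw.decode("utf-8", errors="replace"))
                parser.close()
                texts[name] = " ".join(parser.parts)
    return texts
```

Calling epub_text("report.epub") (a hypothetical file name) returns a
dictionary mapping each XHTML member of the archive to its plain text.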

BTW, I remember a Linux command (pdftotext, I think) that was really
useful for getting about 80% of the content out of a bunch of PDF documents.
Of course, YMMV.
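
For a whole folder of PDFs, something along these lines may work. It is a
sketch that assumes the pdftotext binary (shipped with Xpdf and Poppler) is
installed and on the PATH. The helper names pdftotext_command and
extract_folder are invented for this example. A scanned PDF without a text
layer will typically come back empty or nearly so, because pdftotext does
no OCR.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, layout=False):
    """Build the pdftotext argument list; '-' sends the text to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    cmd += [str(pdf_path), "-"]
    return cmd


def extract_folder(folder):
    """Run pdftotext on every *.pdf in `folder`; return {filename: text}."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    results = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        done = subprocess.run(
            pdftotext_command(pdf), capture_output=True, check=True
        )
        # Assumption: the output is UTF-8, pdftotext's usual default.
        results[pdf.name] = done.stdout.decode("utf-8", errors="replace")
    return results
```

Passing layout=True can help when the PDFs contain tables, since column
alignment then tends to survive in the text output.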

Best,
Al

Alvaro Graves-Fuenzalida
Web: http://graves.cl - Twitter: @alvarograves

On Mon, Feb 11, 2013 at 12:36 PM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:

>
>
> On Mon, Feb 11, 2013 at 7:22 PM, M. Fioretti <mfioretti at nexaima.net> wrote:
>
>> On Thu, Dec 20, 2012 11:54:38 AM +0100, Michael Bauer wrote:
>> > Hi,
>> >
>> > For the next School of Data tutorial I would like to cover Data
>> > extraction from PDFs (and text) as well as OCR.  Does anyone here
>> > have experience with Tools that do not require coding skills to
>> > extract data and text from PDFs?
>>
> Our PDF2SVG progresses well - we have largely cracked the problem of
> rogue encodings. We are working to create running text (XHTML) from this
> (bitbucket.org/petermr/pdf2svg). I have a prototype that will extract
> graphs from PDFs as long as they are in PS/vector (we may be able to do
> bitmaps later).
>
>
>>
>> My question is: from an automatic data extraction point of view, what
>> is better between ePUB and PDF? In other words, what should an Open
>> Data activist, interested in that kind of data processing, accept or
>> recommend as a format for ebooks from public administrations?
>>
>
> PDF is not semantic and in the worst case nothing can be extracted - not
> even the characters. I would recommend that it is NEVER used to convey
> information.
>
>>
>> My feeling is that ePUB is better, but I am not sure. What do you
>> think?
>>
>
> I am not very familiar with ePUB but I believe that it always has to have
> some semantics (e.g. HTML5). And that the words and paragraphs are always
> identifiable (unlike PDF).
>
>>
>>
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
> _______________________________________________
> School-of-data mailing list
> School-of-data at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/school-of-data
> Unsubscribe: http://lists.okfn.org/mailman/options/school-of-data
>
>