[ddj] Unlocking PDF data

M. Edward (Ed) Borasky znmeb at znmeb.net
Sun May 26 03:45:15 UTC 2013


If the PDF is text-based rather than scanned, you can sometimes open it
in a PDF reader (Evince or Okular on Linux, Adobe Reader on Windows),
select the table, and copy-paste it straight into Excel. You may have to
run Text to Columns on the pasted data and fix up a few rows afterwards,
but it's worth a try.
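
If the paste lands in a single column, the column split can also be done
outside Excel. Here is a minimal Python sketch, assuming the pasted rows
separate their columns with tabs or with runs of two or more spaces (that
is a guess about how a given PDF reader copies tables, so check it against
your own paste). The function names and the sample text are made up for
illustration.

```python
import csv
import re
import sys


def split_columns(line):
    """Split one pasted line into cells.

    Assumption: cells are separated by tabs or by two or more whitespace
    characters, so single spaces inside a cell are kept.
    """
    return [cell.strip() for cell in re.split(r"\t+|\s{2,}", line.strip())]


def pasted_text_to_rows(text):
    """Turn a block of pasted text into a list of rows, skipping blank lines."""
    return [split_columns(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Invented sample standing in for text copied out of a PDF reader.
    sample = "Region   Item count   Notes\nNorth East   12   first pass\n"
    # Write CSV to stdout; Excel opens CSV files directly.
    csv.writer(sys.stdout).writerows(pasted_text_to_rows(sample))
```

Redirect the output to a .csv file and open that in Excel. Rows that were
wrapped or merged in the PDF will still need fixing by hand.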

I've got pretty much every open source PDF data extraction tool
available in my Computational Journalism Publishers Workbench (Fedora
and Ubuntu Linux). For scanned PDFs, you'll need an optical character
recognition tool - I use Tesseract.
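
For the scanned case, a rough sketch of how the OCR step might be scripted
in Python. Tesseract reads images, not PDFs, so this assumes the pages are
first rendered to PNG with pdftoppm from poppler-utils. That rendering step
is my assumption here, not something the workbench prescribes, and the
function names, the 300 dpi setting, and the "ocr_pages" directory are
placeholders.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, prefix, dpi=300):
    """Build the pdftoppm call that renders each PDF page to <prefix>-N.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]


def tesseract_command(image_path, lang="eng"):
    """Build the tesseract call; it writes its text to <output base>.txt."""
    output_base = Path(image_path).with_suffix("")
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_pdf(pdf_path, workdir="ocr_pages"):
    """Render a scanned PDF to images, OCR each page, and return the text.

    Requires the pdftoppm and tesseract programs to be installed.
    """
    workdir = Path(workdir)
    workdir.mkdir(exist_ok=True)
    subprocess.run(pdftoppm_command(pdf_path, workdir / "page"), check=True)

    pages = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for image in sorted(workdir.glob("page-*.png")):
        subprocess.run(tesseract_command(image), check=True)
        pages.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(pages)
```

Calling ocr_pdf("scan.pdf") returns plain text only. OCR output from tables
usually needs the same column splitting and hand cleanup as a copy-paste.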

On Sat, May 25, 2013 at 7:28 PM, Greg Barila <gregbarila at gmail.com> wrote:
> Hi there. I'm a journalist based in Adelaide, South Australia. I've been
> dabbling in some simple data journalism projects over the past couple of
> years (see some examples here: http://adelaidedatablog.tumblr.com )
>
> I'm interested - does anybody know of a good, open-source tool for
> converting PDFs into editable documents, preferably excel?
>
> I know about tools like Tabula - but it appears the tool is experimental and
> not available for general use.
>
> Any tips would be appreciated.
>
> Greg
> (@GregBarila)
>
> _______________________________________________
> data-driven-journalism mailing list
> data-driven-journalism at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/data-driven-journalism
> Unsubscribe: http://lists.okfn.org/mailman/options/data-driven-journalism
>



-- 
Twitter: http://twitter.com/znmeb; Computational Journalism Publishers Workbench
http://j.mp/CompJournBench/

Get out of the building - and don't come back till you have the order!



