[ddj] Unlocking PDF data

Peter Borbely pborbely at fairfaxmedia.com.au
Mon May 27 04:55:10 UTC 2013


Greg

The tools vary slightly depending on the OS you're using and also
on how the PDF was created in the first place.
If you have Acrobat Pro, you get a lot of great tools for exporting
from PDF, including character recognition (OCR), which converts images
such as scanned text into alphanumeric data. Google Docs may do that
for you as well (with limited file sizes, however).

OpenOffice is probably your next port of call; see this thread:
http://forum.openoffice.org/en/forum/viewtopic.php?t=43632
Googling will turn up thousands more resources.
You may find that you need tools or scripts specific to the semantic
structure of your particular PDF (or, all too often, its lack of
one).
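
To give a rough idea of what such a script can look like, here is a
minimal Python sketch. It assumes you have already got plain text out
of the PDF by some other means, and that the columns in that text are
separated by two or more spaces; both are assumptions about your file,
not a general rule. The function names and the sample table are made up
for illustration.

```python
import csv
import io
import re

# Assumption: columns in the extracted text are separated by runs of
# two or more whitespace characters. Adjust this for your own PDF.
COLUMN_GAP = re.compile(r"\s{2,}")


def text_to_rows(text):
    """Split each non-empty line of extracted text into a list of cells."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(COLUMN_GAP.split(line))
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up sample standing in for text pulled out of a PDF table.
sample = """
Suburb          Population    Median age
Adelaide        1200          34
North Adelaide  800           41
"""

print(rows_to_csv(text_to_rows(sample)))
```

Single spaces inside a cell ("North Adelaide") survive because only
wider gaps are treated as column breaks. Tables with empty cells or
wrapped lines will need more care than this.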

cheers
Peter


On 26 May 2013 13:19, <data-driven-journalism-request at lists.okfn.org> wrote:
>
> Today's Topics:
>
>    1. Re: Tax Avoidance and Evasion Data Expedition
>       (Square One Dr Peter Troxler (KvK 24480536))
>    2. Unlocking PDF data (Greg Barila)
>    3. Re: Unlocking PDF data (Jesus Lopez Osorio)
>    4. Re: Unlocking PDF data (Andrew Duffy)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sat, 25 May 2013 19:22:53 +0200
> From: "Square One Dr Peter Troxler (KvK 24480536)" <peter at square-1.eu>
> Subject: Re: [ddj] Tax Avoidance and Evasion Data Expedition
> To: "List about Data Driven Journalism and Open Data in Journalism."
>         <data-driven-journalism at lists.okfn.org>
> Cc: "schoolofdata at okfn.org" <schoolofdata at okfn.org>
> Message-ID: <430A5CCF-1CCF-44AA-B315-F6D4672DDB18 at square-1.eu>
> Content-Type: text/plain; charset="windows-1252"
>
> I wish you'd make that a key topic at OKConf in Geneva, but I doubt you'll be able to push that past the corporate agenda of the event.
>
> On 24 May 2013, at 19:07 , Lucy Chambers <lucy.chambers at okfn.org> wrote:
>
> > Hi All,
> >
> > Tax avoidance and tax evasion are the topic of the day at the Open Knowledge Foundation.
> >
> > On 6th of June the School of Data will be running a data expedition (a group-based guided journey - picking some key questions, then seeing whether it is possible to answer them) on the topic of tax evasion to help those reporting on the topic to get a grasp of key concepts and schemes. This will take place online.
> >
> > Places are limited as this will be quite hands on - please sign up early if you would like to be involved (you must be able to dedicate at least 3-6 hours on 6th June).
> >
> > More details and signup here:
> >
> > http://schoolofdata.org/2013/05/24/data-expedition-tax-avoidance-and-evasion/
> >
> > Note: In future, we will try to kick off data expeditions with "expert introductions", in which people who are knowledgeable about a particular topic give a short 15-30 minute introduction via videolink or recording, or even a Q & A. If you know someone we should invite to take the stage on tax evasion or tax avoidance, please let us know!
> >
> > All the best,
> >
> > Lucy
> >
> >
> >
> >
> >
> > --
> > Lucy Chambers
> >
> > Project Coordinator  | skype: lucyfediachambers  |  tel: +44 7909 330731  |  @lucyfedia
> >
> > The Open Knowledge Foundation
> > Empowering through Open Knowledge
> > http://okfn.org/  |  @okfn  |  OKF on Facebook  |  Blog  |  Newsletter
> >
> > OpenSpending | http://openspending.org/ | @openspending |  Tracking every government financial transaction across the world
> > School of Data | http://schoolofdata.org | @schoolofdata | Evidence is Power
> > _______________________________________________
> > data-driven-journalism mailing list
> > data-driven-journalism at lists.okfn.org
> > http://lists.okfn.org/mailman/listinfo/data-driven-journalism
> > Unsubscribe: http://lists.okfn.org/mailman/options/data-driven-journalism
>
> ------------------------------
>
> Message: 2
> Date: Sun, 26 May 2013 11:58:17 +0930
> From: Greg Barila <gregbarila at gmail.com>
> Subject: [ddj] Unlocking PDF data
> To: data-driven-journalism at lists.okfn.org
> Message-ID:
>         <CAFv_f8QXHhcVDT7NAY+UEFkad3CKGQVdyf+BEYx8ssQkd_WJfQ at mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi there. I'm a journalist based in Adelaide, South Australia. I've been
> dabbling in some simple data journalism projects over the past couple of
> years (see some examples here: http://adelaidedatablog.tumblr.com )
>
> I'm interested - does anybody know of a good, open-source tool for
> converting PDFs into editable documents, preferably excel?
>
> I know about tools like Tabula - but it appears the tool is experimental
> and not available for general use.
>
> Any tips would be appreciated.
>
> Greg
> (@GregBarila)
>
> ------------------------------
>
> Message: 3
> Date: Sun, 26 May 2013 04:43:12 +0200
> From: Jesus Lopez Osorio <jesuslopez.osorio at elmundo.es>
> Subject: Re: [ddj] Unlocking PDF data
> To: List about Data Driven Journalism and Open Data in Journalism.
>         <data-driven-journalism at lists.okfn.org>
> Message-ID:
>         <B14C05A4B058424D8AD8CE2D3C3AAACC0143BDCFDA1C at UE-MAILCCR.oficina.int>
> Content-Type: text/plain; charset="us-ascii"
>
> Have you tried Zamzar? It worked for me once
> good luck down under!
>
> ------------------------------
>
> Message: 4
> Date: Sun, 26 May 2013 11:19:28 +0800
> From: Andrew Duffy <andrewjamesduffy at gmail.com>
> Subject: Re: [ddj] Unlocking PDF data
> To: "List about Data Driven Journalism and Open Data in Journalism."
>         <data-driven-journalism at lists.okfn.org>
> Message-ID:
>         <CAO8PDYQcbiJ2LmzYgLiynWruCHurQ0g0B4mfELgxfMRLGcJ9QQ at mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Last time I checked Google Drive could convert PDFs into editable documents
> if you tell it to.
>
> --Andrew Duffy (from Perth)
>
>
> --
>
> *Andrew Duffy - Journalist*
>
> *Cirrus Media*
>
> ------------------------------
>
>
> End of data-driven-journalism Digest, Vol 26, Issue 34
> ******************************************************




