[ddj] Lagarde List

Sam Leon sam.leon at okfn.org
Wed Mar 4 17:00:10 UTC 2015


Hi Andrea,

I wish I could say it was a difficult job, but it was actually pretty
straightforward! The heavy lifting was done by ABBYY FineReader Online
<http://www.finereaderonline.com/en-us>, which lets you get high-quality OCR
direct to Excel or Word for documents. It costs €4 for 200 pages (although
this only used 52 pages) and took about 30 minutes.

It's a really great OCR service and is going to change the way I work as
I'm very often wrangling data trapped in scans.

I don't have stats on how successful the OCR was, but you can see from some
parts of the Excel file that the Greek characters get a bit muddled. I
wasn't using these, so that didn't pose a serious issue for me.

Sam

On 4 March 2015 at 15:29, Andrea Nelson Mauro <andrea.nelson.mauro at gmail.com> wrote:

> Hi Sam,
>
> What a great example of scraping! Can you share some stats on your work
> (like the error rate and the time it took)? I have some friends from
> Wikipedia who would be very interested in the tool!
>
> -
> sorry for typos, sent by mobile
> -
> Andrea Nelson Mauro
> @nelsonmau
> dataninja.it
> -
> On 4 Mar 2015 at 16:16, "Sam Leon" <sam.leon at okfn.org> wrote:
>
> Hi All,
>>
>> In case anyone is interested, I ended up using ABBYY FineReader Online
>> <http://www.finereaderonline.com/en-us> to OCR the PDF (thanks for the
>> recommendation, Friedrich!). It gave fairly good results, which were good
>> enough for my purposes.
>>
>> I've attached the end product in case it's useful to anyone else.
>>
>> Cheers,
>>
>> Sam
>>
>> On 12 February 2015 at 09:19, Sam Leon <sam.leon at okfn.org> wrote:
>>
>>> Thank you Victoria and Theresa. It's not immediately clear whether the public
>>> ICIJ data contains the names from the Lagarde List. I'll dig in later this
>>> week and report my findings here.
>>>
>>> Sam
>>>
>>> On 11 February 2015 at 12:50, Victoria Parsons <
>>> victoria.megan.parsons at googlemail.com> wrote:
>>>
>>>> I haven't had a proper look, but what about this link from the comments
>>>> under the article:
>>>> http://icij-uploads.s3-website-us-east-1.amazonaws.com/2013/10/offshore/csv.zip
>>>>
>>>> Vic
>>>>
>>>> --
>>>>
>>>> Victoria Parsons
>>>> The Bureau of Investigative Journalism
>>>> The Myddleton Building, 167-173 Goswell Road
>>>> London EC1V 7HD
>>>> +44 (0)20 7040 0095
>>>> @vicparsons_ <https://twitter.com/vicparsons_>
>>>> Public key <http://bit.ly/1BpOrVG>
>>>>
>>>>
>>>> On Wed, Feb 11, 2015 at 12:18 PM, Theresa Mallinson <
>>>> theresa.mallinson at gmail.com> wrote:
>>>>
>>>>> Hmmm, or maybe not... I just read the comments under the article. But it
>>>>> seems people are getting together to pressure ICIJ to release the list. Does
>>>>> Tabula not work for scanned PDFs?
>>>>>
>>>>> *Theresa Mallinson*
>>>>> Assistant editor at *The Daily Vox <http://www.thedailyvox.co.za>*
>>>>> *@tcmallinson <http://www.twitter.com/tcmallinson>*
>>>>> +27 76 673 4076
>>>>> Subscribe to me on *Beacon *
>>>>> <http://www.beaconreader.com/theresa-mallinson>
>>>>>
>>>>> On 11 February 2015 at 14:15, Theresa Mallinson <
>>>>> theresa.mallinson at gmail.com> wrote:
>>>>>
>>>>>> I haven't had a chance to look at this properly yet, but could it help?
>>>>>> http://www.icij.org/project/swiss-leaks/explore-swiss-leaks-data At the
>>>>>> very least, ICIJ should be able to help you out with the list?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On 11 February 2015 at 14:11, Sam Leon <sam.leon at okfn.org> wrote:
>>>>>>
>>>>>>> Does anyone have a machine-readable copy of the "Lagarde List
>>>>>>> <http://en.wikipedia.org/wiki/Lagarde_list>" they could share?
>>>>>>>
>>>>>>> I can only find the scanned PDFs that have been published...
>>>>>>>
>>>>>>> http://www.protothema.gr/files/1/2013/03/21/lagarde-list.pdf
>>>>>>>
>>>>>>> Sam
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> *Sam Leon*
>>>>>>> Senior analyst & trainer | skype: samedleon | @Noel_Mas <https://twitter.com/noel_mas>
>>>>>>> The Open Knowledge Foundation <http://okfn.org/>
>>>>>>> Empowering through Open Knowledge
>>>>>>> http://okfn.org/ | @okfn <http://twitter.com/OKFN> | OKF on Facebook <https://www.facebook.com/OKFNetwork> | Blog <http://blog.okfn.org/> | Newsletter <http://okfn.org/about/newsletter>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> data-driven-journalism mailing list
>>>>>>> data-driven-journalism at lists.okfn.org
>>>>>>> https://lists.okfn.org/mailman/listinfo/data-driven-journalism
>>>>>>> Unsubscribe:
>>>>>>> https://lists.okfn.org/mailman/options/data-driven-journalism
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>

