[ddj] scraping data with a bookmarklet: Convextra

Tommy Kaas tommybirchkaas at gmail.com
Fri Apr 12 12:08:04 UTC 2013


Yásh,
An OCR application can be a good option. I have extracted lots of
ill-formed tables from PDF files with ABBYY FineReader.
Tommy
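
OCR output usually still needs some cleanup before it is a usable table.
A minimal Python sketch of that step, assuming the OCR result was exported
as plain text and that columns are separated by two or more spaces. Both
are assumptions for illustration, not something FineReader guarantees, and
the function names and sample text are hypothetical:

```python
import re


def split_row(line, min_gap=2):
    """Split one line of OCR text into cells.

    Assumption: a run of at least `min_gap` whitespace characters marks a
    column boundary, while single spaces belong to the cell content.
    """
    pattern = r"\s{%d,}" % min_gap
    return [cell for cell in re.split(pattern, line.strip()) if cell]


def extract_table(text, min_cols=2):
    """Keep only the lines that split into at least `min_cols` cells.

    Running prose normally contains single spaces only, so it collapses
    into one cell and is skipped.
    """
    rows = []
    for line in text.splitlines():
        cells = split_row(line)
        if len(cells) >= min_cols:
            rows.append(cells)
    return rows


# Made-up sample: two prose lines around an unevenly spaced table.
sample = """Annual report 2012
Region      Budget    Spent
North        1 200      950
South  800        790
Notes follow on the next page.
"""

for row in extract_table(sample):
    print(row)
```

Rows with empty cells or columns that drifted closer than two spaces would
need a smarter rule (for example fixed column positions), so this is only a
starting point.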


2013/4/12 Yasha Ac <yasha.ac at gmail.com>

>
> Hello all,
>
> As always, great thread. Does anyone know of good software to scrape
> large .pdf files? We're talking reports with 100+ pages, lots of text and
> sporadic (ill-formed) tables.
>
> Many thanks
>
>  - Yásh
>
>
> On Fri, Apr 12, 2013 at 10:29 AM, Friedrich Lindenberg <
> friedrich.lindenberg at okfn.org> wrote:
>
>> Hey,
>>
>> On Fri, Apr 12, 2013 at 11:01 AM, mirko.lorenz at gmail.com <
>> mirko.lorenz at gmail.com> wrote:
>>
>>> With Needlebase you were able to define what needed to be scraped from
>>> an overview page (e.g. list of all parliament members in Germany), then you
>>> could define what links to follow and what fields to scrape. With a bit of
>>> effort you created a script for multipage scraping and could send out the
>>> application to collect the defined information - all the results were then
>>> collected into a table. Neat.
>>
>>
>> I was playing with the new ScraperWiki prototype yesterday (yay!) and
>> they've made this much more DIY/hard-core: you can basically make your own
>> apps on top of their platform now. I'm wondering if one could build a
>> Needlebase-style WYSIWYG multi-level scraper on top of it. Basically: fetch
>> remote web sites, add JS to allow highlighting of data fields and then
>> extract data automatically based on these patterns.
>>
>> Would anybody be up for exploring this?
>>
>> - Friedrich
>>
>>
>>
>>>
>>> Kind of amazing that there was a good solution (as said in another mail:
>>> maybe a bit too good) and that it simply was taken down. There was no real
>>> communication about why that happened, even though I contacted the team
>>> directly.
>>>
>>> Point is: There would be an opportunity here, although I am not sure
>>> whether that could be sustainable.
>>>
>>> /Mirko
>>>
>>>
>>> 2013/4/12 Michael Bauer <michael.bauer at okfn.org>
>>>
>>>> One thing that struck me as interesting with Convextra is the multi-page
>>>> scraping it does. I had not seen this before (never used Needlebase, though).
>>>>
>>>> Michael
>>>>
>>>> On Thu, Apr 11, 2013 at 02:56:37PM +0200, mirko.lorenz at gmail.com wrote:
>>>> > I wish we had Needlebase back. It would have solved a lot of issues,
>>>> > but was probably too good, e.g. it was possible to scrape page by page
>>>> > with relative ease.
>>>> >
>>>> > 2013/4/11 <SMachlis at computerworld.com>
>>>> >
>>>> > > Agreed, although if you're only scraping a couple of pages it's not
>>>> > > too much of a problem to select all, copy, and paste into a local
>>>> > > spreadsheet.
>>>> > >
>>>> > > ________________________________________
>>>> > >
>>>> > > Indeed, and I would still recommend it over a purely web-based
>>>> > > service. It would be great if the scraper extension allowed saving
>>>> > > locally instead of exporting to Google Docs.
>>>> > >
>>>> > > Michael
>>>> > >
>>>> > >
>>>> > > _______________________________________________
>>>> > > data-driven-journalism mailing list
>>>> > > data-driven-journalism at lists.okfn.org
>>>> > > http://lists.okfn.org/mailman/listinfo/data-driven-journalism
>>>> > > Unsubscribe:
>>>> http://lists.okfn.org/mailman/options/data-driven-journalism
>>>> > >
>>>>
>>>>
>>>>
>>>> --
>>>> Data Wrangler with the Open Knowledge Foundation (OKFN.org)
>>>> GPG/PGP key: http://tentacleriot.eu/mihi.asc
>>>> Twitter: @mihi_tr Skype: mihi_tr
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>


-- 
Tommy Kaas
Journalist and partner,

Kaas & Mulvad
Lykkesholms Alle 2A, 3
1902 Frederiksberg
tlf. 27268818
e-mail: tommy.kaas at kaasogmulvad.dk
Twitter: @tbkaas
web: www.kaasogmulvad.dk