[ddj] scraping data with a bookmarklet: Convextra

Yasha Ac yasha.ac at gmail.com
Fri Apr 12 11:27:32 UTC 2013


Hello all,

As always, great thread. Does anyone know of good software for scraping
large .pdf files? We're talking about reports with 100+ pages, lots of text
and sporadic (ill-formed) tables.

Many thanks

 - Yásh


On Fri, Apr 12, 2013 at 10:29 AM, Friedrich Lindenberg <
friedrich.lindenberg at okfn.org> wrote:

> Hey,
>
> On Fri, Apr 12, 2013 at 11:01 AM, mirko.lorenz at gmail.com <
> mirko.lorenz at gmail.com> wrote:
>
>> With Needlebase you were able to define what needed to be scraped from
>> an overview page (e.g. a list of all parliament members in Germany), then
>> you could define which links to follow and which fields to scrape. With a
>> bit of effort you created a script for multipage scraping and could send
>> out the application to collect the defined information - all the results
>> were then collected into a table. Neat.
>
>
> I was playing with the new ScraperWiki prototype yesterday (yay!) and
> they've made this much more DIY/hard-core: you can basically make your own
> apps on top of their platform now. I'm wondering if one could build a
> Needlebase-style WYSIWYG multi-level scraper on top of it. Basically: fetch
> remote web sites, add JS to allow highlighting of data fields and then
> extract data automatically based on these patterns.
>
> Would anybody be up for exploring this?
>
> - Friedrich
>
>
>
>>
>> Kind of amazing that there was a good solution (as said in another mail:
>> maybe a bit too good) and that it was simply taken down. There was no real
>> communication about why that happened, even though I had contacted the
>> team directly.
>>
>> Point is: There would be an opportunity here, although I am not sure
>> whether that could be sustainable.
>>
>> /Mirko
>>
>>
>> 2013/4/12 Michael Bauer <michael.bauer at okfn.org>
>>
>>> One thing that struck me as interesting about Convextra is the
>>> multi-page scraping it does. I had not seen this before (I never used
>>> Needlebase, though).
>>>
>>> Michael
>>>
>>> On Thu, Apr 11, 2013 at 02:56:37PM +0200, mirko.lorenz at gmail.com wrote:
>>> > I wish we had Needlebase back. It would have solved a lot of issues,
>>> > but was probably too good, e.g. it was possible to scrape page by page
>>> > with relative ease.
>>> >
>>> > 2013/4/11 <SMachlis at computerworld.com>
>>> >
>>> > > Agreed, although if you're only scraping a couple of pages it's not
>>> > > too much of a problem to select all, copy, and paste into a local
>>> > > spreadsheet.
>>> > >
>>> > > ________________________________________
>>> > >
>>> > > Indeed, and I would still recommend it over a purely web-based
>>> > > service. It would be great if the scraper extension allowed local
>>> > > saving instead of Google Docs export.
>>> > >
>>> > > Michael
>>> > >
>>> > >
>>> > > _______________________________________________
>>> > > data-driven-journalism mailing list
>>> > > data-driven-journalism at lists.okfn.org
>>> > > http://lists.okfn.org/mailman/listinfo/data-driven-journalism
>>> > > Unsubscribe: http://lists.okfn.org/mailman/options/data-driven-journalism
>>> > >
>>>
>>>
>>>
>>> --
>>> Data Wrangler with the Open Knowledge Foundation (OKFN.org)
>>> GPG/PGP key: http://tentacleriot.eu/mihi.asc
>>> Twitter: @mihi_tr Skype: mihi_tr
>>>
>>>
>>
>>
>>
>>
>
>
>
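For readers who want to try the overview-page workflow described in the
thread (start from a list page, follow selected links, pull named fields
from each detail page, collect everything into a table), here is a minimal
sketch using only the Python standard library. It is not how Needlebase or
Convextra worked internally; the class names "member", "name" and "party",
the example.org URLs and the in-memory PAGES dictionary are all made up for
illustration, and fields are matched simply by CSS class name.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given class."""

    def __init__(self, link_class):
        super().__init__()
        self.link_class = link_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.link_class in classes and attrs.get("href"):
            self.links.append(attrs["href"])


class FieldCollector(HTMLParser):
    """Collect the text of elements whose class is one of the wanted fields.

    Simplification: reading stops at the first closing tag, so text after
    a nested element inside a field is not captured.
    """

    def __init__(self, fields):
        super().__init__()
        self.fields = set(fields)
        self.current = None  # field currently being read, if any
        self.row = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.fields:
                self.current = name
                self.row.setdefault(name, "")

    def handle_data(self, data):
        if self.current:
            self.row[self.current] += data

    def handle_endtag(self, tag):
        self.current = None


def scrape(overview_url, fetch, link_class, fields):
    """Follow marked links on the overview page; return one row per page."""
    overview = LinkCollector(link_class)
    overview.feed(fetch(overview_url))
    table = []
    for href in overview.links:
        detail_url = urljoin(overview_url, href)  # resolve relative links
        detail = FieldCollector(fields)
        detail.feed(fetch(detail_url))
        row = {name: detail.row.get(name, "").strip() for name in fields}
        row["url"] = detail_url
        table.append(row)
    return table


# Hypothetical pages held in memory so the sketch runs without a network.
PAGES = {
    "http://example.org/members/": (
        '<ul><li><a class="member" href="a.html">A</a></li>'
        '<li><a class="member" href="b.html">B</a></li>'
        '<li><a href="imprint.html">Imprint</a></li></ul>'
    ),
    "http://example.org/members/a.html":
        '<h1 class="name">Ada Example</h1><p class="party">Party X</p>',
    "http://example.org/members/b.html":
        '<h1 class="name">Bob Sample</h1><p class="party">Party Y</p>',
}

rows = scrape("http://example.org/members/", PAGES.__getitem__,
              "member", ["name", "party"])
for row in rows:
    print(row)
```

To run this against live pages, the fetch argument could be replaced by a
small function built on urllib.request.urlopen that returns the decoded
page text. Real sites will usually need more robust matching than a single
class name.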