[ddj] scraping data with a bookmarklet: Convextra

Friedrich Lindenberg friedrich.lindenberg at okfn.org
Fri Apr 12 09:29:57 UTC 2013


Hey,

On Fri, Apr 12, 2013 at 11:01 AM, mirko.lorenz at gmail.com wrote:

> With Needlebase you were able to define what needed to be scraped from an
> overview page (e.g. a list of all parliament members in Germany), then you
> could define which links to follow and which fields to scrape. With a bit of
> effort you created a script for multi-page scraping and could send out the
> application to collect the defined information - all the results were then
> collected into a table. Neat.
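
That workflow is essentially a two-level scrape: pull the links off the
overview page, follow each one, and collect the same fields from every detail
page into one table. A rough Python sketch of the idea (the URL and XPath
expressions below are placeholders, not anything Needlebase actually used):

    # Rough sketch only -- placeholder URL and XPath expressions.
    import requests
    from lxml import html
    from urllib.parse import urljoin

    OVERVIEW = "http://example.org/members"        # overview page (placeholder)
    LINKS = "//table//a/@href"                     # which links to follow
    FIELDS = {                                     # fields to scrape per detail page
        "name": "//h1/text()",
        "party": "//span[@class='party']/text()",
    }

    rows = []
    overview = html.fromstring(requests.get(OVERVIEW).content)
    for href in overview.xpath(LINKS):
        detail = html.fromstring(requests.get(urljoin(OVERVIEW, href)).content)
        rows.append({name: (detail.xpath(xp) or [""])[0].strip()
                     for name, xp in FIELDS.items()})
    # 'rows' is the table the tool would hand back.

The part Needlebase did well was letting you point and click instead of
writing those expressions by hand.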


I was playing with the new ScraperWiki prototype yesterday (yay!) and
they've made this much more DIY/hard-core: you can basically make your own
apps on top of their platform now. I'm wondering if one could build a
Needlebase-style WYSIWYG multi-level scraper on top of it. Basically: fetch
remote web sites, add JS that lets the user highlight data fields, and then
extract data automatically based on those patterns.
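
The extraction step could stay quite dumb: take the element the user
highlighted, generalise its DOM path into a pattern, and run that pattern over
the whole page. Again just a sketch with a placeholder URL -- and the 'clicked'
element is faked here, since in the real thing it would come back from the
injected JS:

    import re
    import requests
    from lxml import html

    doc = html.fromstring(requests.get("http://example.org/members").content)
    tree = doc.getroottree()

    clicked = doc.xpath("//table//td")[0]      # stand-in for the highlighted cell

    # Generalise the element's path into a pattern: drop the row index but keep
    # the column index, so /.../tr[5]/td[2] matches that column in every row.
    # (A real tool would have to be smarter about which indices to drop.)
    path = tree.getpath(clicked)
    pattern = re.sub(r"tr\[\d+\]", "tr", path)

    values = [el.text_content().strip() for el in doc.xpath(pattern)]
    print(pattern, "->", len(values), "values")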

Would anybody be up for exploring this?

- Friedrich



>
> Kind of amazing that there was a good solution (as I said in another mail:
> maybe a bit too good) and that it was simply taken down. There was no real
> communication about why that happened; I had even contacted the team directly.
>
> Point is: There would be an opportunity here, although I am not sure
> whether that could be sustainable.
>
> /Mirko
>
>
> 2013/4/12 Michael Bauer <michael.bauer at okfn.org>
>
>> One thing that struck me as interesting about Convextra is the multi-page
>> scraping it does. That was new to me (I never used Needlebase, though).
>>
>> Michael
>>
>> On Thu, Apr 11, 2013 at 02:56:37PM +0200, mirko.lorenz at gmail.com wrote:
>> > I wish we had Needlebase back. It would have solved a lot of issues, but
>> > it was probably too good, e.g. it was possible to scrape page by page
>> > with relative ease.
>> >
>> > 2013/4/11 <SMachlis at computerworld.com>
>> >
>> > > Agreed, although if you're only scraping a couple of pages it's not too
>> > > much of a problem to select all, copy, and paste into a local
>> > > spreadsheet.
>> > >
>> > > ________________________________________
>> > >
>> > > Indeed, and I would still recommend it over a purely web-based service.
>> > > It would be great if the scraper extension allowed local saving instead
>> > > of Google Docs export.
>> > >
>> > > Michael
>> > >
>> > >
>>
>>
>> --
>> Data Wrangler with the Open Knowledge Foundation (OKFN.org)
>> GPG/PGP key: http://tentacleriot.eu/mihi.asc
>> Twitter: @mihi_tr Skype: mihi_tr