[ddj] Geocoding tutorials on the School of Data Blog

Brandon Roberts brandon at bxroberts.org
Sat Nov 8 17:10:24 UTC 2014


Hello,

Longtime scraper here. I took a look at the map and found a way for you
to scrape the points / data.

Each point on the map links to a "detail" page, e.g.,
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map/piracy-map-2013/details/133/530

The only part that changes is that last number (530), which appears to
be an ID field in their DB.

You can brute-force scrape the entire DB by stepping through all the
IDs. So write some code that grabs
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map/piracy-map-2013/details/133/1
through
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map/piracy-map-2013/details/133/XXX
(where XXX is the highest ID you find). The points and all the data are
found at the same XPath position on every page, so it will be trivial to
extract them.

This works on all kinds of sites that aren't expecting it.

If you're using a Unix-like operating system, this bash loop will
download the detail pages for IDs 1 through 1000:

# --no-check-certificate skips TLS certificate verification
for I in {1..1000}; do
  wget --no-check-certificate \
    "https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map/piracy-map-2013/details/133/$I"
done
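
If you want something a bit more defensive than the loop above, here is
a rough sketch in plain POSIX sh. Only the base URL comes from the
example; the function names (detail_url, fetch_range), the output file
names, the failed_ids.txt log and the one-second delay are my own
assumptions, so adjust to taste.

```shell
#!/bin/sh
# Sketch only: BASE_URL is taken from the example above; everything else
# (function names, file names, the delay) is an illustrative assumption.
BASE_URL="https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map/piracy-map-2013/details/133"

# Print the detail-page URL for a single ID.
detail_url() {
    printf '%s/%s\n' "$BASE_URL" "$1"
}

# Download IDs $1 through $2, one file per ID.
# IDs that fail to download are appended to failed_ids.txt.
fetch_range() {
    id=$1
    while [ "$id" -le "$2" ]; do
        if ! wget --no-check-certificate -q -O "detail_$id.html" "$(detail_url "$id")"; then
            echo "$id" >> failed_ids.txt
            rm -f "detail_$id.html"   # wget -O leaves an empty file on failure
        fi
        sleep 1                       # pause between requests
        id=$((id + 1))
    done
}
```

Running "fetch_range 1 1000" after sourcing this should cover the same
IDs as the loop above, and failed_ids.txt gives you a rough idea of
where the real IDs stop.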

Hope this helps,
Brandon

-- 
Brandon Roberts
http://bxroberts.org
c. 3608703022 o. 5122212136
Developer, Data Journalist, Hacker


On 11/07/2014 12:05 PM, Idoia Sota wrote:
> Dear All,
>
>     I'm trying to scrape this map with no success at
> all: https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map/piracy-map-2013
> and https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map
>
>      There are two levels of information in it: the one on the tooltip
> and the one on the link that appears on the tooltip. 
>
>      I've tried ScraperWiki (JSON), but it gives me an error (I don't
> even know if it makes sense to use it). Then I tried to code on
> ScraperWiki, but getting the HTML code gave this error (image
> attached). It seems I need to have some certificate for the page.
> Nevertheless, I can see all the data when I click on "see the html code
> of this page" (image 2 attached).
>
>    Can anybody tell me what would be the best thing to do with this? Can
> you help me? Thank you so much!
>
> Idoia
>
>
>
>
> 2013-02-19 14:55 GMT+01:00 Lucy Chambers <lucy.chambers at okfn.org>:
>
>     Hi All, 
>
>     If anyone has ever wanted to know how to convert simple place
>     names in a spreadsheet to lat and long values so that they can put
>     their data on a map, Rufus Pollock has just put up a couple of
>     tutorials on the School of Data blog.
>
>     An introduction to
>     Geocoding: http://schoolofdata.org/2013/02/19/geocoding-part-i-introduction-to-geocoding/
>
>     Geocoding in a Google Docs
>     Spreadsheet: http://schoolofdata.org/2013/02/19/geocoding-part-ii-geocoding-data-in-a-google-docs-spreadsheet/ 
>
>     We'll be looking to port them over to the School of Data Handbook
>     in the near future, so please let us know what you think of them
>     (feel free to use the blog comments or the mailing lists!). 
>
>     More soon, 
>
>     Lucy 
>
>     _______________________________________________
>     data-driven-journalism mailing list
>     data-driven-journalism at lists.okfn.org
>     http://lists.okfn.org/mailman/listinfo/data-driven-journalism
>     Unsubscribe:
>     http://lists.okfn.org/mailman/options/data-driven-journalism
>
