[ddj] Scraping piracy data from a map using OpenRefine

Sat Nov 8 19:21:11 UTC 2014

[Pulling this into its own thread]

On Fri, Nov 7, 2014 at 1:05 PM, Idoia Sota <idoiasota at gmail.com> wrote:

> Dear All,
>
>     I'm trying to scrape this map with no success at all:
> https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map/piracy-map-2013
> and https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map
>
>      There are two levels of information in it: the one on the tooltip and
> the one on the link that appears on the tooltip.
>
>      I've tried ScraperWiki (Json), but it gives me an error (I don't even
> know if it makes sense to use it). And then tried to code on scraperwiki,
> but getting the html code gave this error (image attached). It seems I need
> to have some certificate for the page. Nevertheless, I can see all the data
> when I click on "see the html code of this page". (image 2 attached)
>
>    Can anybody tell me what the best thing to do with this would be? Can
> you help me? Thank you so much!
>
> Idoia
>

You can get the basic data by using View Source in your browser on the
page, locating the JSON map data and pasting it into OpenRefine using the
Clipboard source in the Create Project dialog.  It'll parse the JSON
automatically and you can then use its UI to split apart fields that hold
multiple types of data (e.g. the HTML detail for the map popup) and fetch
the contents of the detailed incident pages.

The JSON map data looks like this:

new FbGoogleMapViz('table_map', {"icons":[{"0":"4.8833333333333","1":"
-1.6833333333332803","2":"Attack ID: 264-13 <br \/> Date: 2013-12-31 <br
\/> Vessel: General Cargo <br \/> Status: Boarded <br \/> Full report: <a
class=\"fabrik___rowlink\"
href=\"\/piracy-reporting-centre\/live-piracy-map\/piracy-map-2013\/details\/133\/623\"
title=\"View\"><img
src=\"https:\/\/www.icc-ccs.org\/media\/com_fabrik\/images\/view.png\"
alt=\"View\"
\/><span>View<\/span><\/a>","3":"orange-dot.png","4":25,"5":25,"groupkey":"0","listid":"144","title":""},

...

}],"polyline":[],"id":"46","zoomlevel":2,"scalecontrol":false,"maptypecontrol":true,"overviewcontrol":false,"center":"middle","ajax_refresh":false,"ajax_refresh_center":"1","maptype":"G_HYBRID_MAP","clustering":false,"cluster_splits":"10,50","icon_increment":"2","refresh_rate":"10000","use_cookies":true,"container":"googlemap_46","polylinewidth":["10"],"polylinecolour":["#CCFFFF"],"overlay_urls":[],"overlay_labels":[],"use_overlays":0,"use_overlays_sidebar":false,"groupTemplates":[],"zoomStyle":0,"zoom":"1"})

and the bit that you want to paste into OpenRefine (you could use Python's
JSON parsing too) is the part between the outer curly braces {} (including
the braces).
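
For anyone who prefers the Python route, here is a minimal sketch of
pulling that brace-delimited block out of the page source and parsing it.
It assumes you already have the page source as a string (saved from View
Source, say); the function name extract_map_json and the shortened sample
string are mine, for illustration only.

```python
import json


def extract_map_json(page_source, marker="new FbGoogleMapViz("):
    """Parse the {...} options object that follows the FbGoogleMapViz call."""
    start = page_source.index(marker)
    brace = page_source.index("{", start)
    # raw_decode parses exactly one JSON value starting at `brace` and
    # ignores whatever JavaScript follows the closing brace.
    obj, _end = json.JSONDecoder().raw_decode(page_source, brace)
    return obj


# Shortened, made-up stand-in for the real page source.
sample_source = r"""<script>new FbGoogleMapViz('table_map', {"icons":[{"0":"4.8833333333333","1":"-1.6833333333332803","2":"Attack ID: 264-13 <br \/> Date: 2013-12-31","3":"orange-dot.png","4":25,"5":25}],"zoomlevel":2});</script>"""

data = extract_map_json(sample_source)
icons = data["icons"]
print(len(icons), icons[0]["2"])
```

Each entry in data["icons"] is one map marker; the escaped \/ sequences in
the source come out as plain slashes once parsed.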

After doing all the transformations on the first page, you can reuse them
on the second page by extracting and applying the operation history to
perform all the operations in a matter of seconds.
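
If you go the Python route instead, the rough equivalent of a reusable
operation history is a function that you apply to the records from each
page. Below is a sketch of the field splitting under a few assumptions:
that keys "0" and "1" are latitude and longitude (my guess from the
values), that the popup HTML always has the "Label: value <br />" shape
shown in the sample above, and that the report link is relative to
https://www.icc-ccs.org. The name split_icon is hypothetical.

```python
import re


def split_icon(icon, base_url="https://www.icc-ccs.org"):
    """Flatten one parsed map icon record into a dict of named fields."""
    html = icon["2"]
    # Assumption: "0" is latitude and "1" is longitude.
    row = {"latitude": float(icon["0"]), "longitude": float(icon["1"])}
    # The popup holds "Label: value" pairs separated by <br /> tags.
    for part in re.split(r"<br\s*/?>", html):
        label, sep, value = part.partition(":")
        label = label.strip()
        if sep and label in ("Attack ID", "Date", "Vessel", "Status"):
            row[label] = value.strip()
    # The first href in the popup is the link to the detailed report.
    link = re.search(r'href="([^"]+)"', html)
    row["report_url"] = base_url + link.group(1) if link else None
    return row


# One record as it looks after JSON parsing (taken from the sample above).
icon = {
    "0": "4.8833333333333",
    "1": "-1.6833333333332803",
    "2": 'Attack ID: 264-13 <br /> Date: 2013-12-31 <br /> '
         'Vessel: General Cargo <br /> Status: Boarded <br /> Full report: '
         '<a class="fabrik___rowlink" href="/piracy-reporting-centre/'
         'live-piracy-map/piracy-map-2013/details/133/623" title="View">'
         '<img src="https://www.icc-ccs.org/media/com_fabrik/images/view.png" '
         'alt="View" /><span>View</span></a>',
}

row = split_icon(icon)
print(row)
```

Running the same function over the icons list from each map page gives
rows that can be written out with csv.DictWriter.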

Unfortunately, the site's web server is misconfigured, which is why Python
couldn't access it. Java (which OpenRefine is written in) wasn't any
happier until I provided it with the missing SSL certificates and told it
to ignore the misconfigured SNI.  Before I did that, the first thing I
tried was accessing the site over HTTP instead of HTTPS, but they've got
that set up to just redirect back to HTTPS.  Foiled!

I can write up the details of how to work around the SSL problem if anyone
is interested, but it's an advanced topic (which should only rarely be
required).

Attached are the CSV files with the data extracted from the two maps. The
OpenRefine projects are a little bulky for the list (4 MB) because they
include the HTML for all the linked pages, but I'm happy to send them to
anyone who's interested.

Tom
-------------- next part --------------
A non-text attachment was scrubbed...
Name: https-www-icc-ccs-org-piracy-reporting-centre-live-piracy-map-piracy-map-2013.csv
Type: text/csv
Size: 155062 bytes
Desc: not available
URL: <http://lists.okfn.org/pipermail/data-driven-journalism/attachments/20141108/5bc9bc92/attachment-0004.csv>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: https-www-icc-ccs-org-piracy-reporting-centre-live-piracy-map.csv
Type: text/csv
Size: 120033 bytes
Desc: not available
URL: <http://lists.okfn.org/pipermail/data-driven-journalism/attachments/20141108/5bc9bc92/attachment-0005.csv>