[School-of-data] [ddj] Scraping piracy data from a map using OpenRefine

William Shubert (wshubert@INTERNEWS.ORG) wshubert at INTERNEWS.ORG
Mon Nov 10 02:13:58 UTC 2014

Hi everyone,

This is a fascinating thread. I’ve come across this issue multiple times and it’s great to have this workflow documented so beautifully.

I don’t want to be a wet blanket, but this data is available on the IMO’s Global Integrated Shipping Information System<https://gisis.imo.org/Public/Default.aspx> website. You do have to register for an account, but the data is downloadable as an Excel spreadsheet with coordinates and all relevant attributes included.

We used this IMO dataset for our Ekuatorial project to create a map of Ocean Commerce<http://ekuatorial.com/en/embed/?map_only=1&map_id=4959&width=960&height=480&lat=0.7470491450051796&lon=103.42529296875&zoom=6> in the area surrounding Indonesia, including the Strait of Malacca and the South China Sea. Incident reports pop up on click.

From: data-driven-journalism [mailto:data-driven-journalism-bounces at lists.okfn.org] On Behalf Of Thomas Levine
Sent: Sunday, November 09, 2014 6:56 AM
To: Mailing list for the School of Data, a joint initiative of the OKFN and P2PU; Tom Morris
Cc: List about Data Driven Journalism and Open Data in Journalism.
Subject: Re: [ddj] [School-of-data] Scraping piracy data from a map using OpenRefine

Or I'll write up the details of working around the SSL problem: Set the `verify` argument to `False` or to the name of the certificate file.
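A minimal sketch of what that looks like with the Python `requests` library (the bundle filename is illustrative): `verify=False` skips certificate verification entirely, while a filename points `requests` at a custom CA bundle containing the site’s certificate chain.

```python
import requests

# Configure a session whose requests skip certificate verification.
# For a one-off scrape this is the quick fix, though it disables the
# protection SSL normally provides.
session = requests.Session()
session.verify = False          # or: session.verify = "icc-ccs-chain.pem"

# html = session.get("https://www.icc-ccs.org/piracy-reporting-centre/"
#                    "live-piracy-map").text
```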
On November 8, 2014 11:21:11 AM PST, Tom Morris <tfmorris at gmail.com<mailto:tfmorris at gmail.com>> wrote:
[Pulling this into its own thread]

On Fri, Nov 7, 2014 at 1:05 PM, Idoia Sota <idoiasota at gmail.com<mailto:idoiasota at gmail.com>> wrote:
Dear All,

    I'm trying to scrape this map with no success at all: https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map/piracy-map-2013 and https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map

     There are two levels of information in it: the one on the tooltip and the one on the link that appears on the tooltip.

     I've tried ScraperWiki (JSON), but it gives me an error (I don't even know if it makes sense to use it). I then tried to write code on ScraperWiki, but fetching the HTML gave the error in the attached image. It seems I need some certificate for the page. Nevertheless, I can see all the data when I click on "see the html code of this page" (image 2 attached).

   Can anybody tell me what the best approach would be here? Can you help me? Thank you so much!


You can get the basic data by using View Source in your browser on the page, locating the JSON map data and pasting it into OpenRefine using the Clipboard source in the Create Project dialog.  It'll parse the JSON automatically, and you can then use its UI to split apart fields with multiple types of data (e.g. the HTML detail for the map popup) and fetch the contents of the detailed incident page.
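If you'd rather split the popup HTML apart in Python than in OpenRefine's UI, a rough sketch (the sample string mirrors the shape of the popup HTML embedded in the map JSON, shown below):

```python
import re

# Illustrative popup HTML in the shape the map JSON embeds.
popup = ('Attack ID: 264-13 <br /> Date: 2013-12-31 <br /> '
         'Vessel: General Cargo <br /> Status: Boarded <br /> '
         'Full report: <a href="/piracy-reporting-centre/live-piracy-map/'
         'piracy-map-2013/details/133/623">View</a>')

# Split the "Label: value" pairs on the <br /> separators,
# partitioning each piece on its first colon.
fields = {}
for part in popup.split("<br />"):
    label, _, value = part.partition(":")
    fields[label.strip()] = value.strip()

# Pull out the relative link to the detailed incident page.
link = re.search(r'href="([^"]+)"', popup).group(1)
```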

The JSON map data looks like this:

new FbGoogleMapViz('table_map', {"icons":[{"0":"4.8833333333333","1":" -1.6833333333332803","2":"Attack ID: 264-13 <br \/> Date: 2013-12-31 <br \/> Vessel: General Cargo <br \/> Status: Boarded <br \/> Full report: <a class=\"fabrik___rowlink\" href=\"\/piracy-reporting-centre\/live-piracy-map\/piracy-map-2013\/details\/133\/623\" title=\"View\"><img src=\"https:\/\/www.icc-ccs.org\/media\/com_fabrik\/images\/view.png\" alt=\"View\" \/><span>View<\/span><\/a>","3":"orange-dot.png","4":25,"5":25,"groupkey":"0","listid":"144","title":""},



and the bit that you want to paste into OpenRefine (you could use Python's JSON parsing too) is the part between the outer curly braces {} (including the braces).
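If you do go the Python route, one way to carve that object out of the page source is to scan from the first opening brace to its matching close and hand the slice to the JSON parser (the `page_source` string here is a shortened stand-in for the real View Source output; the simple brace count assumes no braces inside string values, which holds for this data):

```python
import json

# Shortened stand-in for the HTML you'd get from View Source.
page_source = '''new FbGoogleMapViz('table_map', {"icons":[{"0":"4.88",
"1":"-1.68","2":"Attack ID: 264-13","3":"orange-dot.png"}]});'''

# Find the opening brace of the map-data argument, then walk forward
# counting braces until the matching close.
start = page_source.index("{", page_source.index("FbGoogleMapViz"))
depth = 0
for end, ch in enumerate(page_source[start:], start):
    depth += {"{": 1, "}": -1}.get(ch, 0)
    if depth == 0:
        break

data = json.loads(page_source[start:end + 1])
```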

After doing all the transformations on the first page, you can reuse them on the second page by extracting and applying the operation history to perform all the operations in a matter of seconds.

Unfortunately, the site's web server is misconfigured which is why Python couldn't access it and Java (which OpenRefine is written in) wasn't any happier until I provided it with the missing SSL certificates and told it to ignore the misconfigured SNI.  Before I did that, the first thing I tried was accessing the site over HTTP instead of HTTPS, but they've got that set up to just redirect back to HTTPS.  Foiled!
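For the Java side, the two steps described above usually come down to something like the following (paths, alias, and the default `changeit` truststore password are illustrative, and the `cacerts` location varies by JVM version; if your `refine` launcher doesn't honor `JAVA_OPTIONS`, the flag can go in `refine.ini` instead):

```shell
# Import the site's certificate chain into the JVM truststore.
keytool -importcert -file icc-ccs.pem \
        -keystore "$JAVA_HOME/lib/security/cacerts" \
        -storepass changeit -alias icc-ccs

# Start OpenRefine with SNI disabled so Java tolerates the
# server's misconfigured SNI handling.
JAVA_OPTIONS=-Djsse.enableSNIExtension=false ./refine
```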

I can write up the details of how to work around the SSL problem if anyone is interested, but it's an advanced topic (which should only be required rarely).

Attached are the CSV files with the data extracted from the two maps. The OpenRefine projects are a little bulky for the list (4 MB) because they include the HTML for all the linked pages, but I'm happy to send them to anyone who's interested.



school-of-data mailing list
school-of-data at lists.okfn.org<mailto:school-of-data at lists.okfn.org>
Unsubscribe: https://lists.okfn.org/mailman/options/school-of-data
