[ddj] [School-of-data] Scraping piracy data from a map using OpenRefine

Mon Nov 10 19:59:25 UTC 2014

Thanks much for the explanation.

I think that even if it's a problem we might not encounter often, it
deserves its own tutorial. Someone at some point will need this solution,
and writing a tutorial makes it easier to translate the solution into
several languages.

Cédric Lombion
Project and Community Coordinator
Open Knowledge Foundation France
@clombion / +33 673 863 914

2014-11-08 20:21 GMT+01:00 Tom Morris <tfmorris at gmail.com>:

> [Pulling this into its own thread]
>
> On Fri, Nov 7, 2014 at 1:05 PM, Idoia Sota <idoiasota at gmail.com> wrote:
>
>> Dear All,
>>
>>     I'm trying to scrape this map with no success at all:
>> https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map/piracy-map-2013
>> and https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map
>>
>>      There are two levels of information in it: the one on the tooltip
>> and the one on the link that appears on the tooltip.
>>
>>      I've tried ScraperWiki (JSON), but it gives me an error (I don't
>> even know if it makes sense to use it). I then tried to code on
>> ScraperWiki, but getting the HTML code gave this error (image attached). It
>> seems I need to have some certificate for the page. Nevertheless, I can see
>> all the data when I click on "see the html code of this page" (image 2
>> attached).
>>
>>    Can anybody tell me what the best thing to do with this would be? Can you
>> help me? Thank you so much!
>>
>> Idoia
>>
>
> You can get the basic data by using View Source in your browser on the
> page, locating the JSON map data and pasting it into OpenRefine using the
> Clipboard source for the Create Project dialog.  It'll parse the JSON
> automatically and you can then use its UI to split apart fields with
> multiple types of data (e.g. the HTML detail for the map popup) and fetch
> the contents of the detailed incident page.
>
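[Inline note] Splitting the popup field can also be sketched outside OpenRefine. The Python below is only a rough illustration, not the steps Tom performed: it assumes each popup value is a run of "Label: value" pieces separated by <br /> tags, as in the sample record quoted in this thread. The function name split_popup and the key report_path are made up for this example, and the sample string is abbreviated (the <img> tag is left out).

```python
import re
from html import unescape


def split_popup(popup_html):
    """Split one map-popup string into labelled fields plus the report link.

    Assumes "Label: value" pieces separated by <br /> tags (hypothetical
    helper, based only on the sample record shown in this thread).
    """
    record = {}
    for part in re.split(r"<br\s*/?>", popup_html):
        # Split on the first colon only, so URLs in the value stay intact.
        label, sep, value = part.partition(":")
        if not sep:
            continue  # no "Label: value" shape, skip this piece
        label = label.strip()
        if label == "Full report":
            # Keep only the link target of the detail page.
            match = re.search(r'href="([^"]+)"', value)
            record["report_path"] = unescape(match.group(1)) if match else None
        else:
            record[label] = unescape(value.strip())
    return record


# Abbreviated sample, already JSON-decoded (so "\/" has become "/").
sample = (
    'Attack ID: 264-13 <br /> Date: 2013-12-31 <br /> Vessel: General Cargo '
    '<br /> Status: Boarded <br /> Full report: <a class="fabrik___rowlink" '
    'href="/piracy-reporting-centre/live-piracy-map/piracy-map-2013/details/133/623" '
    'title="View"><span>View</span></a>'
)

fields = split_popup(sample)
print(fields)
```

The report_path value is relative, so it would need the site's host name prepended before the detail page could be fetched.
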
> The JSON map data looks like this:
>
> new FbGoogleMapViz('table_map', {"icons":[{"0":"4.8833333333333","1":"
> -1.6833333333332803","2":"Attack ID: 264-13 <br \/> Date: 2013-12-31 <br
> \/> Vessel: General Cargo <br \/> Status: Boarded <br \/> Full report: <a
> class=\"fabrik___rowlink\"
> href=\"\/piracy-reporting-centre\/live-piracy-map\/piracy-map-2013\/details\/133\/623\"
> title=\"View\"><img src=\"https:\/\/www.icc-ccs.org\/media\/com_fabrik\/images\/view.png\"
> alt=\"View\"
> \/><span>View<\/span><\/a>","3":"orange-dot.png","4":25,"5":25,"groupkey":"0","listid":"144","title":""},
>
> ...
>
>
> }],"polyline":[],"id":"46","zoomlevel":2,"scalecontrol":false,"maptypecontrol":true,"overviewcontrol":false,"center":"middle","ajax_refresh":false,"ajax_refresh_center":"1","maptype":"G_HYBRID_MAP","clustering":false,"cluster_splits":"10,50","icon_increment":"2","refresh_rate":"10000","use_cookies":true,"container":"googlemap_46","polylinewidth":["10"],"polylinecolour":["#CCFFFF"],"overlay_urls":[],"overlay_labels":[],"use_overlays":0,"use_overlays_sidebar":false,"groupTemplates":[],"zoomStyle":0,"zoom":"1"})
>
>
> and the bit that you want to paste into OpenRefine (you could use Python's
> JSON parsing too) is the part between the outer curly braces {} (including
> the braces).
>
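[Inline note] For the Python route mentioned above, one possible sketch is to find the FbGoogleMapViz call in the saved page source and let the standard json module decode the object that starts at the first opening brace. This assumes the argument is strict JSON, as the quoted sample suggests. The function name extract_map_config and the sample_source string are invented for illustration, and reading keys "0" and "1" as a coordinate pair is a guess from the sample values, not something stated in the thread.

```python
import json


def extract_map_config(page_source, marker="new FbGoogleMapViz("):
    """Return the JSON object passed to the map constructor as a dict."""
    start = page_source.index(marker)
    brace = page_source.index("{", start)
    # raw_decode parses exactly one JSON value beginning at `brace` and
    # ignores whatever follows it (the closing parenthesis, more script).
    config, _end = json.JSONDecoder().raw_decode(page_source, brace)
    return config


# Tiny stand-in for the real page source, shaped like the sample above.
sample_source = r'''<script>new FbGoogleMapViz('table_map', {"icons":[{"0":"4.8833333333333","1":"-1.6833333333332803","2":"Attack ID: 264-13 <br \/> Date: 2013-12-31","3":"orange-dot.png","4":25,"5":25}],"zoomlevel":2});</script>'''

config = extract_map_config(sample_source)
for icon in config["icons"]:
    # Assumption: "0" and "1" hold the coordinates, "2" the popup HTML.
    print(float(icon["0"]), float(icon["1"]), icon["2"])
```

Using the decoder rather than counting braces by hand avoids being confused by any braces that appear inside the quoted HTML strings.
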
> After doing all the transformations on the first page, you can reuse them
> on the second page by extracting and applying the operation history to
> perform all the operations in a matter of seconds.
>
> Unfortunately, the site's web server is misconfigured, which is why Python
> couldn't access it and Java (which OpenRefine is written in) wasn't any
> happier until I provided it with the missing SSL certificates and told it
> to ignore the misconfigured SNI.  Before I did that, the first thing I
> tried was accessing the site over HTTP instead of HTTPS, but they've got
> that set up to just redirect back to HTTPS.  Foiled!
>
> I can write up the details of how to work around the SSL problem if
> anyone is interested, but it's an advanced topic (which should only be
> required rarely).
>
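[Inline note] Tom's actual workaround was on the Java side and its details are not given here. As a loosely analogous Python sketch, and only under the assumption that the failure comes from certificates missing from the chain the server sends, one could add those certificates to an otherwise default SSL context. The PEM file path is hypothetical, the function names are invented for this example, and the sketch does nothing about the SNI misconfiguration.

```python
import ssl
import urllib.request


def build_context(extra_ca_file=None):
    """Default verifying context, optionally trusting extra certificates."""
    ctx = ssl.create_default_context()  # system trust store, hostname checks on
    if extra_ca_file:
        # Hypothetical PEM file holding the certificates the server omits.
        ctx.load_verify_locations(cafile=extra_ca_file)
    return ctx


def fetch(url, extra_ca_file=None):
    """Download a page over HTTPS using the context built above."""
    ctx = build_context(extra_ca_file)
    with urllib.request.urlopen(url, context=ctx, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


ctx = build_context()
print(ctx.verify_mode, ctx.check_hostname)
```

Certificate verification stays switched on in this sketch; turning it off entirely would also make the error go away, but at the cost of not knowing which server is actually answering.
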
> Attached are the CSV files with the data extracted from the two maps. The
> OpenRefine projects are a little bulky for the list (4 MB) because they
> include the HTML for all the linked pages, but I'm happy to send them to
> anyone who's interested.
>
> Tom
>