[ddj] [School-of-data] Scraping piracy data from a map using OpenRefine

Erwin Verbruggen everbruggen at beeldengeluid.nl
Thu Nov 13 13:48:17 UTC 2014


There’s also a project around publishing piracy data as Linked Open Data (LOD):
http://datahub.io/dataset/linked-open-piracy

Bestest,
Erwin


On 9 nov. 2014, at 12:56, Thomas Levine <_ at thomaslevine.com> wrote:

> Or I'll write up the details of working around the SSL problem: Set the `verify` argument to `False` or to the name of the certificate file.
> http://docs.python-requests.org/en/latest/user/advanced/
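
To make the `verify` suggestion above concrete, here is a minimal sketch using the requests library. The function name `fetch_page`, the `session` parameter and the certificate file name in the usage note are illustrative assumptions, not anything from the thread. Note that `verify=False` switches certificate checking off entirely, so pointing `verify` at a certificate file is the safer of the two options.

```python
import requests


def fetch_page(url, ca_bundle=None, session=None):
    """Fetch a page from a server whose certificate chain is misconfigured.

    ca_bundle: path to a PEM file with the missing certificates (assumed to
               have been obtained separately). If omitted, certificate
               verification is disabled, which is insecure.
    session:   optional requests.Session-like object, mainly for testing.
    """
    http = session if session is not None else requests.Session()
    # `verify` accepts either False or the path to a certificate file.
    verify = ca_bundle if ca_bundle is not None else False
    response = http.get(url, verify=verify, timeout=30)
    response.raise_for_status()
    return response.text
```

For example, `fetch_page("https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map", ca_bundle="icc-ccs-chain.pem")`, where `icc-ccs-chain.pem` is a hypothetical file name.
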
> 
> On November 8, 2014 11:21:11 AM PST, Tom Morris <tfmorris at gmail.com> wrote:
> [Pulling this into its own thread]
> 
> On Fri, Nov 7, 2014 at 1:05 PM, Idoia Sota <idoiasota at gmail.com> wrote:
> Dear All,
> 
>     I'm trying to scrape this map with no success at all: https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map/piracy-map-2013 and https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map
> 
>      There are two levels of information in it: the one on the tooltip and the one on the link that appears on the tooltip. 
> 
>      I've tried ScraperWiki (JSON), but it gives me an error (I don't even know if it makes sense to use it). Then I tried to write code on ScraperWiki, but fetching the HTML gave this error (image attached). It seems I need some certificate for the page. Nevertheless, I can see all the data when I click on "see the html code of this page" (image 2 attached).
> 
>    Can anybody tell me what the best approach would be? Can you help me? Thank you so much!
> 
> Idoia
> 
> You can get the basic data by using View Source in your browser on the page, locating the JSON map data and pasting it into OpenRefine using the Clipboard source in the Create Project dialog.  It'll parse the JSON automatically, and you can then use its UI to split apart fields that hold several pieces of data (e.g. the HTML detail for the map popup) and fetch the contents of the detailed incident pages.
> 
> The JSON map data looks like this:
> 
> new FbGoogleMapViz('table_map', {"icons":[{"0":"4.8833333333333","1":" -1.6833333333332803","2":"Attack ID: 264-13 <br \/> Date: 2013-12-31 <br \/> Vessel: General Cargo <br \/> Status: Boarded <br \/> Full report: <a class=\"fabrik___rowlink\"  href=\"\/piracy-reporting-centre\/live-piracy-map\/piracy-map-2013\/details\/133\/623\" title=\"View\"><img src=\"https:\/\/www.icc-ccs.org\/media\/com_fabrik\/images\/view.png\" alt=\"View\" \/><span>View<\/span><\/a>","3":"orange-dot.png","4":25,"5":25,"groupkey":"0","listid":"144","title":""},
> 
> ...
> 
> }],"polyline":[],"id":"46","zoomlevel":2,"scalecontrol":false,"maptypecontrol":true,"overviewcontrol":false,"center":"middle","ajax_refresh":false,"ajax_refresh_center":"1","maptype":"G_HYBRID_MAP","clustering":false,"cluster_splits":"10,50","icon_increment":"2","refresh_rate":"10000","use_cookies":true,"container":"googlemap_46","polylinewidth":["10"],"polylinecolour":["#CCFFFF"],"overlay_urls":[],"overlay_labels":[],"use_overlays":0,"use_overlays_sidebar":false,"groupTemplates":[],"zoomStyle":0,"zoom":"1"})
> 
> 
> and the bit that you want to paste into OpenRefine (you could use Python's JSON parsing too) is the part between the outer curly braces {} (including the braces).
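
For anyone taking the Python route mentioned above, the following is a rough sketch of pulling the object between the outer braces out of the page source and splitting the popup HTML into fields. The function names, the trimmed sample string, and the reading of keys "0" and "1" as latitude and longitude are assumptions made for illustration; the thread does not say what those keys mean.

```python
import json
import re


def extract_map_config(page_source, marker="new FbGoogleMapViz("):
    """Return the JSON object passed to the FbGoogleMapViz constructor."""
    start = page_source.index(marker)
    first_brace = page_source.index("{", start)
    # raw_decode parses exactly one JSON value starting at first_brace and
    # ignores whatever follows it (here, the closing parenthesis).
    config, _end = json.JSONDecoder().raw_decode(page_source, first_brace)
    return config


def icon_to_record(icon):
    """Flatten one entry of config["icons"] into a plain dict."""
    # Assumption: "0" is the latitude and "1" the longitude.
    record = {"latitude": float(icon["0"]), "longitude": float(icon["1"])}
    popup_html = icon["2"]
    # The popup is a series of "Label: value" pairs separated by <br /> tags.
    for part in re.split(r"<br\s*/?>", popup_html):
        label, sep, value = part.partition(":")
        if sep and label.strip() != "Full report":
            record[label.strip()] = value.strip()
    # The link to the detailed incident page is the first href in the popup.
    link = re.search(r'href="([^"]+)"', popup_html)
    if link:
        record["report_path"] = link.group(1)
    return record


# Trimmed sample modelled on the snippet quoted above (not the full page).
SAMPLE = r'''new FbGoogleMapViz('table_map', {"icons":[{"0":"4.8833333333333","1":" -1.6833333333332803","2":"Attack ID: 264-13 <br \/> Date: 2013-12-31 <br \/> Vessel: General Cargo <br \/> Status: Boarded <br \/> Full report: <a href=\"\/piracy-reporting-centre\/live-piracy-map\/piracy-map-2013\/details\/133\/623\" title=\"View\">View<\/a>","3":"orange-dot.png"}],"zoomlevel":2})'''

records = [icon_to_record(icon) for icon in extract_map_config(SAMPLE)["icons"]]
```

Each record can then be written out with the `csv` module, and `report_path` can be joined to the site's base URL to fetch the detailed incident page.
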
> 
> After doing all the transformations on the first page, you can reuse them on the second page by extracting and applying the operation history to perform all the operations in a matter of seconds.
> 
> Unfortunately, the site's web server is misconfigured, which is why Python couldn't access it. Java (which OpenRefine is written in) wasn't any happier until I provided it with the missing SSL certificates and told it to ignore the misconfigured SNI.  Before I did that, the first thing I tried was accessing the site over HTTP instead of HTTPS, but they've got that set up to just redirect back to HTTPS.  Foiled!
> 
> I can write up the details of how to work around the SSL problem if anyone is interested, but it's an advanced topic (which should only be required rarely).
> 
> Attached are the CSV files with the data extracted from the two maps. The OpenRefine projects are a little bulky for the list (4 MB) because they include the HTML for all the linked pages, but I'm happy to send them to anyone who's interested.
> 
> Tom
> 
> 
> 
> school-of-data mailing list
> school-of-data at lists.okfn.org
> https://lists.okfn.org/mailman/listinfo/school-of-data
> Unsubscribe: https://lists.okfn.org/mailman/options/school-of-data
