[School-of-data] Scraping piracy data from a map using OpenRefine

Sun Nov 9 11:56:03 UTC 2014

Or I'll write up the details of working around the SSL problem: with the Python requests library, set the `verify` argument to `False` (which skips certificate verification) or to the path of a certificate file.
http://docs.python-requests.org/en/latest/user/advanced/
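
A minimal sketch of that, assuming the requests library is installed. The helper name `make_session` and the file name `icc-ccs-chain.pem` are made up for illustration, and I have not confirmed that this alone gets past the SNI problem on this particular server:

```python
import requests


def make_session(verify=True):
    """Return a requests Session with the given certificate policy.

    verify=True   use the default CA bundle (normal, safe behaviour)
    verify=False  skip certificate verification (quick experiments only)
    verify="path" trust the certificates in that PEM file
    """
    session = requests.Session()
    session.verify = verify
    return session


# Hypothetical usage (not run here, since it needs network access):
#   session = make_session("icc-ccs-chain.pem")
#   html = session.get(
#       "https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map",
#       timeout=30,
#   ).text
```

Pointing `verify` at a certificate file is the safer of the two options, because `verify=False` accepts any certificate at all.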

On November 8, 2014 11:21:11 AM PST, Tom Morris <tfmorris at gmail.com> wrote:
>[Pulling this into its own thread]
>
>On Fri, Nov 7, 2014 at 1:05 PM, Idoia Sota <idoiasota at gmail.com> wrote:
>
>> Dear All,
>>
>>     I'm trying to scrape this map with no success at all:
>> https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map/piracy-map-2013
>> and https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map
>>
>>      There are two levels of information in it: the one on the tooltip and
>> the one on the link that appears on the tooltip.
>>
>>      I've tried ScraperWiki (JSON), but it gives me an error (I don't even
>> know if it makes sense to use it). Then I tried to code on ScraperWiki,
>> but getting the HTML code gave this error (image attached). It seems I need
>> to have some certificate for the page. Nevertheless, I can see all the data
>> when I click on "see the html code of this page" (image 2 attached).
>>
>>    Can anybody tell me what would be the best thing to do with this? Can you
>> help me? Thank you so much!
>>
>> Idoia
>>
>
>You can get the basic data by using View Source in your browser on the
>page, locating the JSON map data, and pasting it into OpenRefine using the
>Clipboard source in the Create Project dialog.  It'll parse the JSON
>automatically, and you can then use its UI to split apart fields with
>multiple types of data (e.g. the HTML detail for the map popup) and fetch
>the contents of the detailed incident page.
>
>The JSON map data looks like this:
>
>new FbGoogleMapViz('table_map', {"icons":[{"0":"4.8833333333333","1":"
>-1.6833333333332803","2":"Attack ID: 264-13 <br \/> Date: 2013-12-31
><br
>\/> Vessel: General Cargo <br \/> Status: Boarded <br \/> Full report:
><a
>class=\"fabrik___rowlink\"
>href=\"\/piracy-reporting-centre\/live-piracy-map\/piracy-map-2013\/details\/133\/623\"
>title=\"View\"><img
>src=\"https:\/\/www.icc-ccs.org\/media\/com_fabrik\/images\/view.png\"
>alt=\"View\"
>\/><span>View<\/span><\/a>","3":"orange-dot.png","4":25,"5":25,"groupkey":"0","listid":"144","title":""},
>
>...
>
>}],"polyline":[],"id":"46","zoomlevel":2,"scalecontrol":false,"maptypecontrol":true,"overviewcontrol":false,"center":"middle","ajax_refresh":false,"ajax_refresh_center":"1","maptype":"G_HYBRID_MAP","clustering":false,"cluster_splits":"10,50","icon_increment":"2","refresh_rate":"10000","use_cookies":true,"container":"googlemap_46","polylinewidth":["10"],"polylinecolour":["#CCFFFF"],"overlay_urls":[],"overlay_labels":[],"use_overlays":0,"use_overlays_sidebar":false,"groupTemplates":[],"zoomStyle":0,"zoom":"1"})
>
>
>and the bit that you want to paste into OpenRefine (you could use Python's
>JSON parsing too) is the part between the outer curly braces {}
>(including the braces).
>
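For anyone taking the Python route instead, here is a rough sketch of the same extraction using only the standard library. The sample string is abridged from the snippet above. Reading key "0" as latitude and key "1" as longitude is my assumption from the values, and the function and field names are invented for illustration:

```python
import json
import re

# Abridged from the page source quoted above (one icon, most settings dropped).
SAMPLE = r'''new FbGoogleMapViz('table_map', {"icons":[{"0":"4.8833333333333","1":" -1.6833333333332803","2":"Attack ID: 264-13 <br \/> Date: 2013-12-31 <br \/> Vessel: General Cargo <br \/> Status: Boarded <br \/> Full report: <a class=\"fabrik___rowlink\" href=\"\/piracy-reporting-centre\/live-piracy-map\/piracy-map-2013\/details\/133\/623\" title=\"View\"><span>View<\/span><\/a>","3":"orange-dot.png","4":25,"5":25,"groupkey":"0","listid":"144","title":""}],"polyline":[],"id":"46","zoomlevel":2})'''


def extract_map_json(page_source):
    """Parse the object literal passed to FbGoogleMapViz (braces included)."""
    start = page_source.index("{", page_source.index("FbGoogleMapViz"))
    # raw_decode parses one JSON value and ignores whatever follows it.
    data, _end = json.JSONDecoder().raw_decode(page_source, start)
    return data


def parse_tooltip(html):
    """Split 'Key: value <br /> Key: value ...' into a dict."""
    fields = {}
    for part in re.split(r"<br\s*/?>", html):
        key, sep, value = part.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def incidents(map_data):
    """Return one flat dict per map marker."""
    rows = []
    for icon in map_data["icons"]:
        fields = parse_tooltip(icon["2"])
        link = re.search(r'href="([^"]+)"', icon["2"])
        rows.append({
            "lat": float(icon["0"]),  # assumed: key "0" is latitude
            "lon": float(icon["1"]),  # assumed: key "1" is longitude
            "attack_id": fields.get("Attack ID"),
            "date": fields.get("Date"),
            "vessel": fields.get("Vessel"),
            "status": fields.get("Status"),
            "detail_path": link.group(1) if link else None,
        })
    return rows


rows = incidents(extract_map_json(SAMPLE))
```

The `detail_path` values are relative to the site root, so joining them to https://www.icc-ccs.org gives the second level of information (the full report pages).
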
>After doing all the transformations on the first page, you can reuse them
>on the second page by extracting and applying the operation history to
>perform all the operations in a matter of seconds.
>
>Unfortunately, the site's web server is misconfigured, which is why Python
>couldn't access it. Java (which OpenRefine is written in) wasn't any
>happier until I provided it with the missing SSL certificates and told it
>to ignore the misconfigured SNI.  Before I did that, the first thing I
>tried was accessing the site over HTTP instead of HTTPS, but they've got
>that set up to just redirect back to HTTPS.  Foiled!
>
>I can write up the details of how to work around the SSL problem if anyone
>is interested, but it's an advanced topic (which should only be required
>rarely).
>
>Attached are the CSV files with the data extracted from the two maps. The
>OpenRefine projects are a little bulky for the list (4 MB) because they
>include the HTML for all the linked pages, but I'm happy to send them to
>anyone who's interested.
>
>Tom
>