[School-of-data] [ddj] Scraping piracy data from a map using OpenRefine

Thu Nov 13 14:34:50 UTC 2014

Hi Erwin,

Thank you very much. Actually, I already spoke with Willem: he's done all
the work up to 2010 ;)

Idoia

2014-11-13 14:48 GMT+01:00 Erwin Verbruggen <everbruggen at beeldengeluid.nl>:

> There’s also a project around publishing piracy data as LOD:
> http://datahub.io/dataset/linked-open-piracy
>
> Bestest,
> Erwin
>
>
> On 9 nov. 2014, at 12:56, Thomas Levine <_ at thomaslevine.com> wrote:
>
> Or I'll write up the details of working around the SSL problem: set the
> `verify` argument of the Requests call to `False` (which skips certificate
> verification) or to the path of the certificate file.
> http://docs.python-requests.org/en/latest/user/advanced/
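
A minimal sketch of that `verify` workaround, for illustration only. The helper name `fetch_page`, the `ca_bundle` parameter and the 30-second timeout are assumptions, not something from the thread. Passing `verify=False` switches certificate checking off entirely, so pointing `verify` at a certificate file is the safer of the two options.

```python
def fetch_page(url, ca_bundle=None, session=None):
    """Fetch a page, working around a server with a broken certificate chain.

    ca_bundle: path to a PEM file holding the missing certificates
               (hypothetical file name supplied by the caller). If it is
               None, verification is skipped altogether.
    session:   optional requests.Session-like object, mainly for testing.
    """
    if session is None:
        # Third-party library; imported lazily so the helper can be
        # exercised offline with a stand-in session.
        import requests
        session = requests.Session()

    # `verify` accepts either False or the path to a CA bundle file.
    verify = ca_bundle if ca_bundle else False

    response = session.get(url, verify=verify, timeout=30)
    response.raise_for_status()
    return response.text
```

With a certificate file at hand, the call would look like `fetch_page(url, ca_bundle="chain.pem")`, where `chain.pem` is a placeholder name.
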
>
> On November 8, 2014 11:21:11 AM PST, Tom Morris <tfmorris at gmail.com>
> wrote:
>>
>> [Pulling this into its own thread]
>>
>> On Fri, Nov 7, 2014 at 1:05 PM, Idoia Sota <idoiasota at gmail.com> wrote:
>>
>>> Dear All,
>>>
>>>     I'm trying to scrape this map with no success at all:
>>> https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map/piracy-map-2013
>>> and https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map
>>>
>>>      There are two levels of information in it: the one on the tooltip
>>> and the one on the link that appears on the tooltip.
>>>
>>>      I've tried ScraperWiki (JSON), but it gives me an error (I don't
>>> even know if it makes sense to use it). Then I tried to write code on
>>> ScraperWiki, but fetching the HTML gave this error (image attached). It
>>> seems I need some certificate for the page. Nevertheless, I can see
>>> all the data when I click on "see the html code of this page" (image 2
>>> attached).
>>>
>>>    Can anybody tell me what the best way to handle this would be? Can
>>> you help me? Thank you so much!
>>>
>>> Idoia
>>>
>>
>> You can get the basic data by using View Source in your browser on the
>> page, locating the JSON map data and pasting it into OpenRefine using the
>> Clipboard source for the Create Project dialog.  It'll parse the JSON
>> automatically and you can then use its UI to split apart fields with
>> multiple types of data (e.g. the HTML detail for the map popup) and fetch
>> the contents of the detailed incident page.
>>
>> The JSON map data looks like this:
>>
>> new FbGoogleMapViz('table_map', {"icons":[{"0":"4.8833333333333","1":"
>> -1.6833333333332803","2":"Attack ID: 264-13 <br \/> Date: 2013-12-31 <br
>> \/> Vessel: General Cargo <br \/> Status: Boarded <br \/> Full report: <a
>> class=\"fabrik___rowlink\"
>> href=\"\/piracy-reporting-centre\/live-piracy-map\/piracy-map-2013\/details\/133\/623\"
>> title=\"View\"><img src=\"https:\/\/www.icc-ccs.org\/media\/com_fabrik\/images\/view.png\"
>> alt=\"View\"
>> \/><span>View<\/span><\/a>","3":"orange-dot.png","4":25,"5":25,"groupkey":"0","listid":"144","title":""},
>>
>> ...
>>
>>
>> }],"polyline":[],"id":"46","zoomlevel":2,"scalecontrol":false,"maptypecontrol":true,"overviewcontrol":false,"center":"middle","ajax_refresh":false,"ajax_refresh_center":"1","maptype":"G_HYBRID_MAP","clustering":false,"cluster_splits":"10,50","icon_increment":"2","refresh_rate":"10000","use_cookies":true,"container":"googlemap_46","polylinewidth":["10"],"polylinecolour":["#CCFFFF"],"overlay_urls":[],"overlay_labels":[],"use_overlays":0,"use_overlays_sidebar":false,"groupTemplates":[],"zoomStyle":0,"zoom":"1"})
>>
>>
>> and the bit that you want to paste into OpenRefine (you could use
>> Python's JSON parsing too) is the part between the outer curly braces {}
>> (including the braces).
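
For the Python route mentioned in passing, here is a rough sketch of pulling out the part between the outer braces and parsing it. It assumes the page source has already been saved or fetched as a string; `extract_map_config` is a made-up name, and the marker text is taken from the snippet quoted above. `json.JSONDecoder.raw_decode` stops at the end of the first complete JSON value, so the trailing `)` and the rest of the page do not need to be trimmed by hand.

```python
import json


def extract_map_config(page_source, marker="new FbGoogleMapViz("):
    """Return the map configuration embedded in the page as a Python dict."""
    # Find the constructor call, then the first opening brace after it.
    start = page_source.index(marker)
    brace = page_source.index("{", start)

    # raw_decode parses one JSON value starting at `brace` and reports where
    # it ended, ignoring whatever JavaScript follows.
    config, _end = json.JSONDecoder().raw_decode(page_source, brace)
    return config
```

The markers then sit in `config["icons"]`, one dict per incident.
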
>>
>> After doing all the transformations on the first page, you can reuse them
>> on the second page by extracting and applying the operation history to
>> perform all the operations in a matter of seconds.
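
The same "define once, apply to both pages" idea carries over to a script. Below is a sketch of the field splitting in Python, assuming each marker is already available as a dict shaped like the sample above. The function names are invented for illustration, and treating key "0" as latitude and key "1" as longitude is a guess based on the sample values, not something the site documents.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.icc-ccs.org"


def split_popup(popup_html, base_url=BASE_URL):
    """Split the popup HTML into labelled fields plus an absolute report URL."""
    record = {}
    # The fields are separated by <br /> tags: "Label: value <br /> ...".
    for part in re.split(r"<br\s*/?>", popup_html):
        label, sep, value = part.partition(":")
        if not sep:
            continue  # no "Label: value" pair in this fragment
        label, value = label.strip(), value.strip()
        if label == "Full report":
            # Keep only the link target and make it absolute.
            match = re.search(r'href="([^"]+)"', value)
            record["report_url"] = (
                urljoin(base_url, match.group(1)) if match else None
            )
        else:
            record[label] = value
    return record


def icon_to_row(icon):
    """Flatten one marker dict into a row (lat/lon mapping is an assumption)."""
    row = {"lat": float(icon["0"]), "lon": float(icon["1"])}
    row.update(split_popup(icon["2"]))
    return row
```

Running `icon_to_row` over the marker list from each of the two maps gives rows that can be written out with the `csv` module.
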
>>
>> Unfortunately, the site's web server is misconfigured, which is why Python
>> couldn't access it and Java (which OpenRefine is written in) wasn't any
>> happier until I provided it with the missing SSL certificates and told it
>> to ignore the misconfigured SNI.  Before I did that, the first thing I
>> tried was accessing the site over HTTP instead of HTTPS, but they've got
>> that set up to just redirect back to HTTPS.  Foiled!
>>
>> I can write up the details of how to work around the SSL problem if
>> anyone is interested, but it's an advanced topic (which should only be
>> required rarely).
>>
>> Attached are the CSV files with the data extracted from the two maps. The
>> OpenRefine projects are a little bulky for the list (4 MB) because they
>> include the HTML for all the linked pages, but I'm happy to send them to
>> anyone who's interested.
>>
>> Tom
>>
>>
>> ------------------------------
>>
>> school-of-data mailing list
>> school-of-data at lists.okfn.org
>> https://lists.okfn.org/mailman/listinfo/school-of-data
>> Unsubscribe: https://lists.okfn.org/mailman/options/school-of-data
>>
>
>
>
> _______________________________________________
> data-driven-journalism mailing list
> data-driven-journalism at lists.okfn.org
> https://lists.okfn.org/mailman/listinfo/data-driven-journalism
> Unsubscribe: https://lists.okfn.org/mailman/options/data-driven-journalism
>
>