[School-of-data] Scraping piracy data from a map using OpenRefine

Thomas Levine _ at thomaslevine.com
Sun Nov 9 11:56:03 UTC 2014

Or I'll write up the details of working around the SSL problem: Set the `verify` argument to `False` or to the name of the certificate file.
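
A minimal sketch of that, assuming the Python requests library (where `verify` is a keyword argument to `requests.get`). The helper names `verify_setting` and `fetch_page` are made up for illustration, and requests is a third-party package that has to be installed separately:

```python
def verify_setting(ca_bundle_path=None):
    """Return the value to pass as the `verify` argument.

    With no certificate file this returns False (certificate checking is
    switched off); otherwise it returns the path to the certificate file.
    """
    return ca_bundle_path if ca_bundle_path else False


def fetch_page(url, ca_bundle_path=None):
    """Fetch a page, working around the certificate problem.

    Assumption: the requests library is the HTTP client in use.
    """
    import requests  # third-party; imported here so verify_setting works without it

    response = requests.get(url, verify=verify_setting(ca_bundle_path), timeout=30)
    response.raise_for_status()
    return response.text
```

Passing `False` turns certificate checking off entirely, so pointing `verify` at a certificate file is the safer of the two options.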

On November 8, 2014 11:21:11 AM PST, Tom Morris <tfmorris at gmail.com> wrote:
>[Pulling this into its own thread]
>On Fri, Nov 7, 2014 at 1:05 PM, Idoia Sota <idoiasota at gmail.com> wrote:
>> Dear All,
>>     I'm trying to scrape this map with no success at all:
>> and https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map
>>      There are two levels of information in it: the one on the
>tooltip and
>> the one on the link that appears on the tooltip.
>>      I've tried ScraperWiki (JSON), but it gives me an error (I don't
>> know if it makes sense to use it). And then tried to code on
>> but getting the HTML code gave this error (image attached). It seems I
>> need to have some certificate for the page. Nevertheless, I can see all
>the data
>> when I click on "see the html code of this page". (image 2 attached)
>>    Can anybody tell me what would be the best thing to do with this? Can
>> anyone help me? Thank you so much!
>> Idoia
>You can get the basic data by using View Source in your browser on the
>page, locating the JSON map data and pasting it into OpenRefine using
>Clipboard source for the Create Project dialog.  It'll parse the JSON
>automatically and you can then use its UI to split apart fields with
>multiple types of data (e.g. the HTML detail for the map popup) and
>the contents of the detailed incident page.
>The JSON map data looks like this:
>new FbGoogleMapViz('table_map', {"icons":[{"0":"4.8833333333333","1":"
>-1.6833333333332803","2":"Attack ID: 264-13 <br \/> Date: 2013-12-31
>\/> Vessel: General Cargo <br \/> Status: Boarded <br \/> Full report:
>and the bit that you want to paste into OpenRefine (you could use
>JSON parsing too) is the part between the outer curly braces {}
>(including the braces).
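
If you go the JSON-parsing route instead of pasting by hand, the step above could be sketched in Python roughly as follows. This is only a sketch: `extract_map_data` is a made-up name, `SAMPLE` is a shortened stand-in built from the fragment quoted above (the real page is much longer), and it assumes the object after `new FbGoogleMapViz(` is valid JSON.

```python
import json

# Shortened, illustrative stand-in for the page source (not the real page).
SAMPLE = r'''<script>new FbGoogleMapViz('table_map', {"icons":[{"0":"4.8833333333333","1":"-1.6833333333332803","2":"Attack ID: 264-13 <br \/> Date: 2013-12-31"}]});</script>'''


def extract_map_data(page_source, marker="new FbGoogleMapViz("):
    """Parse the JSON object that follows the marker in the page source."""
    start = page_source.index(marker)       # where the map constructor begins
    brace = page_source.index("{", start)   # the outer opening curly brace
    # raw_decode parses one JSON value starting at `brace` and ignores
    # whatever JavaScript follows the matching closing brace.
    data, _end = json.JSONDecoder().raw_decode(page_source, brace)
    return data


if __name__ == "__main__":
    icons = extract_map_data(SAMPLE)["icons"]
    print(len(icons), "map point(s) found")
```

Letting the JSON decoder find the matching closing brace avoids counting braces by hand, which matters because the popup text can contain arbitrary characters.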
>After doing all the transformations on the first page, you can reuse
>them on the second page by extracting and applying the operation history to
>perform all the operations in a matter of seconds.
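
For anyone doing the field splitting outside OpenRefine, a rough Python stand-in for the popup transformation is sketched here. It is a substitute for the OpenRefine steps, not a record of them: `split_popup` is a made-up name, and it assumes the popup text has already been JSON-decoded, so that `<br \/>` reads `<br />`. Because it is an ordinary function, the same code can be run over the records from both pages.

```python
import re


def split_popup(popup_html):
    """Split 'Label: value <br /> Label: value' popup text into a dict."""
    fields = {}
    for part in re.split(r"<br\s*/?>", popup_html):
        # Split on the first colon only, so values that contain colons
        # (a URL, for instance) stay in one piece.
        key, sep, value = part.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


if __name__ == "__main__":
    print(split_popup("Vessel: General Cargo <br /> Status: Boarded"))
```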
>Unfortunately, the site's web server is misconfigured, which is why you
>couldn't access it, and Java (which OpenRefine is written in) wasn't any
>happier until I provided it with the missing SSL certificates and told it
>to ignore the misconfigured SNI.  Before I did that, the first thing I
>tried was accessing the site over HTTP instead of HTTPS, but they've got
>that set up to just redirect back to HTTPS.  Foiled!
>I can write up the details of how to work around the SSL problem if anyone
>is interested, but it's an advanced topic (which should only be
>Attached are the CSV files with the data extracted from the two maps.
>OpenRefine projects are a little bulky for the list (4 MB) because they
>include the HTML for all the linked pages, but I'm happy to send them to
>anyone who's interested.