[OKCon-Programme] Fwd: Wikileaks Visualisation Project Proposal

David Eaves david at eaves.ca
Tue May 24 18:00:55 BST 2011

Hi Daniel - I'm afraid the math in this is beyond me, but I like the 


On 11-05-24 4:49 AM, Daniel Dietrich wrote:
> Here is another late proposal.
> Would someone like to give it a read?
> Daniel
> Begin forwarded message:
>> *From: *Hannes Bretschneider <habretschneider at gmail.com>
>> *Date: *24 May 2011 12:44:20 CEST
>> *To: *okcon at okfn.org
>> *Subject: **Wikileaks Visualisation Project Proposal*
>> Dear OKCon organizers,
>> I'd like to propose a project for the OKCon 2011 hackday for the 
>> visualisation of the Wikileaks Iraq War Diaries. I'm academically 
>> involved in Machine Learning research and when the Wikileaks data was 
>> published I gained the strong impression that this data contained 
>> untapped potential in the form of hidden relationships, which could 
>> be better understood using modern Machine Learning algorithms. 
>> Several months ago I therefore developed the prototype of an app 
>> to show how such hidden relationships could be intuitively revealed 
>> and visualised with Machine Learning tools.
>> My app uses the full text of the reports from the Iraq War Diaries 
>> which are written by the troops on the ground and encodes the text as 
>> numerical vectors by simple counting of words. I use the t-SNE 
>> algorithm (van der Maaten & Hinton, 2008) to project these 
>> high-dimensional (about 10,000 dimensions) vectors into the 
>> two-dimensional plane. The result is a "contextual map" of the data, 
>> in which each report is represented by a point on the map. Reports 
>> which share many words are mapped close to one another. 
>> Clusters therefore emerge that group reports on distinct topics. 
>> The data also contains categories for the reports such as "Explosive 
>> Hazard", "Enemy Action" and "Criminal Event" which can be used to 
>> color the points to encode more information. The interactive app 
>> makes it possible to explore the data and link the map back to the 
>> original text. Hovering over the points shows the titles of the 
>> reports to quickly gain an impression of the topics in different 
>> regions of the map. By clicking on the points, the user may view the 
>> full report, including date and time, title and full text, and see the 
>> location of the event on a map. To better explore regions of the map, 
>> the view can be zoomed and panned. Also, above a certain zoom 
>> level the five most relevant words for each report are shown beneath 
>> the point.
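
The encoding step described above (counting words per report) and the "most relevant words" labels can be sketched roughly as follows. This is a minimal illustration and not the app's actual code: the sample reports are invented, and ranking words by TF-IDF is only an assumption, since the proposal does not say how relevance is measured. The t-SNE projection itself is not shown; the count vectors would be handed to an existing t-SNE implementation.

```python
import math
import re
from collections import Counter

# Invented stand-in reports; the real input would be the Iraq War Diaries texts.
REPORTS = [
    "ied found on route near checkpoint ied cleared",
    "small arms fire on patrol near checkpoint",
    "detainee reported abuse by police at station",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def count_vectors(reports):
    """Encode each report as a vector of word counts over a shared vocabulary."""
    counts = [Counter(tokenize(r)) for r in reports]
    vocab = sorted(set().union(*counts))
    vectors = [[c[w] for w in vocab] for c in counts]
    return vocab, vectors

def top_words(reports, k=5):
    """Rank each report's words by TF-IDF (an assumed relevance measure)."""
    counts = [Counter(tokenize(r)) for r in reports]
    n = len(reports)
    # Number of reports in which each word appears at least once.
    doc_freq = Counter(w for c in counts for w in c)
    result = []
    for c in counts:
        scores = {w: tf * math.log(n / doc_freq[w]) for w, tf in c.items()}
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result.append(ranked[:k])
    return result

vocab, vectors = count_vectors(REPORTS)
labels = top_words(REPORTS, k=5)
```

Each row of `vectors` has one entry per vocabulary word, so with the full dataset the rows would have the roughly 10,000 dimensions mentioned above. Words that occur in every report get a TF-IDF score of zero, which keeps them out of the per-point labels.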
>> How this works can be better understood by watching this short 
>> screencast of the app in use: 
>> http://dl.dropbox.com/u/1599047/wikileaks-screencast.mov
>> The attached presentation explains the algorithm in more detail (and 
>> somewhat more academically).
>> This visualisation therefore provides a novel way of quickly looking 
>> at such massive datasets as the Wikileaks data from "30,000 feet" 
>> above. Size and category of the clusters provide an immediate 
>> impression of the relative number of documents on each topic and 
>> category. Hovering and viewing the titles provides quick access to 
>> the types of documents in different clusters and regions. If a region 
>> of interest has been found, it can be further zoomed in and 
>> investigated by reading the full text reports.
>> For a journalist receiving such a large dataset this provides a more 
>> intuitive way of exploring the documents and their relationships. 
>> Usually, the journalist would try to filter the data by dates/times, 
>> categories or keywords to find interesting stories 
>> (http://www.guardian.co.uk/news/datablog/2011/jan/31/wikileaks-data-journalism, 
>> http://www.nytimes.com/2011/01/30/magazine/30Wikileaks-t.html). But 
>> this requires that she already knows what she's looking for and has a 
>> pre-conceived notion of what the data is about. The risk is therefore 
>> that interesting and newsworthy stories are missed because they are 
>> hidden in too much data, like single trees in a forest. My approach 
>> is an attempt to "let the data speak" first to get a high-level 
>> overview of the data and aid serendipitous discovery of unexpected 
>> documents. Of course, this approach can still be combined with 
>> traditional filters and keyword searches.
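
For comparison, the traditional filtering by date, category or keyword mentioned above might look like the sketch below. The `Report` record, its field names and the sample entries are hypothetical and chosen only for illustration; only the category names are taken from the proposal.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Report:
    # Hypothetical record layout; these field names are assumptions.
    day: date
    category: str
    title: str
    text: str

def filter_reports(reports, start=None, end=None, category=None, keyword=None):
    """Return reports that pass every filter that is set (None disables a filter)."""
    selected = []
    for r in reports:
        if start is not None and r.day < start:
            continue
        if end is not None and r.day > end:
            continue
        if category is not None and r.category != category:
            continue
        if keyword is not None and keyword.lower() not in r.text.lower():
            continue
        selected.append(r)
    return selected

# Invented sample records for illustration only.
SAMPLE = [
    Report(date(2006, 3, 1), "Explosive Hazard", "IED found", "ied found on route"),
    Report(date(2006, 5, 9), "Enemy Action", "Small arms fire", "patrol received small arms fire"),
    Report(date(2007, 1, 4), "Criminal Event", "Theft reported", "theft reported at market"),
]

hits = filter_reports(SAMPLE, start=date(2006, 1, 1), end=date(2006, 12, 31), keyword="patrol")
```

Every filter here has to be chosen in advance, which is the limitation the map-based overview is meant to complement.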
>> For example, my algorithm clusters many reports together that were 
>> the source material for reports that the allied forces ignored 
>> torture and abuse of detainees by Iraqi security forces 
>> (http://www.guardian.co.uk/world/2010/oct/22/iraq-war-logs-military-leaks). 
>> Even without a pre-conceived notion of relevant keywords, my app 
>> would have allowed a user to uncover this story in the data.
>> So far, my app is a proof-of-concept prototype and only runs locally. 
>> To generate some larger discussion among people involved in data 
>> journalism, visualisation and Machine Learning research I would like 
>> to rewrite the app and create an interactive webpage to host 
>> it. OKCon would provide an ideal venue to build an improved and 
>> web-ready version of the app during a hackday and then to seek some 
>> discussion with people from the journalism world during the 
>> conference. The goal would be to assemble a small team of maybe 5 
>> people with technical skills and graphic design skills to finish a 
>> reimplementation during the hackday. Non-technical people from the 
>> journalism world would be welcome to help improve the design, ensure 
>> usability and promote the webpage on the internet and in the media. 
>> For the hackday, we would only need a small room for 5-10 people. 
>> Ideally, the OKFN could also provide webhosting. During the 
>> conference it might be good to have a small booth or space where the 
>> results can be demonstrated and the developers can discuss the 
>> project with visitors.
>> I have presented my project to several senior Machine Learning 
>> researchers who all support the idea and agree that it is a novel 
>> application of a sophisticated algorithm in an area where such 
>> methods have rarely been employed. This project would thus be a 
>> chance to build bridges between the academic Machine Learning world 
>> and the data journalism and data visualisation world. Given 
>> successful completion of the project, there's the possibility of 
>> publication in an academic journal, such as this upcoming special 
>> issue on data visualisation 
>> http://www.cc.gatech.edu/~lebanon/pub/cfp-10618-201103226.pdf 
>> About me:
>> I have studied Statistics and Computer Science at 
>> Humboldt-Universität Berlin and the University of British Columbia, 
>> focussing on Machine Learning research. Following my Master's degree I 
>> will begin a PhD in Computer Science at the University of Toronto, 
>> working on Machine Learning applications in Genetics. I have 
>> previously contributed to OKFN by helping out on the yourtopia.net 
>> <http://yourtopia.net/> project with Guo Xu, Friedrich Lindenberg and 
>> Dirk Heine.
>> Best regards,
>> Hannes Bretschneider
>> References:
>> L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional 
>> Data Using t-SNE. Journal of Machine Learning 
>> Research 9(Nov):2579-2605, 2008.
>> More info about t-SNE: http://homepage.tudelft.nl/19j49/t-SNE.html
> _______________________________________________
> OKCon-Programme mailing list
> OKCon-Programme at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/okcon-programme