[OKCon-Programme] Fwd: Wikileaks Visualisation Project Proposal

Tue May 24 12:49:28 BST 2011

Here is another late proposal.

Would someone like to give it a read?

Daniel

Begin forwarded message:

> From: Hannes Bretschneider <habretschneider at gmail.com>
> Date: 24 May 2011 12:44:20 CEST
> To: okcon at okfn.org
> Subject: Wikileaks Visualisation Project Proposal
> 
> Dear OKCon organizers,
> 
> I'd like to propose a project for the OKCon 2011 hackday: a visualisation of the Wikileaks Iraq War Diaries. I'm academically involved in Machine Learning research, and when the Wikileaks data was published I gained the strong impression that it contained untapped potential in the form of hidden relationships, which could be better understood using modern Machine Learning algorithms. Several months ago I therefore developed a prototype app to show how such hidden relationships could be revealed and visualised intuitively with Machine Learning tools.
> 
> My app uses the full text of the reports from the Iraq War Diaries, which were written by the troops on the ground, and encodes each text as a numerical vector by simply counting words. I use the t-SNE algorithm (van der Maaten & Hinton, 2008) to project these high-dimensional vectors (about 10,000 dimensions) onto the two-dimensional plane. The result is a "contextual map" of the data, in which each report is represented by a point. Reports that share many words are mapped close to one another, so clusters emerge that group reports on distinct topics. The data also contains categories for the reports, such as "Explosive Hazard", "Enemy Action" and "Criminal Event", which can be used to color the points and so encode more information. The interactive app makes it possible to explore the data and link the map back to the original text. Hovering over a point shows the title of the report, which quickly gives an impression of the topics in different regions of the map. By clicking on a point, the user can view the full report, including date and time, title and full text, and see the location of the event on a geographic map. To better explore regions of the map, the view can be zoomed and panned. Above a certain zoom level, the five most relevant words for each report are also shown beneath its point.
> 
> How this works can be better understood by watching this short screencast of the app in use: http://dl.dropbox.com/u/1599047/wikileaks-screencast.mov
> The attached presentation explains the algorithm in more detail (and somewhat more academically).
> 
> This visualisation therefore provides a novel way of quickly looking at massive datasets such as the Wikileaks data from "30,000 feet" above. The size and category of the clusters give an immediate impression of the relative number of documents on each topic and in each category. Hovering and viewing the titles provides quick access to the types of documents in different clusters and regions. Once a region of interest has been found, the user can zoom in further and investigate it by reading the full-text reports.
> 
> For a journalist receiving such a large dataset, this provides a more intuitive way of exploring the documents and their relationships. Usually, the journalist would try to filter the data by dates/times, categories or keywords to find interesting stories (http://www.guardian.co.uk/news/datablog/2011/jan/31/wikileaks-data-journalism, http://www.nytimes.com/2011/01/30/magazine/30Wikileaks-t.html). But this requires that she already knows what she's looking for and has a pre-conceived notion of what the data is about. The risk is therefore that interesting and newsworthy stories are missed because they are hidden in too much data, like a single tree in a forest. My approach is an attempt to "let the data speak" first, to get a high-level overview of the data and to aid serendipitous discovery of unexpected documents. Of course, this approach can still be combined with traditional filters and keyword searches.
> 
> For example, my algorithm clusters together many of the reports that were the source material for news stories that the allied forces ignored torture and abuse of detainees by Iraqi security forces (http://www.guardian.co.uk/world/2010/oct/22/iraq-war-logs-military-leaks). Even without a pre-conceived notion of relevant keywords, my app would have allowed a user to uncover this story in the data.
> 
> So far, my app is a proof-of-concept prototype and only runs locally. To generate a larger discussion among people involved in data journalism, visualisation and Machine Learning research, I would like to rewrite the app and create an interactive webpage to host it. OKCon would provide an ideal venue to build an improved, web-ready version of the app during a hackday and then to seek discussion with people from the journalism world during the conference. The goal would be to assemble a small team of maybe 5 people with technical and graphic design skills to finish a reimplementation during the hackday. Non-technical people from the journalism world would be welcome to help improve the design, ensure usability and promote the webpage on the internet and in the media. For the hackday, we would only need a small room for 5-10 people. Ideally, the OKFN could also provide webhosting. During the conference it might be good to have a small booth or space where the results can be demonstrated and the developers can discuss the project with visitors.
> 
> I have presented my project to several senior Machine Learning researchers, who all support the idea and agree that it is a novel application of a sophisticated algorithm in an area where such methods have rarely been employed. This project would thus be a chance to build bridges between the academic Machine Learning world and the data journalism and data visualisation world. Given successful completion of the project, there is the possibility of publication in an academic journal, such as this upcoming special issue on data visualisation: http://www.cc.gatech.edu/~lebanon/pub/cfp-10618-201103226.pdf
> 
> About me:
> I have studied Statistics and Computer Science at Humboldt-Universität Berlin and the University of British Columbia, focusing on Machine Learning research. Following my Master's degree I will begin a PhD in Computer Science at the University of Toronto, working on Machine Learning applications in Genetics. I have previously contributed to OKFN by helping out on the yourtopia.net project with Guo Xu, Friedrich Lindenberg and Dirk Heine.
> 
> Best regards,
> Hannes Bretschneider
> 
> References:
> L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008. 
> More info about t-SNE: http://homepage.tudelft.nl/19j49/t-SNE.html
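
The encoding step described in the forwarded proposal (counting words per report) and the "most relevant words" labels could look roughly like the sketch below. This is only an illustration, not the author's code: the function names are hypothetical, the tokenizer is a guess, and the proposal does not say how word relevance is scored, so TF-IDF is used here purely as an assumption. The t-SNE projection itself is left out; the count vectors would be handed to an existing implementation such as scikit-learn's sklearn.manifold.TSNE.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def count_vectors(reports):
    """Encode each report as a vector of word counts over a shared, sorted vocabulary."""
    counts = [Counter(tokenize(r)) for r in reports]
    vocab = sorted(set().union(*counts))
    # Counter returns 0 for words a report does not contain.
    vectors = [[c[w] for w in vocab] for c in counts]
    return vocab, vectors


def top_words(reports, k=5):
    """Return the k highest-scoring words per report.

    Relevance is scored with TF-IDF, which is an assumption; the proposal
    does not name the measure it uses. Ties are broken alphabetically.
    """
    counts = [Counter(tokenize(r)) for r in reports]
    n = len(reports)
    # Number of reports in which each word occurs at least once.
    doc_freq = Counter(w for c in counts for w in c)
    result = []
    for c in counts:
        scores = {w: tf * math.log(n / doc_freq[w]) for w, tf in c.items()}
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result.append(ranked[:k])
    return result
```

With about 10,000 distinct words, each report becomes a 10,000-entry count vector, which is the high-dimensional input the proposal projects onto the plane.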

-------------- next part --------------
A non-text attachment was scrubbed...
Name: War Diary t-SNE.pdf
Type: application/pdf
Size: 2332526 bytes
Desc: not available
URL: <http://lists.okfn.org/mailman/private/okcon-programme/attachments/20110524/44c98864/attachment-0001.pdf>