[OKCon-Programme] Fwd: Wikileaks Visualisation Project Proposal
David Eaves
david at eaves.ca
Tue May 24 18:00:55 BST 2011
Hi Daniel - I'm afraid the math in this is beyond me, but I like the
concept.
dave
On 11-05-24 4:49 AM, Daniel Dietrich wrote:
> Here is another late proposal.
>
> Would someone like to give it a read?
>
> Daniel
>
> Begin forwarded message:
>
>> From: Hannes Bretschneider <habretschneider at gmail.com>
>> Date: 24 May 2011 12:44:20 CEST
>> To: okcon at okfn.org
>> Subject: Wikileaks Visualisation Project Proposal
>>
>> Dear OKCon organizers,
>>
>> I'd like to propose a project for the OKCon 2011 hackday: a
>> visualisation of the Wikileaks Iraq War Diaries. I work in academic
>> Machine Learning research, and when the Wikileaks data was published
>> I had the strong impression that it contained untapped potential in
>> the form of hidden relationships, which could be better understood
>> using modern Machine Learning algorithms. Several months ago I
>> therefore developed a prototype app to show how such hidden
>> relationships can be intuitively revealed and visualised with
>> Machine Learning tools.
>>
>> My app uses the full text of the reports in the Iraq War Diaries,
>> which were written by troops on the ground, and encodes each report
>> as a numerical vector by simply counting words. I use the t-SNE
>> algorithm (van der Maaten & Hinton, 2008) to project these
>> high-dimensional vectors (about 10,000 dimensions) onto the
>> two-dimensional plane. The result is a "contextual map" of the data,
>> in which each report is represented by a point. Reports that share
>> many words are mapped close to one another, so clusters emerge that
>> group reports on distinct topics. The data also contains categories
>> for the reports, such as "Explosive Hazard", "Enemy Action" and
>> "Criminal Event", which can be used to colour the points and encode
>> more information. The app is interactive and links the map back to
>> the original text: hovering over a point shows the title of its
>> report, so the user can quickly get an impression of the topics in
>> different regions of the map, and clicking on a point opens the full
>> report, including date and time, title and full text, together with
>> the location of the event on a geographic map. To explore regions in
>> more detail, the map can be zoomed and panned. Above a certain zoom
>> level, the five most relevant words for each report are shown
>> beneath its point.
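>>
>> To make the pipeline concrete, here is a minimal sketch in Python
>> using scikit-learn's CountVectorizer and TSNE. It is only an
>> illustration, not my actual implementation, and the report texts and
>> categories below are made-up placeholders standing in for the parsed
>> Iraq War Diaries:
>>
>> from sklearn.feature_extraction.text import CountVectorizer
>> from sklearn.manifold import TSNE
>>
>> # Placeholder corpus and categories; in the real app each string is
>> # the full text of one report from the Iraq War Diaries.
>> reports = [
>>     "ied explosion on route, no casualties reported",
>>     "detainee transferred to iraqi police, signs of abuse reported",
>>     "small arms fire on patrol, one enemy combatant wounded",
>> ]
>> categories = ["Explosive Hazard", "Criminal Event", "Enemy Action"]
>>
>> # Encode each report as a high-dimensional word-count vector
>> # (roughly 10,000 dimensions on the full dataset).
>> vectorizer = CountVectorizer(max_features=10000, stop_words="english")
>> counts = vectorizer.fit_transform(reports)
>>
>> # Project the count vectors onto the plane with t-SNE
>> # (van der Maaten & Hinton, 2008); reports sharing many words end up
>> # close together on the resulting "contextual map".
>> embedding = TSNE(n_components=2, perplexity=2,
>>                  init="random").fit_transform(counts.toarray())
>>
>> # embedding[i] is the (x, y) position of report i on the map; the
>> # interactive front end draws one point per row, coloured by category.
>> for (x, y), cat, text in zip(embedding, categories, reports):
>>     print(f"({x:.1f}, {y:.1f}) [{cat}] {text[:40]}")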
>>
>> How this works can be better understood by watching this short
>> screencast of the app in use:
>> http://dl.dropbox.com/u/1599047/wikileaks-screencast.mov
>> The attached presentation explains the algorithm in more detail (and
>> somewhat more academically).
>>
>> This visualisation therefore provides a novel way of quickly
>> surveying a massive dataset such as the Wikileaks data from "30,000
>> feet". The size and colouring of the clusters give an immediate
>> impression of the relative number of documents per topic and
>> category. Hovering over points and reading the titles gives quick
>> access to the types of documents in different clusters and regions.
>> Once a region of interest has been found, it can be zoomed into and
>> investigated by reading the full-text reports.
>>
>> For a journalist receiving such a large dataset, this provides a
>> more intuitive way of exploring the documents and their
>> relationships. Usually, the journalist would try to filter the data
>> by dates/times, categories or keywords to find interesting stories
>> (http://www.guardian.co.uk/news/datablog/2011/jan/31/wikileaks-data-journalism,
>> http://www.nytimes.com/2011/01/30/magazine/30Wikileaks-t.html). But
>> this requires that she already knows what she is looking for and has
>> a preconceived notion of what the data is about. The risk is
>> therefore that interesting and newsworthy stories are missed because
>> they are hidden in too much data, like trees in a forest. My
>> approach is an attempt to "let the data speak" first, to get a
>> high-level overview of the data and aid serendipitous discovery of
>> unexpected documents. Of course, this approach can still be combined
>> with traditional filters and keyword searches, as sketched below.
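>>
>> As a purely hypothetical illustration of how such a filter could sit
>> on top of the contextual map (reusing the placeholder reports and
>> categories from the sketch above), a keyword or category filter
>> simply selects which points remain highlighted:
>>
>> import numpy as np
>>
>> def filter_points(reports, categories, keyword=None, category=None):
>>     """Boolean mask of the map points to highlight."""
>>     mask = np.ones(len(reports), dtype=bool)
>>     if keyword is not None:
>>         mask &= np.array([keyword.lower() in r.lower() for r in reports])
>>     if category is not None:
>>         mask &= np.array([c == category for c in categories])
>>     return mask
>>
>> # e.g. highlight only "Criminal Event" reports mentioning detainees,
>> # while the rest of the contextual map is greyed out.
>> highlight = filter_points(reports, categories,
>>                           keyword="detainee", category="Criminal Event")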
>>
>> For example, my algorithm clusters together many of the reports that
>> were the source material for the news stories about allied forces
>> ignoring torture and abuse of detainees by Iraqi security forces
>> (http://www.guardian.co.uk/world/2010/oct/22/iraq-war-logs-military-leaks).
>> Even without a preconceived notion of relevant keywords, my app
>> would have allowed this story to be uncovered in the data.
>>
>> So far, my app is a proof-of-concept prototype and only runs
>> locally. To generate a broader discussion among people involved in
>> data journalism, visualisation and Machine Learning research, I
>> would like to rewrite the app and create an interactive webpage to
>> host it. OKCon would provide an ideal venue to build an improved,
>> web-ready version of the app during a hackday and then to discuss it
>> with people from the journalism world during the conference. The
>> goal would be to assemble a small team of perhaps five people with
>> technical and graphic design skills to finish a reimplementation
>> during the hackday. Non-technical people from the journalism world
>> would be welcome to help improve the design, ensure usability and
>> promote the webpage on the internet and in the media. For the
>> hackday, we would only need a small room for 5-10 people. Ideally,
>> the OKFN could also provide webhosting. During the conference it
>> would be good to have a small booth or space where the results can
>> be demonstrated and the developers can discuss the project with
>> visitors.
>>
>> I have presented my project to several senior Machine Learning
>> researchers, who all support the idea and agree that it is a novel
>> application of a sophisticated algorithm in an area where such
>> methods have rarely been employed. This project would thus be a
>> chance to build bridges between the academic Machine Learning world
>> and the data journalism and data visualisation world. Given
>> successful completion of the project, there is the possibility of
>> publication in an academic journal, such as this upcoming special
>> issue on data visualisation:
>> http://www.cc.gatech.edu/~lebanon/pub/cfp-10618-201103226.pdf
>>
>> About me:
>> I studied Statistics and Computer Science at Humboldt-Universität
>> Berlin and the University of British Columbia, focusing on Machine
>> Learning research. Following my Master's degree, I will begin a PhD
>> in Computer Science at the University of Toronto, working on Machine
>> Learning applications in genetics. I have previously contributed to
>> OKFN by helping out on the yourtopia.net project with Guo Xu,
>> Friedrich Lindenberg and Dirk Heine.
>>
>> Best regards,
>> Hannes Bretschneider
>>
>> References:
>> L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional
>> Data Using t-SNE. Journal of Machine Learning
>> Research 9(Nov):2579-2605, 2008.
>> More info about t-SNE: http://homepage.tudelft.nl/19j49/t-SNE.html
>
>
>
>
> _______________________________________________
> OKCon-Programme mailing list
> OKCon-Programme at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/okcon-programme