[Data-Driven-Journalism] Visualizing the Wikileaks Iraq War Diaries: Call for Participation

Hannes Bretschneider habretschneider at gmail.com
Mon May 30 09:56:58 UTC 2011

I am looking for Python, Javascript and other developers who want to
join me in building an interactive webapp to visualize the Wikileaks
Iraq War Diaries during the pre-OKCon 2011 (Open Knowledge Conference)
hackdays in Berlin, June 27th and 28th!
(Apologies for cross-posting)

What is it?
The app will be a working demonstration of how Machine Learning and Data
Visualization may help journalists who are looking to discover
newsworthy stories in large document dumps. The app will be a website
that offers a full-text visualization of the Wikileaks data. Using
Machine Learning algorithms we will build a "contextual map" of the
Iraq war reports in which each report is represented as a point in a
cloud. Similar reports will be placed close together, such that
clusters of reports emerge that all deal with similar
topics. Additional metadata from the reports will be used to, for
example, color the points according to their category and use the
geodata in the reports to show the location of an event on a map.

The user will be able to click the points to view the full reports,
zoom in, pan around and filter by keywords to explore the data.

You want to do what?!?
Have a look at this screencast that I have previously recorded:

The cloud of points represents the documents. The different colors
represent categories of reports such as "Enemy Action", "Criminal
Event" and "Explosive Hazard". Because documents within a given
category use similar words, they are often grouped together. The user
can be seen hovering over the points with the mouse to view the report
titles. Clicking a report shows the full text including the date and
time of the incident and a map of where it took place. Zooming in and
panning around lets the user explore individual regions in more
detail. Above a certain zoom level, the five words that are most
characteristic of a document are displayed below its point.
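
One way such "characteristic words" could be chosen is a TF-IDF score:
words that are frequent in one report but rare in the others rank
highest. The sketch below is only an illustration under that
assumption, not necessarily how the prototype does it. The function
names and the example texts are made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a report and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def characteristic_words(reports, index, k=5):
    """Return the k words with the highest TF-IDF score in reports[index]."""
    tokenized = [tokenize(r) for r in reports]
    # Document frequency: in how many reports does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    n = len(reports)
    tf = Counter(tokenized[index])
    scores = {w: count * math.log(n / df[w]) for w, count in tf.items()}
    # Sort by descending score; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Made-up example texts, not actual report content.
reports = [
    "patrol found ied near checkpoint ied was cleared",
    "patrol reported small arms fire near checkpoint",
    "patrol detained suspect near market",
]
top = characteristic_words(reports, 0, k=2)
print(top)
```

Words that occur in every report (here "patrol" and "near") get a score
of zero and therefore never show up as characteristic.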

What's the goal?
The goal is to have a two-day sprint at the pre-OKCon Hackdays to
build a website that reimplements my prototype (which is only a local
app) using web technologies. If possible we will also implement new
features (such as keyword filters) and use additional data (the
prototype only uses a subset of the Iraq War Diaries). During the
conference we will get a speaking slot to present our result and thus
have a great chance to generate some attention for our project, as
well as establish a dialogue with the data journalism community!

What's so interesting about that?
This project will combine sophisticated Machine Learning algorithms
with modern visualization technology for the web and apply it to a
topic of very high public interest.

Journalism does not traditionally use quantitative and statistical
methods like Machine Learning. But document dumps like Wikileaks are
likely to become a more important part of journalistic work in the
future and there is a great opportunity to help journalists cope with
the deluge of data by making sophisticated Machine Learning tools
accessible for non-experts.

Given successful completion of the project, there's the possibility of
publication in an academic journal, such as this upcoming special
issue on data visualisation

I'm interested! Give me a bit more background!
I'm academically involved in Machine Learning research and when the
Wikileaks data was published I gained the strong impression that this
data contained untapped potential in the form of hidden relationships,
which could be better understood using modern Machine Learning
algorithms. Several months ago I therefore developed the
prototype of an app to show how such hidden relationships could be
intuitively revealed and visualised with Machine Learning tools.

My app uses the full text of the reports from the Iraq War Diaries
which are written by the troops on the ground and encodes the text as
numerical vectors by simple counting of words. I use the t-SNE
algorithm (van der Maaten & Hinton, 2008) to project these
high-dimensional (about 10,000 dimensions) vectors into the
two-dimensional plane. The result is a "contextual map" of the data,
in which each report is represented by a point on the map. Reports
which share many words are mapped close to one another. Clusters
therefore emerge that collect reports on distinct topics.
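
The word-count encoding described above can be sketched in a few lines
of plain Python. This is a minimal illustration with made-up example
texts and hypothetical function names. It does not reproduce the t-SNE
projection itself, and it uses cosine similarity only as a simple
stand-in to show that reports sharing many words end up with similar
vectors.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a report and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_vectors(reports):
    """Encode each report as a vector of word counts over a shared vocabulary."""
    counted = [Counter(tokenize(r)) for r in reports]
    vocabulary = sorted(set().union(*counted))
    # A Counter returns 0 for words that a report does not contain.
    vectors = [[counts[w] for w in vocabulary] for counts in counted]
    return vocabulary, vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up example texts, not actual report content.
reports = [
    "ied found near checkpoint",
    "ied cleared near checkpoint",
    "detainee transferred to police station",
]
vocabulary, vectors = count_vectors(reports)
sim_close = cosine_similarity(vectors[0], vectors[1])
sim_far = cosine_similarity(vectors[0], vectors[2])
print(len(vocabulary), sim_close, sim_far)
```

With the real data the vocabulary has about 10,000 entries instead of
ten, and t-SNE takes the place of the pairwise comparison by placing
every report in the plane at once.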

This visualisation therefore provides a novel way of quickly looking
at such massive datasets as the Wikileaks data from "30,000 feet"
above. Size and category of the clusters provide an immediate
impression of the relative number of documents on each topic and
category. Hovering and viewing the titles provides quick access to the
types of documents in different clusters and regions. If a region of
interest has been found, it can be further zoomed in and investigated
by reading the full text reports.

For a journalist receiving such a large dataset this provides a more
intuitive way of exploring the documents and their
relationships. Usually, the journalist would try to filter the data by
dates/times, categories or keywords to find interesting stories
(see http://www.nytimes.com/2011/01/30/magazine/30Wikileaks-t.html). But
this requires that she already knows what she's looking for and has a
pre-conceived notion of what the data is about. The risk is therefore
that interesting and newsworthy stories are missed because they are
hidden in too much data, like a single tree in a forest. My approach
is an attempt to "let the data speak" first to get a high-level
overview of the data and aid serendipitous discovery of unexpected
documents. Of course, this approach can still be combined with
traditional filters and keyword searches.
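
For comparison, the traditional filtering approach amounts to something
like the following sketch. The `Report` fields, the function name and
the example records (including their dates) are hypothetical and only
meant to show the kind of query a journalist has to formulate up front.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Report:
    day: date
    category: str
    text: str


def filter_reports(reports, start=None, end=None, category=None, keyword=None):
    """Keep the reports that match every criterion that is not None."""
    selected = []
    for r in reports:
        if start is not None and r.day < start:
            continue
        if end is not None and r.day > end:
            continue
        if category is not None and r.category != category:
            continue
        # Keyword matching is a case-insensitive substring test.
        if keyword is not None and keyword.lower() not in r.text.lower():
            continue
        selected.append(r)
    return selected


# Made-up example records, not actual report content.
reports = [
    Report(date(2006, 3, 1), "Enemy Action", "Small arms fire near checkpoint"),
    Report(date(2006, 3, 5), "Explosive Hazard", "IED found near checkpoint"),
    Report(date(2007, 1, 9), "Explosive Hazard", "IED cleared on main road"),
]
hits = filter_reports(reports, end=date(2006, 12, 31), keyword="ied")
print(len(hits))
```

Every argument encodes a guess about where the story is, which is
exactly the pre-conceived notion the map view tries to avoid.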

For example, my algorithm clusters together many of the reports that
were the source material for news stories about allied forces ignoring
torture and abuse of detainees by Iraqi security forces. Even without
a pre-conceived notion of relevant keywords, my app would have allowed
a journalist to uncover this story in the data.

Who are you looking for?
I'd be happy to hear from people with technical skills in one or more
of the following areas:

 - Web Frameworks (I am proposing to use a Python framework,
   e.g. Django or Flask, but if you're a master of Ruby on Rails, I'm
   not opposed)

 - Data Visualization on the Web (so far, the app uses Processing and
   I favour ProcessingJS for the web implementation. Any Javascript
   gurus are highly welcome!)

 - Databases (I use MySQL in Python, wrapped into SQLAlchemy)

 - Machine Learning, Natural Language Processing and Numerical
   Computing (Python, Numpy, C++)

 - Web Design (HTML(5), CSS, etc... I'm sure there are people who can
   make this MUCH better looking than my prototype)

 - Web Server Administration (Linux, Apache, etc. possibly Amazon EC2
   or similar)

But I'm not a programmer!
This is mainly a project for people with technical skills, but there
are a few other areas where we could use help:

  - (Data) Journalists: I'd love to get in touch with some journalists
    who would give some feedback, test the app and maybe help with
    writing texts and establishing a dialogue with the journalism
    community

  - Graphic Designers: Even if you don't code, you'd be welcome to
    help with design.

Who are YOU, anyway?
I have studied Statistics and Computer Science at Humboldt-Universität
Berlin and the University of British Columbia, focusing on Machine
Learning research. Following my Master's degree I will begin a PhD in
Computer Science at the University of Toronto, working on Machine
Learning applications in Genetics. I have previously contributed to
OKFN by helping out on the yourtopia.net project.

I'm excited! Where do I sign up?
Please send me an email at habretschneider at gmail.com, and let me know
what your skills are and how you'd like to contribute. Then we'll get
together a team in time for the OKCon hackdays.

References: L.J.P. van der Maaten and G.E. Hinton. Visualizing
High-Dimensional Data Using t-SNE. Journal of Machine Learning
Research 9(Nov):2579-2605, 2008.
More info about t-SNE: http://homepage.tudelft.nl/19j49/t-SNE.html

Looking forward to hearing from you!
Best regards,
Hannes Bretschneider