[School-of-data] Introduction (myself and a corpus/datasets)

Thu May 24 18:56:14 UTC 2012

Hello,

I am a software developer (Java, .Net, web technologies) with an interest
in learning the skills of a data scientist. I have done some data
manipulation and read some books, but still have a very long road
ahead of me. So, in this context, I am a learner.

I also have a corpus that I compiled and would love people to play
with. It is at http://www.uncorpora.org/ and contains (some of) the
resolutions of the General Assembly of the United Nations. The
resolutions are broken into paragraphs and are aligned on that level
for all six official UN languages (Arabic, Chinese, English, French,
Russian, Spanish). The corpus also contains voting records for the
resolutions. The corpus is unrestricted for research purposes.

The corpus is normally used for Machine Learning research because it
has Arabic as well as other languages that are hard to find in public
multilingual corpora. It is not actually large enough for real ML
training, so it seems to be popular as a secondary/complementary source.

At the same time, there are a number of interesting datasets hiding in
the corpus, both mono- and multi-lingual. I am planning to work on
some of those, but will welcome anybody else doing it as well, or
perhaps even better than me.

Examples of datasets that I had a very preliminary look at are:
*) Co-voting patterns for any country, where you can select a country
on a map and a second map shows color coding of how similar other
countries voted on those issues ("agree votes" - "against votes"). A
preliminary visualization is at:
http://www.outerthoughts.com/files/votes/test1.html
*) Evolution of translation of preambulatory/operative phrases one
sees in legal documents (e.g. "Recalling", "Requests", "Demands"). I
have done a preliminary analysis for English/Russian and the results
were much more complex than expected (I expected basically pairs of
entries). You can see a test visualization at:
http://www.outerthoughts.com/files/test_seadragon.html . I am hoping
to redo this one with interactive per-year breakdown, collapsing the
repeat indicators (such as "Recalling also" -> "Recalling", but not
"Recalling once again") and maybe picking up a language pair more
people can read (French? Spanish?).
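
As a rough illustration of the two ideas above, here is a minimal Python
sketch. It is not the code behind the linked visualizations. The record
layout, the placeholder country names, the function names and the list
of repeat indicators ("also", "further") are all assumptions made for
the example. The co-voting score assumes that an identical vote counts
as "agree" and any other vote as "against".

```python
import re
from collections import defaultdict

# Hypothetical vote records: (resolution_id, country, vote).
# The real corpus layout may differ; these rows are made up.
SAMPLE_VOTES = [
    ("R1", "Country A", "Y"), ("R1", "Country B", "Y"), ("R1", "Country C", "N"),
    ("R2", "Country A", "N"), ("R2", "Country B", "Y"), ("R2", "Country C", "N"),
    ("R3", "Country A", "Y"), ("R3", "Country B", "Y"),
]


def co_voting_scores(votes, target):
    """Return {country: agree votes - against votes} relative to target.

    Only resolutions on which both countries have a recorded vote count.
    Assumption: same vote = agree, any different vote = against.
    """
    by_resolution = defaultdict(dict)
    for resolution, country, vote in votes:
        by_resolution[resolution][country] = vote

    scores = defaultdict(int)
    for ballot in by_resolution.values():
        if target not in ballot:
            continue
        for country, vote in ballot.items():
            if country != target:
                scores[country] += 1 if vote == ballot[target] else -1
    return dict(scores)


# Assumed list of trailing repeat indicators; extend as the data requires.
REPEAT_INDICATORS = re.compile(r"\s+(?:also|further)$", re.IGNORECASE)


def collapse_phrase(phrase):
    """Strip a trailing repeat indicator, leaving other phrases untouched."""
    return REPEAT_INDICATORS.sub("", phrase.strip())


if __name__ == "__main__":
    print(co_voting_scores(SAMPLE_VOTES, "Country A"))
    print(collapse_phrase("Recalling also"))
    print(collapse_phrase("Recalling once again"))
```

The scores for the selected country could then be mapped to a color
scale for the second map, and the collapsed phrases could be grouped
by year before comparing their translations.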

There are other datasets hiding in there; these are just the easiest to
dig out. If anybody is interested, I would be happy to pursue the
topic further and in-depth either publicly or privately (e.g. for
joint work).

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working. (Anonymous - via GTD
book)