[School-of-data] Introduction (myself and a corpus/datasets)

Alexandre Rafalovitch arafalov at gmail.com
Thu May 24 19:51:58 UTC 2012


Jessy,

I am glad you like it.

Some of the description is in the research paper on the website, but
your point is well taken. There is a need for more introductory
material. That has been on my todo list for a while. :-)

As to n-grams and other interesting things, that's part of the
'deeper' patterns. UN language is a curious beast that currently
confuses any market/research NLP software. With n-grams of n > 100, it
is no surprise. But yes, there are enough deep language research topics
in the corpus to keep a whole department busy (machine translation,
deep parsing, terminology, sublanguages, graph analysis, Named Entity
Recognition, text generation, etc.). I just did not bring them up in my
first email because this group's focus seems to be on data, not so much
on natural language. There is more background, though, at
http://blog.outerthoughts.com/2007/09/unravelling-the-black-magic-of-bureaucracy/
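
As a rough illustration of the vote/n-gram idea from the quoted message
below, here is a minimal Python sketch. It counts, for each vote
outcome, how many texts contain a given word n-gram. The function names,
the (text, outcome) input shape and the sample sentences are all
assumptions made up for this example; none of them reflect the actual
corpus format.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a text, lowercased and space-joined."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts_by_outcome(items, n=2):
    """Count, per vote outcome, how many texts contain each n-gram.

    `items` is a list of (text, outcome) pairs. This input shape is a
    hypothetical one chosen for the sketch, not the corpus layout.
    """
    counts = {}
    for text, outcome in items:
        # set() so that each n-gram is counted at most once per text
        counts.setdefault(outcome, Counter()).update(set(word_ngrams(text, n)))
    return counts


# Made-up sample data, for illustration only
sample = [
    ("calls upon all states", "yes"),
    ("calls upon member states", "yes"),
    ("strongly condemns all states", "no"),
]
counts = ngram_counts_by_outcome(sample, n=2)
```

Comparing counts["yes"] and counts["no"] for the same n-gram would be a
first, very crude step towards the correlation being discussed.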

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Thu, May 24, 2012 at 3:30 PM, Jessy Kate Schingler
<jessy at jessykate.com> wrote:
> this is really cool alex!
>
> it would be awesome if there was a summary of columns with basic description
> of each field, data types, and a few example rows, available on the main
> page so one could get a sense of the data without downloading and extracting
> it.
>
> another interesting embedded data set that comes to mind would be looking at
> correlations between yes/no votes and specific key words or n-grams.
>
> anyway, neat stuff!
>
> On Thu, May 24, 2012 at 11:56 AM, Alexandre Rafalovitch <arafalov at gmail.com>
> wrote:
>>
>> Hello,
>>
>> I am a software developer (Java, .Net, web technologies) with an interest
>> in learning the skills of a data scientist. I have done some data
>> manipulation and read some books, but still have a very long road
>> ahead of me. So, in this context, I am a learner.
>>
>> I also have a corpus that I compiled and would love people to play
>> with. It is at http://www.uncorpora.org/ and contains (some of) the
>> resolutions of the General Assembly of the United Nations. The
>> resolutions are broken into paragraphs and are aligned on that level
>> for all six languages of the UN (Arabic, Chinese, English, French,
>> Russian, Spanish). The corpus also contains voting records for the
>> resolutions. The corpus is unrestricted for research purposes.
>>
>> The corpus is normally used for Machine Learning research because it
>> has Arabic as well as other languages that are hard to find in public
>> multilingual corpora. It is not actually large enough for real ML
>> training, so it seems to be popular as a secondary/complementary source.
>>
>> At the same time, there are a number of interesting datasets hiding in
>> the corpus, both mono- and multi-lingual. I am planning to work on
>> some of those, but will welcome anybody else doing it as well, or
>> perhaps even better than me.
>>
>> Examples of datasets that I had a very preliminary look at are:
>> *) Co-voting patterns for any country, where you can select a country
>> on a map and a second map shows color coding of how similar other
>> countries voted on those issues ("agree votes" - "against votes"). A
>> preliminary visualization is at:
>> http://www.outerthoughts.com/files/votes/test1.html
>> *) Evolution of translation of preambulatory/operative phrases one
>> sees in legal documents (e.g. "Recalling", "Requests", "Demands"). I
>> have done a preliminary analysis for English/Russian and the results
>> were much more complex than expected (I expected basically pairs of
>> entries). You can see a test visualization at:
>> http://www.outerthoughts.com/files/test_seadragon.html . I am hoping
>> to redo this one with interactive per-year breakdown, collapsing the
>> repeat indicators (such as "Recalling also" -> "Recalling", but not
>> "Recalling once again") and maybe picking up a language pair more
>> people can read (French? Spanish?).
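
One possible reading of the "agree votes" - "against votes" score in the
first item above, as a minimal Python sketch. The interpretation (same
vote counts +1, opposite vote counts -1, abstentions ignored), the
function name, the data layout and the country names are all
assumptions for illustration, not a description of how the linked
visualization was built.

```python
from collections import defaultdict


def covoting_scores(votes, country):
    """Score every other country as (agreeing votes) - (opposing votes).

    `votes` maps a resolution id to a dict of country -> "yes", "no"
    or "abstain". This layout is hypothetical, not the corpus format.
    """
    scores = defaultdict(int)
    for ballot in votes.values():
        mine = ballot.get(country)
        if mine not in ("yes", "no"):
            continue  # selected country abstained or did not vote
        for other, theirs in ballot.items():
            if other == country or theirs not in ("yes", "no"):
                continue  # skip the country itself and abstentions
            scores[other] += 1 if theirs == mine else -1
    return dict(scores)


# Made-up sample data, for illustration only
votes = {
    "R1": {"Alpha": "yes", "Beta": "yes", "Gamma": "no"},
    "R2": {"Alpha": "no", "Beta": "yes", "Gamma": "no"},
    "R3": {"Alpha": "yes", "Beta": "yes", "Gamma": "abstain"},
}
scores = covoting_scores(votes, "Alpha")
```

The resulting per-country numbers are what a second map could color
code once a country is selected on the first one.
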
>>
>> There are other datasets hiding in there; these are just the easiest to
>> dig out. If anybody is interested, I would be happy to pursue the
>> topic further and in-depth either publicly or privately (e.g. for
>> joint work).
>>
>> Regards,
>>   Alex.
>
>
>
>
> --
> Jessy
> http://jessykate.com
>



