[School-of-data] Introduction (myself and a corpus/datasets)

Alexandre Rafalovitch arafalov at gmail.com
Thu May 24 20:44:54 UTC 2012


Thanks Daniel,

I am well aware of UNDemocracy, but unfortunately that project never
gained hockey-stick momentum, and its creator moved on to other big
things (https://scraperwiki.com/, I believe). Our approaches differ:
he was working from available data (obtained by any means), while I am
working from authoritative data (obtained with permission). Also, his
focus is on navigation and discovery, while mine is on multilingual
aspects and bulk analysis.

That's nit-picking, however; at a higher level we are aligned. :-)

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Thu, May 24, 2012 at 3:45 PM, Daniel Dietrich
<daniel.dietrich at okfn.org> wrote:
> Hi Alex,
>
> Great to hear from you. Your project looks awesome!
>
> One project immediately comes to mind: did you know about:
>
> http://www.undemocracy.com/
>
>
> Daniel
>
>
>
> On 24 May 2012, at 20:56, Alexandre Rafalovitch wrote:
>
>> Hello,
>>
>> I am a software developer (Java, .Net, web technologies) with an
>> interest in learning the skills of a data scientist. I have done some
>> data manipulation and read some books, but still have a very long road
>> ahead of me. So, in this context, I am a learner.
>>
>> I also have a corpus that I compiled and would love people to play
>> with. It is at http://www.uncorpora.org/ and contains (some of) the
>> resolutions of the General Assembly of the United Nations. The
>> resolutions are broken into paragraphs and aligned at that level
>> for all six languages of the UN (Arabic, Chinese, English, French,
>> Russian, Spanish). The corpus also contains voting records for the
>> resolutions. The corpus is unrestricted for research purposes.
>>
>> The corpus is normally used for Machine Learning research because it
>> has Arabic as well as other languages that are hard to find in public
>> multilingual corpora. It is not actually large enough for real ML
>> training, so it seems to be popular as a secondary/complementary source.
>>
>> At the same time, there are a number of interesting datasets hiding in
>> the corpus, both mono- and multilingual. I am planning to work on
>> some of those, but will welcome anybody else doing so as well, perhaps
>> even better than me.
>>
>> Examples of datasets that I have had a very preliminary look at are:
>> *) Co-voting patterns for any country, where you can select a country
>> on a map and a second map shows color coding of how similarly other
>> countries voted on those issues ("agree votes" - "against votes"). A
>> preliminary visualization is at:
>> http://www.outerthoughts.com/files/votes/test1.html
>> *) Evolution of translation of the preambulatory/operative phrases one
>> sees in legal documents (e.g. "Recalling", "Requests", "Demands"). I
>> have done a preliminary analysis for English/Russian and the results
>> were much more complex than expected (I expected basically pairs of
>> entries). You can see a test visualization at:
>> http://www.outerthoughts.com/files/test_seadragon.html . I am hoping
>> to redo this one with an interactive per-year breakdown, collapsing the
>> repeat indicators (such as "Recalling also" -> "Recalling", but not
>> "Recalling once again") and maybe picking a language pair more
>> people can read (French? Spanish?).
>>
>> There are other datasets hiding in there; these are just the easiest
>> to dig out. If anybody is interested, I would be happy to pursue the
>> topic further and in depth, either publicly or privately (e.g. for
>> joint work).
>>
>> Regards,
>>   Alex.
>
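The co-voting score mentioned in the quoted message ("agree votes" -
"against votes") could be sketched roughly as below. This is only an
illustration under assumptions: the input layout (a dict of resolution
id to per-country votes), the function name co_voting_scores, and the
sample data are all hypothetical and do not reflect the corpus's actual
format. "Agree" is taken to mean an identical recorded vote, and
resolutions where either country has no recorded vote are skipped.

```python
from collections import defaultdict


def co_voting_scores(votes, country):
    """Score every other country against `country`.

    `votes` maps a resolution id to a {country: vote} dict (a hypothetical
    layout, not the corpus's real format). The score is the number of
    resolutions where both cast the same vote minus the number where they
    cast different votes. Resolutions where either country has no
    recorded vote are skipped.
    """
    scores = defaultdict(int)
    for ballot in votes.values():
        reference = ballot.get(country)
        if reference is None:
            continue  # the selected country did not vote on this one
        for other, vote in ballot.items():
            if other == country:
                continue
            scores[other] += 1 if vote == reference else -1
    return dict(scores)


# Made-up sample data, for illustration only.
sample_votes = {
    "R1": {"A": "Y", "B": "Y", "C": "N"},
    "R2": {"A": "N", "B": "Y", "C": "N"},
    "R3": {"A": "Y", "B": "Y"},
}
print(co_voting_scores(sample_votes, "A"))
```

The resulting per-country scores are the kind of values a second map
could color-code once a country is selected on the first.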