[School-of-data] Introduction (myself and a corpus/datasets)

Daniel Dietrich daniel.dietrich at okfn.org
Thu May 24 19:45:23 UTC 2012


Hi Alex,

Great to hear from you. Your project looks awesome!

One project immediately comes to mind. Do you know about this one?

http://www.undemocracy.com/


Daniel


 
On 24 May 2012, at 20:56, Alexandre Rafalovitch wrote:

> Hello,
> 
> I am a software developer (Java, .Net, web technologies) with interest
> in learning the skills of a data scientist. I have done some data
> manipulation and read some books, but still have a very long road
> ahead of me. So, in this context, I am a learner.
> 
> I also have a corpus that I compiled and would love people to play
> with. It is at http://www.uncorpora.org/ and contains (some of) the
> resolutions of the General Assembly of the United Nations. The
> resolutions are broken into paragraphs and are aligned on that level
> for all six official languages of the UN (Arabic, Chinese, English, French,
> Russian, Spanish). The corpus also contains voting records for the
> resolutions. The corpus is unrestricted for research purposes.
> 
> The corpus is normally used for Machine Learning research because it
> has Arabic as well as other languages that are hard to find in public
> multilingual corpora. It is not actually large enough for real ML
> training, so it seems to be popular as a secondary/complementary source.
> 
> At the same time, there are a number of interesting datasets hiding in
> the corpus, both mono- and multi-lingual. I am planning to work on
> some of those, but will welcome anybody else doing it as well, or
> perhaps even better than me.
> 
> Examples of datasets that I had a very preliminary look at are:
> *) Co-voting patterns for any country, where you can select a country
> on a map and a second map shows color coding of how similar other
> countries voted on those issues ("agree votes" - "against votes"). A
> preliminary visualization is at:
> http://www.outerthoughts.com/files/votes/test1.html
> *) Evolution of translation of preambulatory/operative phrases one
> sees in legal documents (e.g. "Recalling", "Requests", "Demands"). I
> have done a preliminary analysis for English/Russian and the results
> were much more complex than expected (I expected basically pairs of
> entries). You can see a test visualization at:
> http://www.outerthoughts.com/files/test_seadragon.html . I am hoping
> to redo this one with interactive per-year breakdown, collapsing the
> repeat indicators (such as "Recalling also" -> "Recalling", but not
> "Recalling once again") and maybe picking up a language pair more
> people can read (French? Spanish?).
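
The co-voting score from the first item above ("agree votes" - "against votes") could be sketched roughly as follows. This is only an illustration, not the code behind the linked visualization: the VOTES layout, the country labels, the "yes"/"no"/"abstain" values and the function name covoting_scores are all assumptions, and skipping abstentions is one possible choice among several.

```python
from collections import defaultdict

# Hypothetical vote records: resolution id -> {country: vote}.
# The real corpus layout is not described in this thread; this shape is assumed.
VOTES = {
    "res-1": {"A": "yes", "B": "yes", "C": "no"},
    "res-2": {"A": "no", "B": "yes", "C": "no"},
    "res-3": {"A": "yes", "B": "abstain", "C": "yes"},
}


def covoting_scores(votes, country):
    """Return {other_country: agreements - disagreements} relative to `country`."""
    scores = defaultdict(int)
    for ballot in votes.values():
        own = ballot.get(country)
        if own not in ("yes", "no"):
            continue  # reference country abstained or was absent: skip (assumption)
        for other, vote in ballot.items():
            if other == country or vote not in ("yes", "no"):
                continue  # ignore the country itself and any abstentions
            scores[other] += 1 if vote == own else -1
    return dict(scores)


print(covoting_scores(VOTES, "A"))
```

A map view would then only need to turn each score into a color for the corresponding country.
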
> 
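
The collapsing of repeat indicators mentioned in the second item could be sketched like this. The set REPEAT_INDICATORS and the function name collapse_phrase are hypothetical; only "also" is named in the message, and "further" is added purely as an assumed example of a similar marker.

```python
# Hypothetical list of "repeat indicator" words. Only "also" comes from the
# message; "further" is an assumed example of the same kind of marker.
REPEAT_INDICATORS = {"also", "further"}


def collapse_phrase(phrase):
    """Drop repeat-indicator words, leaving every other word untouched."""
    words = phrase.split()
    kept = [w for w in words if w.lower() not in REPEAT_INDICATORS]
    # Fall back to the original phrase if filtering would remove everything.
    return " ".join(kept) if kept else phrase


print(collapse_phrase("Recalling also"))        # collapses to the base phrase
print(collapse_phrase("Recalling once again"))  # left as it is
```

Because only whole words from the set are dropped, "Recalling once again" keeps its extra meaning and stays a separate entry.
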
> There are other datasets hiding in there; these are just the easiest to
> dig out. If anybody is interested, I would be happy to pursue the
> topic further and in-depth either publicly or privately (e.g. for
> joint work).
> 
> Regards,
>   Alex.
> 
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working. (Anonymous - via GTD
> book)
> 
> _______________________________________________
> School-of-data mailing list
> School-of-data at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/school-of-data




