[School-of-data] Introduction (myself and a corpus/datasets)
daniel.dietrich at okfn.org
Thu May 24 19:45:23 UTC 2012
Great to hear from you. Your project looks awesome!
One project immediately comes to mind: did you know about:
On 24 May 2012, at 20:56, Alexandre Rafalovitch wrote:
> I am a software developer (Java, .Net, web technologies) with an interest
> in learning the skills of a data scientist. I have done some data
> manipulation and read some books, but still have a very long road
> ahead of me. So, in this context, I am a learner.
> I also have a corpus that I compiled and would love people to play
> with. It is at http://www.uncorpora.org/ and contains (some of) the
> resolutions of the General Assembly of the United Nations. The
> resolutions are broken into paragraphs and are aligned on that level
> for all six languages of the UN (Arabic, Chinese, English, French,
> Russian, Spanish). The corpus also contains voting records for the
> resolutions. The corpus is unrestricted for research purposes.
> The corpus is normally used for Machine Learning research because it
> has Arabic as well as other languages that are hard to find in public
> multilingual corpora. It is not actually large enough for real ML
> training, so it seems to be popular as a secondary/complementary source.
> At the same time, there are a number of interesting datasets hiding in
> the corpus, both mono- and multi-lingual. I am planning to work on
> some of those, but will welcome anybody else doing it as well, or
> perhaps even better than me.
> Examples of datasets that I had a very preliminary look at are:
> *) Co-voting patterns for any country, where you can select a country
> on a map and a second map shows color coding of how similarly other
> countries voted on those issues ("agree votes" - "against votes"). A
> preliminary visualization is at:
> *) Evolution of translation of preambulatory/operative phrases one
> sees in legal documents (e.g. "Recalling", "Requests", "Demands"). I
> have done a preliminary analysis for English/Russian and the results
> were much more complex than expected (I expected basically pairs of
> entries). You can see a test visualization at:
> http://www.outerthoughts.com/files/test_seadragon.html . I am hoping
> to redo this one with interactive per-year breakdown, collapsing the
> repeat indicators (such as "Recalling also" -> "Recalling", but not
> "Recalling once again") and maybe picking a language pair more
> people can read (French? Spanish?).
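
For the "collapsing the repeat indicators" step mentioned above, here is a
minimal sketch of one way it might be done. The function name and the list of
indicator words are assumptions for illustration (only "also" comes from the
message; "further" is a guess), not something taken from the corpus tooling.

```python
# Hypothetical list of words that only mark a phrase as a repeat.
# Only "also" is mentioned in the message; "further" is an assumed example.
REPEAT_INDICATORS = ("also", "further")


def collapse_phrase(phrase: str) -> str:
    """Drop a single trailing repeat indicator from a two-word phrase.

    'Recalling also' becomes 'Recalling', while longer phrases such as
    'Recalling once again' are left unchanged.
    """
    words = phrase.strip().split()
    if len(words) == 2 and words[1].lower() in REPEAT_INDICATORS:
        return words[0]
    return " ".join(words)


if __name__ == "__main__":
    for p in ("Recalling also", "Recalling once again", "Requests"):
        print(p, "->", collapse_phrase(p))
```

Restricting the rule to two-word phrases is a deliberately conservative
choice, so that anything longer is kept as its own entry.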
> There are other datasets hiding in there; these are just the easiest to
> dig out. If anybody is interested, I would be happy to pursue the
> topic further and in-depth either publicly or privately (e.g. for
> joint work).
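
The co-voting score described above ("agree votes" - "against votes") could be
computed roughly as follows. This is only a sketch under stated assumptions:
the input layout (resolution id -> country -> vote), the "Y"/"N"/"A" vote
codes, and the choice to ignore abstentions are all hypothetical and may not
match how the corpus actually stores its voting records.

```python
from collections import defaultdict


def covoting_scores(votes, country):
    """Return, per other country, (agreeing votes) - (opposing votes).

    `votes` maps a resolution id to a dict of country -> "Y", "N" or "A".
    Assumption: a vote pair counts only when both countries voted "Y" or
    "N"; abstentions ("A") and absences are skipped.
    """
    scores = defaultdict(int)
    for ballot in votes.values():
        mine = ballot.get(country)
        if mine not in ("Y", "N"):
            continue
        for other, theirs in ballot.items():
            if other == country or theirs not in ("Y", "N"):
                continue
            scores[other] += 1 if theirs == mine else -1
    return dict(scores)


if __name__ == "__main__":
    # Made-up sample data with placeholder country codes.
    sample = {
        "R1": {"AA": "Y", "BB": "Y", "CC": "N"},
        "R2": {"AA": "N", "BB": "N", "CC": "N"},
        "R3": {"AA": "Y", "BB": "A", "CC": "N"},
    }
    print(covoting_scores(sample, "AA"))
```

The resulting per-country numbers are what a second map could use for its
color coding.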
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working. (Anonymous - via GTD