[School-of-data] Introduction (myself and a corpus/datasets)

Thu May 24 19:30:09 UTC 2012

this is really cool alex!

it would be awesome if there was a summary of columns with basic
description of each field, data types, and a few example rows, available on
the main page so one could get a sense of the data without downloading and
extracting it.

another interesting embedded data set that comes to mind would be looking
at correlations between yes/no votes and specific key words or n-grams.
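
for what it's worth, here is a rough sketch of that n-gram idea in python. it
assumes (hypothetically) that each record pairs the text of a resolution with
one country's yes/no vote; the function names and the toy records are made up
for illustration and are not taken from the corpus.

```python
from collections import Counter


def ngrams(text, n=2):
    """Return the set of word n-grams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def yes_rate_by_ngram(records, n=2, min_count=2):
    """Share of "yes" votes among the records that contain each n-gram.

    records: iterable of (text, vote) pairs, vote being "yes" or "no"
    (an assumed layout, not the corpus's actual format).
    """
    seen = Counter()  # records containing the n-gram
    yes = Counter()   # of those, records with a "yes" vote
    for text, vote in records:
        for gram in ngrams(text, n):
            seen[gram] += 1
            if vote == "yes":
                yes[gram] += 1
    # drop rare n-grams, whose rates would be mostly noise
    return {g: yes[g] / seen[g] for g in seen if seen[g] >= min_count}


# made-up toy records, only to show the shape of the input
toy = [
    ("calls upon all states", "yes"),
    ("calls upon member states", "yes"),
    ("strongly condemns all states", "no"),
    ("strongly condemns the attack", "no"),
]
rates = yes_rate_by_ngram(toy)
```

a yes-rate far from the overall yes-rate would flag an n-gram worth a closer
look; a real analysis would also need a significance test and far more data.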

anyway, neat stuff!

On Thu, May 24, 2012 at 11:56 AM, Alexandre Rafalovitch
<arafalov at gmail.com> wrote:

> Hello,
>
> I am a software developer (Java, .Net, web technologies) with interest
> in learning the skills of a data scientist. I have done some data
> manipulation and read some books, but still have a very long road
> ahead of me. So, in this context, I am a learner.
>
> I also have a corpus that I compiled and would love people to play
> with. It is at http://www.uncorpora.org/ and contains (some of) the
> resolutions of the General Assembly of the United Nations. The
> resolutions are broken into paragraphs and are aligned on that level
> for all six languages of the UN (Arabic, Chinese, English, French,
> Russian, Spanish). The corpus also contains voting records for the
> resolutions. The corpus is unrestricted for research purposes.
>
> The corpus is normally used for Machine Learning research because it
> has Arabic as well as other languages that are hard to find in public
> multilingual corpora. It is not actually large enough for real ML
> training, so it seems to be popular as a secondary/complementary source.
>
> At the same time, there are a number of interesting datasets hiding in
> the corpus, both mono- and multi-lingual. I am planning to work on
> some of those, but will welcome anybody else doing it as well, or
> perhaps even better than me.
>
> Examples of datasets that I had a very preliminary look at are:
> *) Co-voting patterns for any country, where you can select a country
> on a map and a second map shows color coding of how similar other
> countries voted on those issues ("agree votes" - "against votes"). A
> preliminary visualization is at:
> http://www.outerthoughts.com/files/votes/test1.html
> *) Evolution of translation of preambulatory/operative phrases one
> sees in legal documents (e.g. "Recalling", "Requests", "Demands"). I
> have done a preliminary analysis for English/Russian and the results
> were much more complex than expected (I expected basically pairs of
> entries). You can see a test visualization at:
> http://www.outerthoughts.com/files/test_seadragon.html . I am hoping
> to redo this one with interactive per-year breakdown, collapsing the
> repeat indicators (such as "Recalling also" -> "Recalling", but not
> "Recalling once again") and maybe picking up a language pair more
> people can read (French? Spanish?).
>
> There are other datasets hiding in there; these are just the easiest to
> dig out. If anybody is interested, I would be happy to pursue the
> topic further and in-depth either publicly or privately (e.g. for
> joint work).
>
> Regards,
>   Alex.
>
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> book)
>

-- 
Jessy
http://jessykate.com