[okfn-hu] Planning for open data workshop in Budapest, 20th May

Béky Miklós miklos.beky at gmail.com
Wed May 4 20:10:43 UTC 2011


2011/5/3 Peter Gervai <grinapo at gmail.com>:

> It would require some better support than I've seen, since it's a
> multiphase work:
> * possible datasets should be registered (dataset name, provider,
> current means of access, current pricing, and everything which is
> easily known)
> * registered datasets require status updates (license if known, legal
> background and related laws, whether a restricted licence is legal(!)
> at all, which body controls or should control the release, who thinks
> they have copyright protection about the material, etc)
> * categorise whether the dataset is free and clear; or if it just
> requires acknowledgement from the control body; or if the status is
> troubled, like when the legal status should be free but some entity
> denies access, and by what grounds, since this all shows that heavy
> legal fight required; or if the legal free status is clear but the
> owners illegally ignore it and we should actually get ("steal") the
> data; or when the data is clearly not accessible legally; or when a
> data is not yet accessible but it is possible that the powers that be
> can be convinced otherwise

This approach makes sense! I deeply agree with it.
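
To make the phases Peter lists a bit more concrete, here is a rough
sketch of what one registry entry could look like. All field names and
status labels are my own invention for illustration, not an existing
CKAN schema; they simply mirror the items in the quoted list.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    """Hypothetical status categories, one per case in the quoted list."""
    FREE_AND_CLEAR = "free and clear"
    NEEDS_ACKNOWLEDGEMENT = "needs acknowledgement from the controlling body"
    DISPUTED = "should be free, but access is denied"
    FREE_BUT_WITHHELD = "clearly free, but the owners ignore it"
    NOT_ACCESSIBLE = "clearly not accessible legally"
    NEGOTIABLE = "not yet accessible, but might be opened up"


@dataclass
class DatasetRecord:
    # Phase 1: registration, using whatever is easily known.
    name: str
    provider: str
    access: str = ""
    pricing: str = ""
    # Phase 2: status updates, filled in as they become known.
    licence: Optional[str] = None
    legal_background: str = ""
    controlling_body: Optional[str] = None
    copyright_claimants: List[str] = field(default_factory=list)
    # Phase 3: categorisation, plus the grounds given for any denial.
    status: Optional[Status] = None
    status_note: str = ""
```

A record starts out with only a name and a provider, and the remaining
fields stay empty until somebody researches them.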

I also had problems browsing data on ckan.net, even without checking
the licences :) This is not specific to CKAN, though; it is a common
content browsing problem for sites with growing datasets.

The problem is that when you try to get an overview of the site, I mean
when you try to find out what kind of data is there that could be
interesting for you, you have no chance with unstructured metadata (the
groups are fine, but above a certain amount of data they could be
really hard to maintain), and with 2000+ tags in alphabetical order it
is not so easy to get started either.

Anyway, what I would like to share with you is a simple tag cloud and
the algorithm behind it. It appears on http://kereso.nda.hu/ after you
run a search (the site is in Hungarian).

It takes thousands of keywords (tags) and of course it knows their
relations to the source articles, so it builds up a tag cloud from
them. The cloud not only shows the frequency of a tag in the dataset;
the great idea is that it also filters out the tags that occur together
too often in the contents and places the less frequent tags below a
chosen one. The chosen one is the tag that is found together with other
tags most often. For example, if the algorithm finds "literature" and
"poem" together in one item, and later "literature" and "novel"
together in another, then it will filter out, or more precisely put,
"novel" and "poem" under the tag level of "literature".

So in the end, on the top level you will see only a few tags that are
the least related to each other but still describe the whole dataset
very well. Furthermore, if you click a tag you go one level deeper in
the tag cloud (well, it could be called a menu for better
understanding) to find tags that are most closely related to the
selected one but are again only weakly related to each other. And so
on... As you can see, you won't need too many levels, not even for
large datasets.
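
To illustrate the idea, here is a minimal Python sketch. I do not know
how the kereso.nda.hu code actually works, so this is only a guess at
the approach: a greedy selection where "related" simply means "appears
on the same item at least once". The function names are made up for
this example.

```python
from collections import defaultdict


def cooccurrence(docs):
    """Map each tag to the set of other tags it shares an item with."""
    neighbours = defaultdict(set)
    for tags in docs:
        for tag in tags:
            neighbours[tag] |= set(tags) - {tag}
    return neighbours


def top_level(docs):
    """Greedily pick tags that are not related to each other.

    Assumption: the "chosen" tag is the one with the most co-occurring
    tags among those still unplaced; ties are broken alphabetically.
    """
    neighbours = cooccurrence(docs)
    remaining = set(neighbours)
    chosen = []
    while remaining:
        best = min(remaining,
                   key=lambda t: (-len(neighbours[t] & remaining), t))
        chosen.append(best)
        # Everything seen together with the chosen tag goes below it.
        remaining -= neighbours[best] | {best}
    return chosen


def drill_down(docs, selected):
    """One level deeper: repeat the selection on items carrying `selected`."""
    narrowed = [set(tags) - {selected} for tags in docs if selected in tags]
    return top_level([tags for tags in narrowed if tags])
```

With the items {"literature", "poem"} and {"literature", "novel"}, the
top level contains only "literature", and clicking it (drill_down)
brings up "novel" and "poem". A real implementation would probably use
co-occurrence counts with a threshold instead of the yes/no relation
used here.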

Shame on me, but I cannot recall the person who showed me this, so I
cannot give proper credit. I remember that Kangyal Andris was there too
when we talked about it, so I can ask him if anyone is interested in an
implementation or needs further info.

And I absolutely agree, it is really funny to talk about this in English :)

Cheers,
Miki



