[open-linguistics] How to represent LLOD diagram categories at datahub ?

Sat Oct 12 17:02:51 UTC 2013

Dear Bettina,

thank you very much for your initiative, and in particular, your offer to  
actually work your way through the current data sets. If you can attend  
the telco on Monday, it would be great to see where this goes, and maybe  
come to a joint recommendation with respect to a set of categories that  
you can start working with.

I like your classification, and the argument of helping linguists to find  
resources in the LLOD cloud should certainly be the primary guideline of  
our decisions. But could I ask you to add it to  
http://wiki.okfn.org/Llod-categories so that we all work on the same  
document?

Wrt. the classes themselves, I think that raw data is not really necessary
for the LLOD cloud (though it very well is for the Working Group) unless
anyone suggests actually putting raw data on the cloud. Using, say, NIF,
that would be possible, but without actual annotations (=> corpus) or
other additional information (=> database?), we would not gain anything
by representing the data in RDF, would we?

With respect to databases, we should probably refer to them as "linguistic
databases". Otherwise, bibliographical data and general knowledge bases
(think of DBpedia) would be covered by the term, too.

With respect to "Meta Data", I would like to distinguish bibliographical  
data (as all your subcategories seem to be) from terminology repositories  
(your "linguistic topics", say, lexvo, ISOcat). This is partly because  
there is a large amount of bibliography data (which we assume to be  
linguistic by nature, even if not created by linguists, as it is about  
language in any case) available, but also because the function is very  
different: An entry in a bibliographical database provides a pointer to
a resource, whereas a terminology repository provides identifiers that
other resources may refer to. As I understand it, Glottolog combines both
aspects, but the bibliographical component is also referred to  
individually as Langdoc, right?

In general, I'm a little bit pessimistic that we will ever arrive at truly
shared definitions, but I'm not sure whether we actually have to. We can
simply provide an informal description and an extensional definition
consisting of selected example resources for each group, maybe ranked
by popularity (i.e., the number of resources linked with each). In this
way, there won't be terminological warfare, and if you would like
to see a particular resource as a defining example, then just add more
links to it! If all of this is generated from tags (and not a fixed
attribute), then it is also easy to redefine the visualization by choosing
a different set of tags.
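
A minimal sketch of how such tag-based groups could be computed, in Python.
Everything here is an assumption for illustration: the record layout, the
function name examples_by_tag, the tags, and the link counts are all
invented and not taken from datahub.

```python
from collections import defaultdict

# Hypothetical records: (name, tags, number of resources linked with it).
# Tags and link counts are invented for illustration only.
DATASETS = [
    ("DBpedia", {"lexicon"}, 12),
    ("lexvo", {"metadata"}, 7),
    ("ISOcat", {"metadata"}, 4),
    ("Glottolog", {"metadata"}, 5),
]


def examples_by_tag(datasets, visible_tags, top_n=3):
    """Group data sets by tag and rank each group by link count.

    Only tags in `visible_tags` produce a group, so choosing a different
    tag set redefines the categories without touching the data.
    """
    groups = defaultdict(list)
    for name, tags, links in datasets:
        for tag in tags & visible_tags:
            groups[tag].append((links, name))
    ranked = {}
    for tag, members in groups.items():
        # Most-linked first; ties are broken alphabetically by name.
        members.sort(key=lambda m: (-m[0], m[1]))
        ranked[tag] = [name for _, name in members[:top_n]]
    return ranked
```

Calling examples_by_tag(DATASETS, {"metadata"}) would list the most-linked
"metadata" resources first; passing a different tag set yields a different
set of groups from the same data.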

The reason here is that different communities are involved, with very  
different intuitions, but similar terms. And we need to stick to popular  
and apparently intuitive terms such as "lexicon" and "corpus" as a means  
of communication with linguists and to promote our work. That's also the  
motivation to adopt existing terminologies to the extent possible. But any  
classification arising out of this will be deficient, in any case, so it
can approximate the linguistic or technical reality of terms only to a
very limited extent.
As an example, I would never have thought of DBpedia as being a "lexicon"  
in any strict sense; any lexicographer would immediately refuse this idea,  
but it is used like one in NLP: URIs provide "sense" identifiers and each  
sense identifier is coupled with (something like) a definition. In this  
way, it certainly is some kind of "lexical-semantic resource" and
"lexicon" was the shorthand we discussed using for these. I recently had
a conversation with a terminologist criticising my use of "terminology  
repository" for ISOcat for very similar reasons (it does provide  
information about terminology, but not the information and the level of  
detail a terminologist would expect; her notion of "terminology  
repository" would actually be an instance of "lexicon" as we used it for  
the September diagram). And she was right, of course. (I guess with the  
term "meta data" there are similar problems, but at the moment, this would  
be my preferred label for that category.)

But let's talk about it on Monday.

Best,
Christian

On Fri, 11 Oct 2013 22:37:08 +0200, Bettina Klimek  
<bettina.klimek at uni-leipzig.de> wrote:

> Dear all,
>
> I was thinking about how to categorize the data sets in the LLOD cloud
> as well. To me, a classification should be oriented toward the people who
> are particularly interested in the data: linguists. Therefore it makes
> sense to find categories which are broad enough that 5-6 category
> labels suffice to give a first holistic overview of the kinds of data
> in the cloud, and narrow enough to allow for an exhaustive and
> unambiguous classification. That way linguists can find the kind of
> data sets they are looking for at a glance, and one could avoid mixed
> categories, which make it hard to assign data sets to a certain
> category.
>
> I agree with everyone that the third category "language description" is
> too broad and includes, as Christian already mentioned, various kinds
> of data, which would create a somewhat fuzzy category. Besides,
> linguists have a different understanding of "language description", as
> Sebastian (N) pointed out. The idea of setting up definitions for the
> categories seems very useful. In the context of the LLOD cloud however,  
> I think that reusing already existing definitions is problematic,  
> because they might not serve the specific needs of the field of  
> linguistic data. Following Sebastian's (H) idea, establishing our own
> definitions would be a good way of creating a coherent and homogeneous
> classification of different linguistic data sets. This is what I tried
> to establish and what you can see on this Google document:  
> https://docs.google.com/document/d/1skUbkYlM5Y6UiettCj7-hImdKandl3TsqFDiVTRlthE/edit?usp=drive_web.
>
> The 5 categories I propose here are well known in the field of  
> linguistics and might be the kinds of data a linguist might like to work  
> with. I also introduced some subcategories here, because it is obvious  
> that there are many more kinds of data and not each data type can be  
> highlighted with a color in the cloud. I tried to solve this problem by  
> assuming that these 5 categories could be treated as default categories  
> to which each data set must be assigned at the highest level. The  
> subcategories here can be extended and adjusted to any data set  
> depending on the data it represents. Beyond that the subcategories are  
> also a means for the people who would like to contribute their data set  
> to the cloud, because they know best whether their data is a database or  
> a corpus. Establishing a second category layer under the default  
> categories will also lead to a finer grained classification. I do not  
> know if it is possible, but I can imagine that the subcategories are  
> visualized in a sub-cloud as well. That means in detail, if someone  
> clicks on the database category for example a new cloud will be opened  
> showing all data sets which include databases only and these bubbles  
> could be colored as well according to the subcategories "lexical
> database", "typological database" and so on. That way the data cloud
> stays really open, because the subcategory layer can be extended with  
> new types of data sets. At the same time the main cloud can keep the  
> default categories, because they include all the subcategories. From a
> linguist's point of view this seems really useful since a typologist for
> instance might not want to see all data sets containing databases but  
> all data sets including only typological databases.
>
> Up to now this is only a first draft and it will have to be adjusted.  
> Right now I am going through all data sets in the cloud to find out what  
> kinds of data exist and if this classification could work out. I am  
> happy to hear your opinions for improvement.
>
> With kind regards,
> Bettina

-- 
Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 10, #401b
mail: chiarcos at informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931