[open-linguistics] How to represent LLOD diagram categories at datahub ?

Bettina Klimek bettina.klimek at uni-leipzig.de
Fri Oct 11 20:37:08 UTC 2013


Dear all,

I was thinking about how to categorize the data sets in the LLOD cloud
as well. To me, a classification should be oriented towards the people
who are particularly interested in the data: linguists. Therefore it
makes sense to find categories that are broad enough to be covered by
5-6 category labels, giving a first holistic overview of the kinds of
data in the cloud, and narrow enough to allow for an exhaustive and
unambiguous classification. That way linguists can find the kind of
data sets they are looking for at a glance, and one avoids mixed
categories, which make it hard to assign a data set to a single
category.

I agree with everyone that the third category "language description"
is too broad and includes, as Christian already mentioned, various
kinds of data, which would create a somewhat fuzzy category. Besides,
linguists have a different understanding of "language description", as
Sebastian (N) pointed out. The idea of setting up definitions for the
categories seems very useful. In the context of the LLOD cloud,
however, I think that reusing already existing definitions is
problematic, because they might not serve the specific needs of the
field of linguistic data. Following Sebastian's (H) idea, establishing
our own definitions would be a good way of creating a coherent and
homogeneous classification of different linguistic data sets. This is
what I tried to establish and what you can see on this Google  
document:  
https://docs.google.com/document/d/1skUbkYlM5Y6UiettCj7-hImdKandl3TsqFDiVTRlthE/edit?usp=drive_web.

The 5 categories I propose here are well known in the field of
linguistics and are likely to be the kinds of data a linguist would
like to work with. I also introduced some subcategories, because there
are obviously many more kinds of data, and not every data type can be
highlighted with its own color in the cloud. I tried to solve this
problem by treating these 5 categories as default categories, to one
of which each data set must be assigned at the highest level. The
subcategories can be extended and adjusted to any data set, depending
on the data it represents. Beyond that, the subcategories are also an
aid for people who would like to contribute their data set to the
cloud, because they know best whether their data is a database or a
corpus. Establishing a second category layer under the default
categories will also lead to a finer-grained classification. I do not
know if it is possible, but I can imagine the subcategories being
visualized in a sub-cloud as well. In detail, this means that if
someone clicks on the database category, for example, a new cloud
opens showing only the data sets that include databases, and these
bubbles could be colored as well, according to the subcategories
"lexical database", "typological database" and so on. That way the
data cloud stays really open, because the subcategory layer can be
extended with new types of data sets. At the same time the main cloud
can keep the default categories, because they include all the
subcategories. From a linguist's point of view this seems really
useful, since a typologist, for instance, might not want to see all
data sets containing databases, but only those containing typological
databases.
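
To make the two-layer idea more concrete, here is a minimal sketch in
Python of how default categories, an extensible subcategory layer and
the "sub-cloud" filter could fit together. It is only an illustration
under assumptions: the function names and the data set names
("dataset-a" etc.) are hypothetical, and only the categories mentioned
in this message are used; the full list of five is in the Google
document.

```python
from collections import defaultdict

# Top-level ("default") categories. Only these two are named in this
# message; the remaining ones would come from the Google document.
DEFAULT_CATEGORIES = {"database", "corpus"}

# Subcategory -> default category. This layer is meant to stay open.
SUBCATEGORIES = {
    "lexical database": "database",
    "typological database": "database",
}


def add_subcategory(name, parent):
    """Register a new subcategory under an existing default category."""
    if parent not in DEFAULT_CATEGORIES:
        raise ValueError(f"unknown default category: {parent}")
    SUBCATEGORIES[name] = parent


def classify(datasets):
    """Group data set names by default category, then by subcategory."""
    cloud = defaultdict(lambda: defaultdict(list))
    for name, subcategory in datasets:
        cloud[SUBCATEGORIES[subcategory]][subcategory].append(name)
    return cloud


def sub_cloud(cloud, category, subcategory=None):
    """Return the data sets of one category, optionally of one subcategory."""
    groups = cloud.get(category, {})
    if subcategory is not None:
        return list(groups.get(subcategory, []))
    return [name for names in groups.values() for name in names]


# Hypothetical data sets, each labeled by its contributor with a subcategory.
datasets = [
    ("dataset-a", "lexical database"),
    ("dataset-b", "typological database"),
    ("dataset-c", "typological database"),
]
cloud = classify(datasets)

# Clicking "database" in the main cloud: every data set in that category.
all_databases = sub_cloud(cloud, "database")

# What a typologist would want: typological databases only.
typological = sub_cloud(cloud, "database", "typological database")
```

The point of the design is that the main cloud only ever needs the
default categories, while new subcategories can be registered later
with add_subcategory() without touching the top level.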

So far this is only a first draft and it will have to be adjusted.
Right now I am going through all data sets in the cloud to find out
what kinds of data exist and whether this classification could work. I
would be happy to hear your suggestions for improvement.

With kind regards,
Bettina


