[open-linguistics] How to represent LLOD diagram categories at datahub ?
Bettina Klimek
klimek at informatik.uni-leipzig.de
Fri Nov 22 10:54:32 UTC 2013
Dear all,
at the last telco I agreed to categorize all the data sets in the
datahub. In the telco we set up the following three categories:
Lexicon - based on data containing words
Corpus - based on data containing text
Linguistic Database - based on data containing languages
After I looked up every single data set in the datahub several
problems occurred:
1) some data sets seem to be wrong/spam
2) the source URLs in some data sets do not work or are missing
3) there seem to be fewer data sets at datahub than there are bubbles
in the cloud diagram
4) some data sets are not linked yet
Concerning these issues I would like to know if there is someone
managing and maintaining the correctness and completeness of the data
set entries at datahub? I think this is a matter of quality that
should not be ignored.
5) a categorization in terms of the three categories defined above
does not work for all the data sets
While going through all the data sets manifold questions concerning
the LLOD cloud as such occurred to me. The most important one is: what
is linguistic data? It is not defined anywhere on the website and the
data sets reveal all kinds of data which differ in their information
provided, purpose of representation, data types they supply to the
user as well as in the linguistic content. To me it seemed as if the
categorization of the data sets is taking the second step without the
first. The first step should be to define what linguistic data is
and, as a result, which kinds of data sets are expected to belong in
the cloud.
When thinking of the category labels I tried to find some that would
be broadly accepted by the users. However there seems to be no ideal
user. A user could be a linguist interested in human languages, a
computer linguist, a computer scientist or someone interested in NLP
(further kinds of users are conceivable). The data sets mirror this
diversity of users in the data they provide. I think it is
impossible to find category labels which can cover all the special
research areas linguistic data could be used for. That is why I tried
to establish categories everybody understands intuitively. Thus I have
assigned the data sets to the following 5-6 categories, which are
defined as follows:
LLOD data set categories:
1) Data representing words = Lexicon
2) Data representing texts = Corpus
3) Data representing information about languages = Language Database
4) Data derived from but conserving original data = Ontology
5) Data representing further knowledge about the data (such as
bibliographical references) = Meta Data
6) maybe: category for further relevant data sets not directly
concerned with linguistic data
As you can see I kept the first two categories because they are
straightforward and users intuitively know what a lexicon and a corpus
are. I established category #3 for those data sets dealing with
certain information on a large number of languages which cannot easily
be understood as lexicon or corpus. This holds for PHOIBLE and ASJP.
The former lists phonological segments that are below the word and
text level and which are provided for many languages. The latter is a
language classification based on lexical similarities (which are not
given though). Another case is glottolog, which is a language
classification as well but additionally provides relevant
bibliographical sources. Since these references are set up in a
distinct application (Langdoc), glottolog is a special case covering
two categories: Language Database and Meta Data. Another case, even
more diverse, is the Multext-East data set, which seems to cover the
categories Lexicon, Corpus and Ontology. We can debate whether a data
set should be assigned to two categories. Maybe it is better to
require everybody uploading a data set to select only one default
category.
Furthermore, the assignment of the data sets to the categories is based
on what a data set primarily provides. The ASIt, for instance, gives the
impression that the data set contains some kind of Italian corpora.
However, it rather presents the analysis of various corpora (which are
not provided as such!) in the form of rdf-schema. Data sets like these
are the reason why I set up category #4. They are problematic because
they are based on human language data but have been converted into a
machine-readable language, which completely changes the kind of user
interested in the data. The linguist would be interested in a
user-friendly, searchable language corpus, whereas the computer
linguist/scientist is only concerned with the rdf data, for instance.
To solve this problem I assigned data sets primarily providing human
language data (even if there is also an rdf version) to the categories
#1 to #3, and data sets concerned with machine language in any way
(using human language material only in order to do so) to category #4.
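As a rough sketch, the categories and the example assignments mentioned in this mail could be recorded as follows. This assumes Python; the names `Category`, `ASSIGNMENTS`, `default_category` and `multi_category_sets` are hypothetical, the ASIt entry is my reading of the paragraph above, and treating the first listed category as the default is only an assumption for illustration, not something that has been agreed.

```python
from enum import Enum


class Category(Enum):
    """The five proposed LLOD data set categories."""
    LEXICON = "Lexicon"                      # 1) data representing words
    CORPUS = "Corpus"                        # 2) data representing texts
    LANGUAGE_DATABASE = "Language Database"  # 3) information about languages
    ONTOLOGY = "Ontology"                    # 4) derived from but conserving original data
    META_DATA = "Meta Data"                  # 5) further knowledge about the data


# Example assignments taken from this mail only; the complete list is in
# the spreadsheet. The order within each list is an assumption.
ASSIGNMENTS = {
    "PHOIBLE": [Category.LANGUAGE_DATABASE],
    "ASJP": [Category.LANGUAGE_DATABASE],
    "glottolog": [Category.LANGUAGE_DATABASE, Category.META_DATA],
    "Multext-East": [Category.LEXICON, Category.CORPUS, Category.ONTOLOGY],
    "ASIt": [Category.ONTOLOGY],
}


def default_category(name):
    """Return one default category per data set (assumed: the first listed)."""
    return ASSIGNMENTS[name][0]


def multi_category_sets():
    """Return the names of data sets assigned to more than one category."""
    return sorted(name for name, cats in ASSIGNMENTS.items() if len(cats) > 1)
```

Calling `multi_category_sets()` lists exactly the data sets for which the question of one versus several categories would have to be decided.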
I think it is essential to put these definitions into the cloud (maybe
as tooltips), because the Meta Data category for instance will be
associated with different information depending on whether a user is
concerned with human language or with computer languages.
My classification of the data sets can be seen at:
https://docs.google.com/spreadsheet/ccc?key=0AkVaxylrRsewdGwtckloTlYyZ25iY2duTkQwdlZBaHc&usp=sharing
There you can also see some "unsure data sets" such as DBpedia. I
think these correspond to the "resources used to assist and
augment language processing applications, even if the nature of the
resource is not deeply entrenched in Linguistics, but only as long as
the usefulness is well motivated" as explained at:
http://www.semantic-web-journal.net/blog/call-multilingual-linked-open-data-mlod-2012-data-post-proceedings
I couldn't come up with an appropriate category label for them yet
(maybe "indirectly relevant resources"?!). And I am not sure if they
should be in the cloud at all, because the data sets are already so
heterogeneous that adding data sets that are not primarily linguistic
might cause even more confusion for users, and maybe also frustration
when they do not find only linguistic data. Keeping the LLOD cloud as
a pure linguistic data cloud and providing a possibility to link to
the LOD cloud (which already contains the "unsure" data sets) could be
a practical option here.
This is just another proposal for a possible classification. I think
it is efficient enough to also cover data sets which are to come in
the future. I know it is very broad and not very specific, but since I
am of the opinion that further sub-classifications need to be
developed with regard to the different linguistic user types, we can
work on that more carefully later. As a first impression of the cloud
the 5-6 categories I proposed here should be sufficient to get an
overview of the main cloud content.
Looking forward to your comments,
Bettina