[open-linguistics] LLOD cloud categories

Fri Mar 28 21:13:29 UTC 2014

As mentioned in the last email, Bettina has summarized our discussions and
some feedback from LIDER project members and developed a small ontology of
linguistic categories.

Personally, I think it reflects relatively faithfully what we discussed
before. The main difference from the classification in the
current diagram (LEXICON, LANGUAGE_DESCRIPTION, CORPUS; see also
http://wiki.okfn.org/Llod-categories) is that the "language description"
group is split into "linguistic (language) data bases" and "ontology",
and that we now have an explicit "other" category.

I would, however, suggest replacing the label "ontology" with "linguistic
vocabulary" (i.e. a vocabulary of linguistically relevant terms). This is
because most of our resources (and practically every lexical-semantic
resource) are ontologies in a technical sense.

Furthermore, "Linguistic Data Category" should be labelled "Linguistic
Resource Type".

Beyond these minor adjustments, I see three potential problems with this
classification:

- The LEXICON group has grown disproportionately large since the last
diagram. We arrive at a more balanced picture if general knowledge bases
(DBpedia, Yago, Freebase -- unlike lexicons, they do not provide
grammatical, i.e., linguistic information in a strict sense) are singled
out as, say, "semantic knowledge bases". This would also settle our
controversy as to whether these resources are actually linguistic in
nature (they are certainly linguistically/NLP-relevant).

- The diagram contains bibliographical DBs as linguistically relevant data
sets. They cannot be assigned to any category other than "other", but
should probably receive a more consistent treatment. Formerly, they were
classified as "language description" (because they describe where to
locate language data).

- Splitting the old LANGUAGE_DESCRIPTION (which was relatively small in
the first place) into three sub-categories results in tiny clusters and
thereby marginalizes non-lexical data sets. From a presentational point
of view, this is clearly not desirable.

Any thoughts?

Best,
Christian
-- 
Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 10, #401b
mail: chiarcos at informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931