[open-linguistics] Categories for data in the LLOD Cloud Diagram
christian.chiarcos at web.de
Fri Jul 12 13:17:58 UTC 2013
> This seems to be a pretty straightforward kind of taxonomy. I like
> straightforward.
> I would encourage a path forward where this sort of taxonomy development
> would push on the OLAC community to revise or revisit the type
> declaration recommendation for linguistic resources. You can see at the
> link below there are only three resources.
> http://www.language-archives.org/REC/type.html Perhaps these types as
> defined by OLAC are not sufficient for the purposes here, but rather
> than creating another taxonomy, why not alter one already in existence so
> that many of the resources already curated in many meta-data systems can
> immediately be typed?
Sure, at least both categorizations should be put in relation to each
other, even though their goals and scope are quite different. The idea here
is primarily to find an organization for the LLOD cloud diagram (unless
anyone proposes to go beyond that, at least). For example, we also have to
consider that this will be translated into a graph coloring of the LLOD
diagram, so the number of groups should be small, and about equally sized.
Furthermore, this only applies to potential LLOD resources, usually with a
complex and formal structure, whereas OLAC also includes publications that
are designed to be human-readable rather than machine-readable (say,
books). So, it's a little like apples and pies.
Nevertheless, some ideas about aligning OLAC and the proposed categories:
~ Terminology and lexicon resources (tag: lexical)
[we should probably adopt the OLAC term]
~ Translation Memories and Bitext (tag: bitext)
~ Annotated Corpora (tag: annotated-corpus)
~ Multimodal resources (tag: multimodal-corpus)
[in the categorization we had in May last year,
all of these were grouped together as "corpus",
but I would advise not to adopt the OLAC term
"primary_text", because here we only talk
about text plus annotations or alignments
(plain text is out of our scope), and
the primary data is not necessarily textual]
~ Typological Databases (tag: typological)
[considering something like WALS as a
prototype, this is a very rough equivalence,
but actually, "language_description" may be
a more appropriate term?]
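To make the rough alignment above a bit more tangible, it could be written
down as a small lookup table. This is only a sketch: the Python names
(LLOD_TO_OLAC, olac_type, tags_for) are made up for illustration, and the
OLAC side gives the nearest existing type rather than a true equivalence
(in particular for the corpus-like tags, where I would not adopt
"primary_text" as a term).

```python
# Hypothetical alignment table: proposed LLOD tag -> nearest OLAC linguistic type.
# The values are approximate matches only, not claimed equivalences.
LLOD_TO_OLAC = {
    "lexical": "lexicon",
    "bitext": "primary_text",
    "annotated-corpus": "primary_text",
    "multimodal-corpus": "primary_text",
    "typological": "language_description",
}


def olac_type(tag):
    """Return the nearest OLAC type for an LLOD tag, or None if none is listed."""
    return LLOD_TO_OLAC.get(tag)


def tags_for(olac):
    """Return all LLOD tags that collapse onto the given OLAC type, sorted."""
    return sorted(t for t, o in LLOD_TO_OLAC.items() if o == olac)


if __name__ == "__main__":
    for tag in LLOD_TO_OLAC:
        print(tag, "->", olac_type(tag))
    print("primary_text covers:", tags_for("primary_text"))
```

Read in the other direction, the table shows why the OLAC types are too
coarse for a diagram coloring: three of the five tags collapse onto a
single OLAC type.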
Of course, the OLAC metadata specifications
(http://www.language-archives.org/OLAC/metadata.html) themselves could be
an instance of this, but I don't think that "Metadata and linguistic
categories (tag: linguistic-metadata)" fits any of the categories. In
particular, it is not language_description. Rather, linguistic-metadata
resources provide the vocabulary that can be used (within another
resource) to express linguistic information. The actual description is
*specified by the linking* and hence "between" LLOD bubbles rather than
"within" any of them.
> I have two questions, which if someone can confirm or deny my
> assertions, would help me determine if I am understanding the proposed
> True :: A published journal paper discussing a grammatical feature of a
> minority language would be typed as a bitext.
This category is primarily intended for translation memories, i.e.,
parallel text (usually without annotations). I am not sure whether it
makes sense to represent linguistic papers as LLOD resources (i.e.,
modelling them completely in RDF[a] rather than using a simple TEI-based
XML format). Of course, we can embed RDFa metadata (from any
of the categories above) within HTML, but I wouldn't try to apply our
classification to papers.
> True :: A website where one could look up characters used in any
> orthography in the world would be typed as typological.
I have no strong intuitions with respect to this. Maybe this could also
be seen as a phonological/orthographical lexicon, in particular, if it
contains additional information about the character (like, e.g.,
http://de.wiktionary.org/wiki/%F0%92%80%AD). Any ideas from the
typologists? Should we rename "typological" to "language description" to
make that clearer?
Just my 5 cents.
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany
office: Robert-Mayer-Str. 10, #401b
mail: chiarcos at informatik.uni-frankfurt.de