[open-linguistics] Categories for data in the LLOD Cloud Diagram

Christian Chiarcos christian.chiarcos at web.de
Fri Jul 12 13:17:58 UTC 2013

> This seems to be a pretty straightforward kind of taxonomy. I like  
> straightforward.


> I would encourage a path forward where this sort of taxonomy development  
> would push on the OLAC community to revise or revisit the type  
> declaration recommendation for linguistic resources. You can see at the  
> link below there are only three resources.  
> http://www.language-archives.org/REC/type.html Perhaps these types as  
> defined by OLAC are not sufficient for the purposes here, but rather  
> than creating another taxonomy, why not alter one already in existence so  
> that many of the resources already curated in many meta-data systems can  
> immediately be typed?

Sure, at least both categorizations should be put in relation to each  
other, even though their goals and scopes are quite different. The idea here  
is primarily to find an organization for the LLOD cloud diagram (unless  
anyone proposes to go beyond that, at least). For example, we also have to  
consider that this will be translated into a graph coloring of the LLOD  
diagram, so the number of groups should be small, and about equally sized.  
Furthermore, this only applies to potential LLOD resources, usually with a  
complex and formal structure, whereas OLAC also includes publications that  
are designed to be human-readable rather than machine-readable (say,  
books). So, it's a little like apples and pies.

Nevertheless, some ideas about aligning OLAC and the proposed categories:

~ Terminology and lexicon resources (tag: lexical)
	[we should probably adopt the OLAC term, "lexicon"]

~ Translation Memories and Bitext (tag: bitext)
~ Annotated Corpora (tag: annotated-corpus)
~ Multimodal resources (tag: multimodal-corpus)
	[in the categorization we had in May last year,
	all of these were grouped together as "corpus",
	but I would advise not to adopt the OLAC term
	"primary_text", because here we only talk
	about text plus annotations or alignments
	(plain text is out of our scope), and
	the primary data is not necessarily textual]

~ Typological Databases (tag: typological)
	[considering something like WALS as a
	prototype, this is a very rough equivalence,
	but actually, "language_description" may be
	a more appropriate term?]

Of course, the OLAC metadata specifications  
(http://www.language-archives.org/OLAC/metadata.html) themselves could be  
an instance of "Metadata and linguistic categories (tag:  
linguistic-metadata)", but I don't think that this category fits any of  
the OLAC types. In particular, it is not language_description. Rather,  
linguistic-metadata resources provide the vocabulary that can be used  
(within another resource) to express linguistic information. The actual  
description is *specified by the linking* and hence lies "between" LLOD  
bubbles rather than "within" any of them.
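
To make the alignment above easier to check, here is a rough sketch of it as a  
lookup table. The tags are the ones proposed in this thread and the OLAC type  
names are taken from the OLAC type recommendation; the dictionary layout and  
the helper names (`olac_type`, `group_by_olac`) are only illustrative  
assumptions, not an agreed schema. Note that "primary_text" appears merely as  
the nearest OLAC type for the corpus-like tags, not as a term to adopt.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Proposed LLOD tag -> nearest OLAC linguistic type (None = no good fit).
# Hypothetical layout for illustration; the equivalences are rough.
LLOD_TO_OLAC: Dict[str, Optional[str]] = {
    "lexical": "lexicon",
    "bitext": "primary_text",             # nearest type only, term not adopted
    "annotated-corpus": "primary_text",   # nearest type only, term not adopted
    "multimodal-corpus": "primary_text",  # primary data need not be textual
    "typological": "language_description",
    "linguistic-metadata": None,          # fits none of the OLAC types
}


def olac_type(tag: str) -> Optional[str]:
    """Return the nearest OLAC type for a proposed LLOD tag, or None."""
    if tag not in LLOD_TO_OLAC:
        raise KeyError("unknown LLOD tag: " + tag)
    return LLOD_TO_OLAC[tag]


def group_by_olac() -> Dict[str, List[str]]:
    """Invert the table: OLAC type -> LLOD tags mapped onto it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for tag, olac in LLOD_TO_OLAC.items():
        if olac is not None:
            groups[olac].append(tag)
    return dict(groups)


if __name__ == "__main__":
    for olac, tags in group_by_olac().items():
        print(olac, "<-", ", ".join(tags))
```

Grouping by OLAC type shows why the two schemes diverge: three of the  
proposed tags collapse onto a single OLAC type, and one has no counterpart.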

> I have two questions, which if someone can confirm or deny my  
> assertions, would help me determine if I am understanding the proposed  
> taxonomy.
> True :: A published journal paper discussing a grammatical feature of a  
> minority language would be typed as a bitext.

This category is primarily intended for translation memories, i.e.,  
parallel text (usually without annotations). I am not sure whether it  
makes sense to represent linguistic papers as LLOD resources (i.e.,  
modelling them completely in RDF[a] rather than using a simple TEI-based  
XML format). Of course, we can embed RDFa metadata (from any  
of the categories above) within HTML, but I wouldn't try to apply our  
classification to papers.

> True :: A website where one could look up characters used in any  
> orthography in the world would be typed as typological.

I have no strong intuitions with respect to this. Maybe this could also  
be seen as a phonological/orthographical lexicon, in particular if it  
contains additional information about the character (like, e.g.,  
http://de.wiktionary.org/wiki/%F0%92%80%AD). Any ideas from the  
typologists? Should we rename "typological" to "language description" to  
make that clearer?

Just my 5 cents.

