[open-linguistics] How to represent LLOD diagram categories at datahub ?

Mon Nov 25 18:14:51 UTC 2013

Dear Bettina, dear John,

first of all, thanks to Bettina for her review of the data sets. I  
probably missed a part of the conversation, but please find my two cents  
below:

>>     LLOD data set categories:
>>     1) Data representing words = Lexicon
>>     2) Data representing texts = Corpus
>>     3) Data representing information about languages = Language Database
>>     4) Data derived from but conserving original data = Ontology
>>
>> How is this an "ontology"? I would call it something like
>> "statistical data"
>
...
>
>>     5) Data representing further knowledge about the data (such as
>>     bibliographical references) = Meta Data
>>     6) maybe: category for further relevant data sets not directly
>>     concerned with linguistic data

On #1: Given the fact that we now have FrameNet (through uby), and that  
some of our lexicons do not really deal with words, but more with  
concepts, we should probably extend the definition to information beyond  
words:

#1 "information about words and semantic concepts"

On #2 vs. #4: I feel that an LLOD definition of "corpus" should allow for  
a standoff solution, i.e., a corpus where the annotations are provided,  
but the source data isn't (this is to be expected as the ultimate goal of  
a developmental trajectory that we see manifested in generalized formats  
such as GrAF). Actually, NIF explicitly allows us to do so. But with only  
annotations being in RDF and the original text in, say, HTML or TEI,  
possibly with some RDFa attributes, the linked open data component does  
not really *represent* the text, but only the linguistic structure of a  
text. These should remain valid corpora, but according to your definition,
they would become #4, because they may conserve parts of the original data
(e.g., lemma annotation). So, I would suggest rephrasing #2 as

#2 "information about texts and their linguistic organization"

On #3 vs. #4: A similar problem persists with respect to ASJP: This would  
be derived information, hence #4? But what does derivation actually mean  
here?

On #4 vs. #6: Honestly, I'm not too happy about #4, neither with the name
nor with the definition. Maybe rename the category to "other" for the time
being? Unsatisfying, of course, but possibly the most achievable
compromise. The above-mentioned problems would fade away, and we would
avoid the slightly unsatisfying definition of the "non-linguistic"
category #6.

On #3 vs. #5: If we single out information about language as #3 (which I  
would support, if we have enough datasets), then we should also  
distinguish purely bibliographical information from linguistically  
relevant information. Bibliographical datasets are included, even if not
concerned with linguistic literature, because they point to possible
sources of linguistic information, but they also represent an old, rich
and growing group of LOD data sets which are clearly distinct from
everything else we're dealing with.

As a minor remark on Glottolog, I would clearly see it more on the side of
#5 rather than #3, but eventually, this should be up to the data provider
to decide (I would suggest, at least). So, any ideas on this, Sebastian?

To sum up (with new numbers, to avoid confusion):

a) LEXICON: "information about words and semantic concepts" (=#1)
b) CORPUS: "information about texts and their linguistic organization"  
(=#2)
    (footnote: "includes text fragments, e.g., to account for corpora  
composed out of scrambled, isolated sentences")
c) LANGUAGE_DATABASE: "information about languages" (= #3)
d) REFERENCE: "information about linguistic data sources" (mostly  
bibliography, but we can include the LREMap and tool repositories, here,  
too) (bibliographical part of #5)
e) METADATA: "information about the description of linguistic data"  
(non-bibliographical part of #5)
   (this is the classical definition of metadata, and it fits vocabularies,  
schemes and terminology repositories)
f) OTHER: "information being derived from linguistic data or linguistically
relevant datasets not directly containing linguistic data" (= #6+#4)
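
A minimal sketch of how the proposed categories and their relation to the
old numbering could be written down in machine-readable form. The names
`Category`, `OLD_TO_NEW` and `new_categories` are made up for illustration
and are not part of any datahub or LLOD tooling; the definitions are copied
from the list above.

```python
from enum import Enum


class Category(Enum):
    """Proposed LLOD diagram categories (a-f) with their definitions."""
    LEXICON = "information about words and semantic concepts"
    CORPUS = "information about texts and their linguistic organization"
    LANGUAGE_DATABASE = "information about languages"
    REFERENCE = "information about linguistic data sources"
    METADATA = "information about the description of linguistic data"
    OTHER = ("information being derived from linguistic data or "
             "linguistically relevant datasets not directly containing "
             "linguistic data")


# Old category numbers (#1-#6) mapped to the new categories.
# #5 is split into a bibliographical and a non-bibliographical part;
# #4 and #6 are merged into OTHER.
OLD_TO_NEW = {
    1: [Category.LEXICON],
    2: [Category.CORPUS],
    3: [Category.LANGUAGE_DATABASE],
    4: [Category.OTHER],
    5: [Category.REFERENCE, Category.METADATA],
    6: [Category.OTHER],
}


def new_categories(old_number):
    """Return the new categories that an old category number maps to."""
    return OLD_TO_NEW[old_number]
```

For a data set previously filed under #5, `new_categories(5)` returns both
REFERENCE and METADATA, so someone still has to decide which part applies.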

>> ISOcat: This is either metadata or language database (I am not sure of
>> the distinction), it is certainly not a lexicon
>
> since the entries are words it looked like a lexicon to me. I don't
> think it's a language database because it contains no languages. If
> others disagree on the lexicon category I would go for meta data (but I
> would answer on your category question "Could this resource be
> (directly) reduced to a list of terms? => Lexicon" with "yes" as well
> and therefore stay in the lexicon category)

Clearly metadata. Indeed, it is something like a lexicon, too, because it
provides definitions for "words", but we use it in a very different way,
and these aren't just any words, but words that define the semantics of,
say, annotations.

>> Rosetta Project: This is more a lexicon than a language database
>
> I would answer your "Could this resource be (directly) reduced to a list
> of terms? => Lexicon" clearly with no. I found on rosetta.org texts and
> language classification

That's a nice test, actually. Considering the Freebase data, it seems to
be something like Glottolog, providing language identifiers and document
IDs (not strictly speaking bibliographical, though). I presume that the
intended function is actually to provide IDs for the Rosetta resources,
so, with the slightly broadened definition of REFERENCE, it could be put
in there.

>> SIMPLE, lemonWordnet, lemonWiktionary, lemonUby: These are lexica and
>> the URLs are clearly stated in the datahub source
>
> to me an RDF conversion of a lexicon is no lexicon anymore

Why's that?

> I am still unsure with Multext-East, but Alpino and Semantic Quran are
> again only RDF versions, hence no corpora. The original RDF underlying
> data might have been corpora but an RDF version is no corpus (to me)

Well, using RDF to represent full corpora is advisable under certain
circumstances, e.g., if the annotations are too complex to be processed
using more established technologies building on, say, tab-separated text,
plain lists (e.g., Penn Treebank: Syntax) or XML (e.g., TIGER). However,
encoding the full primary data in RDF causes a lot of overhead (we need to
specify precedence explicitly, because this information is *lost* in the
RDF data model), so in many situations, a hybrid solution with the primary
data represented in a more conventional format (say, a text file) and the
annotations represented in RDF may be the most efficient modeling. For
certain types of primary data (e.g., audio streams), encoding the primary
data in RDF may not even be possible.
In either case, the dataset still represents a corpus if the RDF contains
pointers to the primary data and the primary data can be retrieved (even
if not in RDF itself).
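
To illustrate the hybrid setup, here is a minimal sketch in plain Python,
with triples as tuples instead of a real RDF library. The primary text
lives outside the triples, and the annotations only point into it via
character offsets. All identifiers and property names (`ex:tok1`,
`ex:begin`, `ex:end`, `ex:lemma`, `ex:source`) are hypothetical and only
loosely inspired by offset-based schemes such as NIF.

```python
# Primary data stays outside RDF; this string stands in for a text file.
PRIMARY_TEXT = "The cat sleeps."

# Standoff annotations as (subject, predicate, object) triples.
# All identifiers and property names are hypothetical.
# The triples are deliberately listed out of text order: a set of
# triples has no inherent sequence.
TRIPLES = [
    ("ex:tok2", "ex:source", "ex:doc1"),
    ("ex:tok2", "ex:begin", 8),
    ("ex:tok2", "ex:end", 14),
    ("ex:tok2", "ex:lemma", "sleep"),
    ("ex:tok1", "ex:source", "ex:doc1"),
    ("ex:tok1", "ex:begin", 4),
    ("ex:tok1", "ex:end", 7),
    ("ex:tok1", "ex:lemma", "cat"),
]


def value(subject, predicate):
    """Return the first object stored for (subject, predicate)."""
    for s, p, o in TRIPLES:
        if s == subject and p == predicate:
            return o
    raise KeyError((subject, predicate))


def anchor_text(subject, text=PRIMARY_TEXT):
    """Retrieve the span of primary text an annotation points to."""
    return text[value(subject, "ex:begin"):value(subject, "ex:end")]


def tokens_in_order():
    """Recover token precedence from offsets into the primary data."""
    subjects = {s for s, p, _ in TRIPLES if p == "ex:begin"}
    return sorted(subjects, key=lambda s: value(s, "ex:begin"))
```

Here `anchor_text("ex:tok1")` looks up the annotated span in the external
text, and `tokens_in_order()` derives the token sequence from the offsets,
so precedence does not have to be stated as separate triples. This only
works as long as the primary data can actually be retrieved.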

>> WikiWord, WordNet 3.0, LODAC BDLS: These are also lexica, surely?
>
> WikiWord is only a tool to build a lexicon but no lexicon as such

But the WikiWord Thesaurus is: http://datahub.io/dataset/wikiword_thesaurus

We could create another category "TOOL" whose instances are *not* put in  
the diagram.

>> OLiA, lingvoj, LexInfo: These, much as ISOcat, provide categories for
>> linguistic annotation, these resources should all be in the same category
>
> unlike ISOcat these data sets contain no single word entry, again the
> user gets ontologies here and no lexicons (your first two questions for
> the lexicon category below have to be negated here)

I would classify all as METADATA.

>>     Keeping the
>>     LLOD cloud as a pure linguistic data cloud and providing a
>>     possibility to link to the LOD cloud (which already contains the
>>     "unsure" data sets) could be a practical option here.

I would personally advise against purism at this point, as we're still in  
the early stages, but the discussion should be encouraged. Would any of  
the DBpedia people like to explain why they put it here in the first place?

Best,
Christian
-- 
Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 10, #401b
mail: chiarcos at informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931