[open-linguistics] How to represent LLOD diagram categories at datahub ?

Christian Chiarcos christian.chiarcos at web.de
Mon Nov 25 18:38:41 UTC 2013


Dear all,

I very much like John's question-based approach. A very linguistic methodology,
actually. Some ideas for elaborating it below. A category is applicable if
any of its questions applies.

a) LEXICON: "information about words and semantic concepts"
    Q: Is the resource organized around individual words and their meaning?
Could I expect to find an entry (page) for a single lemma (e.g., "cat")?
    Q: Does the resource provide identifiers for terms? (Could this
resource be reduced to a list of terms?)

b) CORPUS: "information about texts and their linguistic organization"
    (footnote: "includes text fragments, e.g., to account for corpora
composed out of scrambled, isolated sentences")
    Q: Does this resource consist primarily of sentences?
    Q: Does this resource include annotations applied directly to sentences?

c) LANGUAGE_DATABASE: "information about languages"
    Q: Does this resource describe inventories of linguistic features for
entire languages?
    Q: Does this resource describe languages (rather than words in those
languages)?

d) REFERENCE: "information about linguistic data sources" (bibliographical
part of #5)
    Q: Does this resource provide information on where to find linguistically
relevant resources (e.g., text data)?
    Q: Does this resource provide information on how to identify
linguistically relevant resources?

e) METADATA: "information about the description of linguistic data"
(non-bibliographical part of #5)
   (this is the classical definition of metadata, and it fits vocabularies,
schemes and terminology repositories)
    Q: Does this resource define which labels to use to describe linguistic
resources?

f) OTHER "information being derived from linguistic data or linguistically  
relevant datasets not directly containing linguistic data" (= #6+#4)
    Q: Does none of the other categories apply, despite a consensus to
consider this a relevant resource? (Can the latter be confirmed by asking
the people on the mailing list?)

I still do think that the categories do not need to be mutually exclusive!  
We only need to provide a preference ranking for the visualization, say:

LEXICON > CORPUS > LANGUAGE_DATABASE > REFERENCE > METADATA > OTHER

(or any other ranking)
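
As a side note, here is a minimal sketch (in Python; the function and the
example set are my own illustration, not part of the proposal) of how such a
ranking could resolve several applicable categories to the single one shown
in the diagram:

RANKING = ["LEXICON", "CORPUS", "LANGUAGE_DATABASE", "REFERENCE", "METADATA", "OTHER"]

def display_category(applicable):
    """Pick the highest-ranked of the categories that apply to a dataset."""
    for category in RANKING:
        if category in applicable:
            return category
    return "OTHER"  # fallback if none of the questions applied

# e.g., a dataset answering "yes" to both METADATA and LEXICON questions:
print(display_category({"METADATA", "LEXICON"}))  # -> LEXICON

Whatever concrete ranking we agree on, the point is only that the diagram
needs a deterministic tie-breaker; the categorization itself can stay
multi-valued.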

We still need to define what "linguistically relevant" means. A first,  
broad proposal:

"A resource is linguistically relevant if either
(a) it represents results or basis of research in linguistics,
(b) it can be used to develop tools and resources that facilitate  
linguistic analyses of natural language (NLP), or
(c) it is used to provide information about any of the resources in (a) or  
(b) or necessary pre-requisites to create them"

(a) applies to LEXICON, CORPUS, LANGUAGE_DATABASE, maybe METADATA and OTHER
(b) applies to LEXICON, CORPUS, METADATA, maybe REFERENCE and OTHER
(c) applies to REFERENCE, METADATA, LANGUAGE_DATABASE, maybe OTHER

Best,
Christian
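
PS: To make the standoff point in the quoted mail below (#2 vs. #4) a bit
more concrete, here is a minimal sketch in Python/rdflib of a corpus fragment
whose annotations live in RDF while the primary text stays outside of RDF.
All URIs are made up, and the NIF property names should be double-checked
against the NIF Core spec; this is only an illustration, not a recommendation
for a particular vocabulary.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

# NIF Core namespace (as published with NIF 2.0)
NIF = Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")

g = Graph()
g.bind("nif", NIF)

# The context points to external primary data instead of embedding the text.
ctx = URIRef("http://example.org/corpus/doc1#char=0,120")
g.add((ctx, RDF.type, NIF.Context))
g.add((ctx, NIF.sourceUrl, URIRef("http://example.org/corpus/doc1.txt")))

# One annotated token, anchored by character offsets into the primary data.
tok = URIRef("http://example.org/corpus/doc1#char=0,3")
g.add((tok, RDF.type, NIF.String))
g.add((tok, NIF.referenceContext, ctx))
g.add((tok, NIF.beginIndex, Literal(0, datatype=XSD.nonNegativeInteger)))
g.add((tok, NIF.endIndex, Literal(3, datatype=XSD.nonNegativeInteger)))
g.add((tok, NIF.anchorOf, Literal("The")))

print(g.serialize(format="turtle"))

Even though the text itself never enters the RDF graph, the dataset remains
a corpus in the sense of #2: the offsets and the source URL let anyone
retrieve the primary data.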

On Mon, 25 Nov 2013 19:14:51 +0100, Christian Chiarcos  
<christian.chiarcos at web.de> wrote:

> Dear Bettina, dear John,
>
> first of all, thanks to Bettina for her review of the data sets. I  
> probably missed a part of the conversation, but please find my two cents  
> below:
>
>>>     LLOD data set categories:
>>>     1) Data representing words = Lexicon
>>>     2) Data representing texts = Corpus
>>>     3) Data representing information about languages = Language  
>>> Database
>>>     4) Data derived from but conserving original data = Ontology
>>>
>>> How is this an "ontology"? I would call it something like
>>> "statistical data"
>>
> ...
>>
>>>     5) Data representing further knowledge about the data (such as
>>>     bibliographical references) = Meta Data
>>>     6) maybe: category for further relevant data sets not directly
>>>     concerned with linguistic data
>
> On #1: Given the fact that we now have FrameNet (through uby), and that  
> some of our lexicons do not really deal with words, but more with  
> concepts, we should probably extend the definition to information beyond  
> words:
>
> #1 "information about words and semantic concepts"
>
> On #2 vs. #4: I feel that an LLOD definition of "corpus" should allow
> for a standoff solution, i.e., a corpus where the annotations are
> provided, but the source data isn't (this is to be expected as the
> ultimate goal of a developmental trajectory that we see manifested in
> generalized formats such as GrAF). Actually, NIF explicitly allows us to
> do so. But with only the annotations being in RDF and the original text
> in, say, HTML or TEI, possibly with some RDFa attributes, the linked open
> data component does not really *represent* the text, but only the
> linguistic structure of a text. These should remain valid corpora, but
> according to your definition, they would become #4, because they may
> conserve parts of the original data (e.g., lemma annotation). So, I
> would suggest rephrasing #2 as
>
> #2 "information about texts and their linguistic organization"
>
> On #3 vs. #4: A similar problem arises with respect to ASJP: This
> would be derived information, hence #4? But what does derivation  
> actually mean here?
>
> On #4 vs. #6: Honestly, I'm not too happy about #4. Neither the name nor  
> the definition. Maybe rename the category to "other" for the time being?
> Dissatisfying, of course, but possibly the most achievable compromise.
> The above-mentioned problems would fade away, and we would avoid the  
> slightly dissatisfying definition of the "non-linguistic" category #6.
>
> On #3 vs. #5: If we single out information about languages as #3 (which I
> would support, if we have enough datasets), then we should also
> distinguish purely bibliographical information from linguistically
> relevant information. Bibliographical datasets are included even if they
> are not concerned with linguistic literature, because they point to
> possible data sources for linguistic information; they also represent an
> old, rich and growing group of LOD data sets which is clearly distinct
> from everything else we're dealing with.
>
> As a minor remark on glottolog, I would clearly see it more on the side
> of #5 rather than #3, but ultimately, this is for the data provider to
> decide (I would suggest, at least). So, any ideas on this, Sebastian?
>
> To sum up (with new numbers, to avoid confusion):
>
> a) LEXICON: "information about words and semantic concepts" (=#1)
> b) CORPUS: "information about texts and their linguistic organization"  
> (=#2)
>     (footnote: "includes text fragments, e.g., to account for corpora  
> composed out of scrambled, isolated sentences")
> c) LANGUAGE_DATABASE: "information about languages" (= #3)
> d) REFERENCE: "information about linguistic data sources" (mostly
> bibliography, but we can include the LREMap and tool repositories here,
> too) (bibliographical part of #5)
> e) METADATA: "information about the description of linguistic data"  
> (non-bibliographical part of #5)
>    (this is the classical definition of metadata, and it fits  
> vocabularies, schemes and terminology repositories)
> f) OTHER "information being derived from linguistic data or  
> linguistically relevant datasets not directly containing linguistic  
> data" (= #6+#4)
>
>>> ISOcat: This is either metadata or language database (I am not sure of
>>> the distinction); it is certainly not a lexicon
>>
>> since the entries are words it looked like a lexicon to me. I don't
>> think it's a language database because it contains no languages. If
>> others disagree on the lexicon category I would go for meta data (but I
>> would answer your category question "Could this resource be
>> (directly) reduced to a list of terms? => Lexicon" with "yes" as well,
>> and therefore stay in the lexicon category)
>
> Clearly metadata. Indeed, it is something like a lexicon, too, because  
> it provides definitions for "words", but we use it in a very different  
> way, and these aren't just any words, but words that define the semantics of,
> say, annotations.
>
>>> Rosetta Project: This is more a lexicon than a language database
>>
>> I would answer your "Could this resource be (directly) reduced to a list
>> of terms? => Lexicon" clearly with no. I found texts and language
>> classification on rosetta.org.
>
> That's a nice test, actually. Considering the Freebase data, it seems to  
> be something like Glottolog, providing language identifiers and Document  
> IDs (not strictly speaking bibliographical, though). I presume that the  
> intended function is actually to provide IDs for the Rosetta resources,
> so, with the slightly broadened definition of REFERENCE, it could be put  
> in there.
>
>>> SIMPLE, lemonWordnet, lemonWiktionary, lemonUby: These are lexica and
>>> the URLs are clearly stated in the datahub source
>>
>> to me an RDF conversion of a lexicon is no lexicon anymore
>
> Why's that?
>
>> I am still unsure about Multext-East, but Alpino and Semantic Quran are
>> again only RDF versions, hence no corpora. The original underlying data
>> might have been corpora, but an RDF version is no corpus (to me).
>
> Well, using RDF to represent full corpora is recommendable under certain
> circumstances, e.g., if the annotations are too complex to be processed
> using more established technologies building on, say, tab-separated
> text, plain lists (e.g., Penn Treebank: Syntax) or XML (e.g., TIGER).
> However, encoding the full primary data in RDF causes a lot of overhead
> (we need to specify precedence explicitly, because this information is
> *lost* in the RDF data model), so in many situations, a hybrid solution
> with the primary data represented in a more conventional format (say, a
> text file) and the annotations represented in RDF may be the most
> efficient modeling. (For certain types of primary data, e.g., audio
> streams, encoding the primary data in RDF may not even be possible.)
> In either case, the dataset still represents a corpus if the RDF contains
> pointers to the primary data and the primary data can be retrieved (even
> if not in RDF itself).
>
>>> WikiWord, WordNet 3.0, LODAC BDLS: These are also lexica, surely?
>>
>> WikiWord is only a tool to build a lexicon, but not a lexicon as such
>
> But the WikiWord Thesaurus is:  
> http://datahub.io/dataset/wikiword_thesaurus
>
> We could create another category "TOOL" whose instances are *not* put in  
> the diagram.
>
>>> OLiA, lingvoj, LexInfo: These, much as ISOcat, provide categories for
>>> linguistic annotation, these resource should all be in the same  
>>> category
>>
>> unlike ISOcat these data sets contain no single word entry; again, the
>> user gets ontologies here, not lexicons (your first two questions for
>> the lexicon category below have to be answered with "no" here)
>
> I would classify all as METADATA.
>
>>>     Keeping the
>>>     LLOD cloud as a pure linguistic data cloud and providing a
>>>     possibility to link to the LOD cloud (which already contains the
>>>     "unsure" data sets) could be a practical option here.
>
> I would personally advise against purism at this point, as we're still  
> in the early stages, but the discussion should be encouraged. Would any  
> of the DBpedia people like to explain why they put it here in the first  
> place?
>
> Best,
> Christian


-- 
Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 10, #401b
mail: chiarcos at informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931


