[open-linguistics] How to represent LLOD diagram categories at datahub ?
Christian Chiarcos
christian.chiarcos at web.de
Mon Nov 25 18:14:51 UTC 2013
Dear Bettina, dear John,
first of all, thanks to Bettina for her review of the data sets. I
probably missed a part of the conversation, but please find my two cents
below:
>> LLOD data set categories:
>> 1) Data representing words = Lexicon
>> 2) Data representing texts = Corpus
>> 3) Data representing information about languages = Language Database
>> 4) Data derived from but conserving original data = Ontology
>>
>> How is this an "ontology"? I would call it something like
>> "statistical data"
>
...
>
>> 5) Data representing further knowledge about the data (such as
>> bibliographical references) = Meta Data
>> 6) maybe: category for further relevant data sets not directly
>> concerned with linguistic data
On #1: Given the fact that we now have FrameNet (through uby), and that
some of our lexicons do not really deal with words, but more with
concepts, we should probably extend the definition to information beyond
words:
#1 "information about words and semantic concepts"
On #2 vs. #4: I feel that an LLOD definition of "corpus" should allow for
a standoff solution, i.e., a corpus where the annotations are provided,
but the source data isn't (this is to be expected as the ultimate goal of
a developmental trajectory that we see manifested in generalized formats
such as GrAF). Actually, NIF explicitly allows us to do so. But with only
annotations being in RDF and the original text in, say, HTML or TEI,
possibly with some RDFa attributes, the linked open data component does
not really *represent* the text, but only the linguistic structure of a
text. These should remain valid corpora, but according to your definition,
they would become #4, because they may conserve parts of the original data
(e.g., lemma annotation). So, I would suggest rephrasing #2 as
#2 "information about texts and their linguistic organization"
On #3 vs. #4: A similar problem persists with respect to ASJP: This would
be derived information, hence #4? But what does derivation actually mean
here?
On #4 vs. #6: Honestly, I'm not too happy about #4, neither with the name
nor with the definition. Maybe rename the category to "other" for the time being?
Dissatisfying, of course, but possibly the most achievable compromise. The
above-mentioned problems would fade away, and we would avoid the slightly
dissatisfying definition of the "non-linguistic" category #6.
On #3 vs. #5: If we single out information about language as #3 (which I
would support, if we have enough datasets), then we should also
distinguish purely bibliographical information from linguistically
relevant information. Bibliographical data sets are included, even if not
concerned with linguistic literature, because they point to possible data
sources for linguistic information, but they also represent an old, rich
and growing group of LOD data sets which are clearly distinct from
everything else we're dealing with.
As a minor remark on Glottolog, I would clearly see it more on the side of
#5 than #3, but in the end, I would suggest that this is up to the data
provider to decide. So, any ideas on this, Sebastian?
To sum up (with new numbers, to avoid confusion):
a) LEXICON: "information about words and semantic concepts" (=#1)
b) CORPUS: "information about texts and their linguistic organization"
(=#2)
(footnote: "includes text fragments, e.g., to account for corpora
composed out of scrambled, isolated sentences")
c) LANGUAGE_DATABASE: "information about languages" (= #3)
d) REFERENCE: "information about linguistic data sources" (mostly
bibliography, but we can include the LREMap and tool repositories, here,
too) (bibliographical part of #5)
e) METADATA: "information about the description of linguistic data"
(non-bibliographical part of #5)
(this is the classical definition of metadata, and it fits vocabularies,
schemes and terminology repositories)
f) OTHER: "information derived from linguistic data, or linguistically
relevant datasets not directly containing linguistic data" (= #6+#4)
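To make the proposal easier to check against concrete data sets, the six categories can be written down as a small lookup structure. This is only an illustrative sketch: the names `LlodCategory` and `SUGGESTED` are hypothetical, the definitions are copied from the list above, and the assignments are the tentative ones suggested further down in this message, not an agreed classification.

```python
from enum import Enum


class LlodCategory(Enum):
    """Proposed LLOD diagram categories a)-f), with their working definitions."""
    LEXICON = "information about words and semantic concepts"
    CORPUS = "information about texts and their linguistic organization"
    LANGUAGE_DATABASE = "information about languages"
    REFERENCE = "information about linguistic data sources"
    METADATA = "information about the description of linguistic data"
    OTHER = ("information derived from linguistic data, or linguistically "
             "relevant datasets not directly containing linguistic data")


# Tentative assignments suggested in this message (assumptions, open to discussion).
SUGGESTED = {
    "ISOcat": LlodCategory.METADATA,
    "OLiA": LlodCategory.METADATA,
    "lingvoj": LlodCategory.METADATA,
    "LexInfo": LlodCategory.METADATA,
    "Rosetta Project": LlodCategory.REFERENCE,
}

if __name__ == "__main__":
    for dataset, category in sorted(SUGGESTED.items()):
        print(f"{dataset}: {category.name} ({category.value})")
```

Keeping the definition next to each label means that a disputed assignment can be argued against the wording of the definition rather than against the label alone.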
>> ISOcat: This is either metadata or language database (I am not sure of
>> the distinction), it is certainly not a lexicon
>
> since the entries are words it looked like a lexicon to me. I don't
> think it's a language database because it contains no languages. If
> others disagree on the lexicon category I would go for meta data (but I
> would answer on your category question "Could this resource be
> (directly) reduced to a list of terms? => Lexicon" with "yes" as well
> and therefore stay in the lexicon category)
Clearly metadata. Indeed, it is something like a lexicon, too, because it
provides definitions for "words", but we use it in a very different way,
and these aren't just any words, but words that define the semantics of, say,
annotations.
>> Rosetta Project: This is more a lexicon than a language database
>
> I would answer your "Could this resource be (directly) reduced to a list
> of terms? => Lexicon" clearly with no. I found on rosetta.org texts and
> language classification
That's a nice test, actually. Considering the Freebase data, it seems to
be something like Glottolog, providing language identifiers and Document
IDs (not strictly speaking bibliographical, though). I presume that the
intended function is actually to provide ids for the Rosetta resources,
so, with the slightly broadened definition of REFERENCE, it could be put
in there.
>> SIMPLE, lemonWordnet, lemonWiktionary, lemonUby: These are lexica and
>> the URLs are clearly stated in the datahub source
>
> to me an RDF conversion of a lexicon is no lexicon anymore
Why's that?
> I am still unsure with Multext-East, but Alpino and Semantic Quran are
> again only RDF versions, hence no corpora. The original RDF underlying
> data might have been corpora but an RDF version is no corpus (to me)
Well, using RDF to represent full corpora is advisable under certain
circumstances, e.g., if the annotations are too complex to be processed
using more established technologies building on, say, tab-separated text,
plain lists (e.g., Penn Treebank: Syntax) or XML (e.g., TIGER). However,
encoding the full primary data in RDF causes a lot of overhead (we need to
specify precedence explicitly, because this information is *lost* in the
RDF data model), so in many situations, a hybrid solution with primary
data represented in a more conventional format (say, a text file), and
annotations represented in RDF, may be the most efficient modeling.
(For certain types of primary data (e.g., audio streams), encoding the
primary data in RDF may not even be possible.)
In either case, the dataset still represents a corpus if the RDF contains
pointers to the primary data and the primary data can be retrieved (even
if it is not itself in RDF).
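The hybrid solution can be sketched in a few lines. This is a minimal illustration, not how NIF or any particular tool implements it: the primary text stays a plain string, the annotations are kept in an unordered set (standing in for an RDF graph, which has no inherent order), and each annotation points into the text by character offsets. The names `PRIMARY_TEXT`, `Annotation` and `tokens_in_order` are hypothetical, and the sentence and lemmas are made-up sample data.

```python
from dataclasses import dataclass

# Primary data, kept outside the annotation layer (e.g., in a text file).
PRIMARY_TEXT = "The cats sleep."


@dataclass(frozen=True)
class Annotation:
    begin: int   # offset of the first character of the annotated span
    end: int     # offset one past the last character of the span
    lemma: str   # the linguistic annotation itself


# Like triples in an RDF graph, the elements of a set carry no order.
ANNOTATIONS = {
    Annotation(9, 14, "sleep"),
    Annotation(0, 3, "the"),
    Annotation(4, 8, "cat"),
}


def tokens_in_order(text, annotations):
    """Recover token order from the offsets and resolve the pointers."""
    ordered = sorted(annotations, key=lambda a: a.begin)
    return [(text[a.begin:a.end], a.lemma) for a in ordered]


if __name__ == "__main__":
    for form, lemma in tokens_in_order(PRIMARY_TEXT, ANNOTATIONS):
        print(form, lemma)
```

The offsets serve two purposes here: they are the pointers into the primary data, and they are the only place where precedence between tokens is stated explicitly.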
>> WikiWord, WordNet 3.0, LODAC BDLS: These are also lexica, surely?
>
> WikiWord is only a tool to build a lexicon but no lexicon as such
But the WikiWord Thesaurus is: http://datahub.io/dataset/wikiword_thesaurus
We could create another category "TOOL" whose instances are *not* put in
the diagram.
>> OLiA, lingvoj, LexInfo: These, much as ISOcat, provide categories for
>> linguistic annotation, these resources should all be in the same category
>
> unlike ISOcat these data sets contain no single word entry, again the
> user gets ontologies here and no lexicons (your first two questions for
> the lexicon category below have to be negated here)
I would classify all as METADATA.
>> Keeping the
>> LLOD cloud as a pure linguistic data cloud and providing a
>> possibility to link to the LOD cloud (which already contains the
>> "unsure" data sets) could be a practical option here.
I would personally advise against purism at this point, as we're still in
the early stages, but the discussion should be encouraged. Would any of
the DBpedia people like to explain why they put it here in the first place?
Best,
Christian
--
Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany
office: Robert-Mayer-Str. 10, #401b
mail: chiarcos at informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931