[open-linguistics] How to represent LLOD diagram categories at datahub ?

Bettina Klimek klimek at informatik.uni-leipzig.de
Mon Nov 25 10:38:25 UTC 2013

On 11/22/2013 04:56 PM, John P. McCrae wrote:
> Hi Bettina
> On Fri, Nov 22, 2013 at 5:54 AM, Bettina Klimek
> <klimek at informatik.uni-leipzig.de> wrote:
>     Dear all,
>     at the last telco I agreed to categorize all the data sets in the
>     datahub. In the telco we set up the following three categories:
>     Lexicon - based on data containing words
>     Corpus - based on data containing text
>     Linguistic Database - based on data containing languages
>     After I looked up every single data set in the datahub several
>     problems occurred:
>     1) some data sets seem to be wrong/spam
> Yep, that is the purpose of the "upgrade" of DataHub to remove some spam

It seems like someone already deleted the spam data sets :)

>     2) the source URLs in some data sets do not work or are missing
>     3) there seem to be fewer data sets at datahub than there are
>     bubbles in the cloud diagram
> The cloud diagram is generated from datahub.io... it is not possible 
> that there are more bubbles than are present in the system

Then the other data sets must be somewhere else on datahub. I only
looked at the ones in the linguistics group
(http://www.datahub.io/organization/linguistics) because I assumed I
would find all data sets there.

>     4) some data sets are not linked yet
>     Concerning these issues I would like to know if there is someone
>     managing and maintaining the correctness and completeness of the
>     data set entries at datahub? I think this is a matter of quality
>     that should not be ignored.
> Well, we discussed this at the last telco... the conclusion is that we 
> should look into the development of a new repository, which does 
> things like check that URLs are valid and automatically counts things 
> like the triple count and link count.
> Also as has been noted DataHub has been recently upgraded to 
> "unusable", so a better service would be highly useful.
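
As an illustration of the kind of check such a repository could run, here
is a minimal Python sketch that flags data set entries whose source URL is
missing or malformed. The function names and the sample entries are
hypothetical, and nothing here reflects an existing DataHub API. The
reachability test is left as an optional callable so that the sketch makes
no assumption about how a URL would actually be fetched.

```python
from urllib.parse import urlparse


def source_url_problem(url):
    """Return a short problem label, or None if the URL looks usable."""
    if url is None or not url.strip():
        return "missing"
    parts = urlparse(url.strip())
    # Only absolute http(s) URLs with a host are accepted here (an assumption).
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return "malformed"
    return None


def audit_sources(datasets, is_reachable=None):
    """Map data set name -> problem label for every entry with a bad source.

    `datasets` maps a data set name to its source URL (or None).
    `is_reachable` is an optional callable(url) -> bool supplied by the
    caller; if given, syntactically valid URLs are also tested with it.
    """
    problems = {}
    for name, url in datasets.items():
        problem = source_url_problem(url)
        if problem is None and is_reachable is not None and not is_reachable(url):
            problem = "unreachable"
        if problem is not None:
            problems[name] = problem
    return problems


if __name__ == "__main__":
    # Hypothetical entries, for illustration only.
    sample = {
        "example-lexicon": "http://example.org/lexicon",
        "example-corpus": None,
        "example-database": "www.example.org/db",
    }
    print(audit_sources(sample))
```

The triple and link counts mentioned above would need access to the data
itself and are not covered by this sketch.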
>     5) a categorization in terms of the three categories defined above
>     does not work for all the data sets
>     While going through all the data sets manifold questions
>     concerning the LLOD cloud as such occurred to me. The most
>     important one is: what is linguistic data? It is not defined
>     anywhere on the website and the data sets reveal all kinds of data
>     which differ in their information provided, purpose of
>     representation, data types they supply to the user as well as in
>     the linguistic content. To me it seemed as if the categorization
>     of the data sets is taking the second step without the first. The
>     first step should be to set up a definition of what linguistic
>     data is and, as a result, which kinds of data sets are to be expected
>     to belong in the cloud.
>     When thinking of the category labels I tried to find some that
>     would be broadly accepted by the users. However there seems to be
>     no ideal user. A user could be a linguist interested in human
>     languages, a computer linguist, a computer scientist or someone
>     interested in NLP (still other kinds of users are conceivable). The
>     data sets mirror the diversity of users within the data they
>     provide. I think it is impossible to find category labels which
>     can cover all the special research areas linguistic data could be
>     used for. That is why I tried to establish categories everybody
>     understands intuitively. Thus I have assigned the data sets to the
>     following 5-6 categories, which are defined as follows:
>     LLOD data set categories:
>     1) Data representing words = Lexicon
>     2) Data representing texts = Corpus
>     3) Data representing information about languages = Language Database
>     4) Data derived from but conserving original data = Ontology
> How is this an "ontology"? I would call it something like 
> "statistical data"

I used "ontology" here because this is what the data sets in this
category contain. I couldn't come up with a better category label for
all the data sets that are mainly interested in providing RDF or OWL
versions of other data, and I thought that with the label "ontology" the
user group looking only for such data could identify it more easily.
Maybe the definition should then be changed to "data representing
other data as semantic web technology", or the category simply named
"derived data"? But in principle, to me it is just data which is derived
from other data (which in most cases isn't even provided).

>     5) Data representing further knowledge about the data (such as
>     bibliographical references) = Meta Data
>     6) maybe: category for further relevant data sets not directly
>     concerned with linguistic data
>     As you can see I kept the first two categories because they are
>     straightforward and users intuitively know what a lexicon and a
>     corpus are. I established category #3 for those data sets dealing
>     with certain information on a large number of languages which
>     cannot easily be understood as lexicon or corpus. This holds for
>     PHOIBLE and ASJP. The former lists phonological segments that are
>     below the word and text level and which are provided for many
>     languages. The latter is a language classification based on
>     lexical similarities (which are not given though). Another case is
>     glottolog being a language classification as well but additionally
>     providing relevant bibliographical sources. Since these references
>     are set up in a distinct application (Langdoc) glottolog is one
>     special case covering two categories: Language Database and Meta
>     Data. Another case, even more diverse, is the Multext-East data
>     set which seems to cover the categories Lexicon, Corpus as well as
>     Ontology. We can debate about the assignment of a data set to two
>     categories. Maybe it is better to force everybody uploading data
>     sets to select only one default category.
>     Furthermore the assignment of the data sets to the categories is
>     based on what is to be provided primarily! The ASIt for instance
>     evokes the impression that some kind of Italian corpora are the
>     content of the data set. However it is rather interested in
>     presenting the analysis of various corpora (which are not provided
>     as such!) in the form of RDF Schema. Data sets like these are the
>     reason why I set up category #4. These are problematic because
>     they are based on human language data but converted into a
>     machine-readable language, which completely changes the kind of user
>     interested in the data. The linguist would be interested in a user
>     friendly searchable language corpus and the computer
>     linguist/scientist is only concerned with the rdf data for
>     instance. To solve this problem I assigned data sets primarily
>     providing human language data (even if there is also an rdf
>     version) to the categories #1 to #3 and data sets being concerned
>     with machine language in any way (only using human language
>     material in order to do so) to category #4.
>     I think it is essential to put these definitions into the cloud
>     (maybe as tooltips), because the Meta Data category for instance
>     will be associated with different information depending on whether
>     a user is concerned with human language or with computer languages.
>     My classification of the data sets can be seen at:
>     https://docs.google.com/spreadsheet/ccc?key=0AkVaxylrRsewdGwtckloTlYyZ25iY2duTkQwdlZBaHc&usp=sharing
> There seem to be a lot of odd classifications here:
> Leipzig Corpus Collection: Corpus not lexicon

thank you, corrected it

> ISOcat: This is either metadata or language database (I am not sure of 
> the distinction), it is certainly not a lexicon

Since the entries are words, it looked like a lexicon to me. I don't
think it's a language database because it contains no languages. If
others disagree with the lexicon category I would go for meta data (but I
would also answer your test question "Could this resource be
(directly) reduced to a list of terms? => Lexicon" with "yes", and would
therefore keep it in the lexicon category).

> Rosetta Project: This is more a lexicon than a language database

I would clearly answer your "Could this resource be (directly) reduced to
a list of terms? => Lexicon" with "no". On rosetta.org I found texts and
language classification.

> SIMPLE, lemonWordnet, lemonWiktionary, lemonUby: These are lexica and 
> the URLs are clearly stated in the datahub source

To me an RDF conversion of a lexicon is no longer a lexicon.

SIMPLE: no URL here http://www.datahub.io/dataset/simple

lemonWordnet: no URL here http://www.datahub.io/dataset/lemonwordnet

LemonWiktionary: no URL here http://www.datahub.io/dataset/lemonwiktionary

lemonUby: no URL here http://www.datahub.io/dataset/lemonuby

- this seems to be my fault, as I assumed the source URL would be in the
"source" field of the "additional info" table at the bottom of the
main datahub entry page, just as it is for most of the other data
sets (I thought the source URL could be found consistently in the same
place in that table).

> Lemon: This is a schema, not data (hence why it isn't in Datahub)

I deleted it from the Google spreadsheet

> WikiWord, WordNet 3.0, LODAC BDLS: These are also lexica, surely?

WikiWord is only a tool to build a lexicon but no lexicon as such

WordNet 3.0 is again only an RDF conversion of a lexicon (the Princeton
WordNet) and therefore no lexicon to me

LODAC BDLS is a "linked data version" of the BDLS, so no lexicon as such 

> OLiA, lingvoj, LexInfo: These, much as ISOcat, provide categories for 
> linguistic annotation, these resources should all be in the same category

Unlike ISOcat, these data sets contain not a single word entry. Again the
user gets ontologies here and not lexicons (your first two questions for
the lexicon category below have to be answered with "no" here).

> Alpino, Semantic Quran, Multext-East: These are corpora, surely?

I am still unsure about Multext-East, but Alpino and Semantic Quran are
again only RDF versions, hence no corpora. The original data underlying
the RDF might have been corpora, but an RDF version is no corpus (to me).

>     There you can also see some "unsure data sets" such as DBpedia. I
>     think these are corresponding to the "resources used to assist and
>     augment language processing applications, even if the nature of
>     the resource is not deeply entrenched in Linguistics, but only as
>     long as the usefulness is well motivated" as explained on:
>     http://www.semantic-web-journal.net/blog/call-multilingual-linked-open-data-mlod-2012-data-post-proceedings.
>     I couldn't come up with an appropriate category label for them yet
>     (maybe "indirectly relevant resources"?!). And I am not sure if
>     they should be in the cloud at all because the data sets are
>     already so heterogeneous that adding not primarily linguistic data
>     sets might cause even more confusion to the user and maybe also
>     frustration when he won't find only linguistic data. Keeping the
>     LLOD cloud as a pure linguistic data cloud and providing a
>     possibility to link to the LOD cloud (which already contains the
>     "unsure" data sets) could be a practical option here.
>     This is just another proposal for a possible classification. I
>     think it is efficient enough to also cover data sets which are to
>     come in the future. I know it is very broad and not very specific,
>     but since I am of the opinion that further sub-classifications
>     need to be developed with regard to the different linguistic user
>     types, we can work on that more carefully later. As a first
>     impression of the cloud the 5-6 categories I proposed here should
>     be sufficient to get an overview of the main cloud content.
> I think it would be best to prefer very broad categories, as things can
> get fuzzy quite quickly (for example, WordNet contains examples and
> definition sentences; is it therefore a corpus?)

I think WordNet is not fuzzy: if the main aim is centred around words, it
can't be a corpus. Giving some example sentences only serves to add to
the meaning of the words and is not intended to provide a broad text
resource (as a corpus would).

> I think that perhaps it would be good to find some simple criteria 
> to classify models that would clarify the kind of data in certain 
> circumstances, such as:
> Could I expect to find an entry (page) for a single lemma word (e.g.,
> "cat")? => Lexicon
> Could this resource be (directly) reduced to a list of terms? => Lexicon
> Does this resource consist primarily of sentences? => Corpus
> Is there annotation directly on sentences? => Corpus
> Does this resource define the set of parts of speech as part of its
> data (not schema)? => Language Database
> Does this resource describe languages (instead of words in that 
> language)? => Language Database
> etc.

I agree with these test questions for the three categories lexicon,
corpus and language database. In fact, the answer for some of the
data sets is "no" for all three categories. This mostly concerns the data
sets I put into my ontology category, because I think they differ
fundamentally from the others.
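
To make the test questions concrete, here is a minimal Python sketch that
turns them into a small decision procedure. The question keys, the function
name and the sample answers are hypothetical and only serve as an
illustration. An empty result means that all questions were answered with
"no", which is the case described above for most of the data sets in the
ontology category.

```python
# Each test question (as a hypothetical short key) and the category a
# "yes" answer points to, in the order the questions were proposed.
TEST_QUESTIONS = [
    ("has_entry_per_lemma", "Lexicon"),
    ("reducible_to_term_list", "Lexicon"),
    ("consists_primarily_of_sentences", "Corpus"),
    ("has_annotation_on_sentences", "Corpus"),
    ("defines_parts_of_speech_in_data", "Language Database"),
    ("describes_languages", "Language Database"),
]


def categories_for(answers):
    """Return the categories suggested by the "yes" answers, without repeats.

    `answers` maps a question key to True or False; a missing key counts
    as "no". The returned list is empty if every answer is "no".
    """
    suggested = []
    for key, category in TEST_QUESTIONS:
        if answers.get(key, False) and category not in suggested:
            suggested.append(category)
    return suggested


if __name__ == "__main__":
    # Hypothetical answers for an imaginary word list.
    print(categories_for({"reducible_to_term_list": True}))
    # Hypothetical answers for a data set that fails all six questions.
    print(categories_for({}))
```

A data set that gets more than one category back (as could happen for
Multext-East) would still need a decision about its primary category.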

> Regards,
> John
>     Looking forward to your comments,
>     Bettina
>     _______________________________________________
>     open-linguistics mailing list
>     open-linguistics at lists.okfn.org
>     <mailto:open-linguistics at lists.okfn.org>
>     http://lists.okfn.org/mailman/listinfo/open-linguistics
>     Unsubscribe: http://lists.okfn.org/mailman/options/open-linguistics

Bettina Klimek

Universität Leipzig
Institut für Informatik
Augustusplatz 10, 04109 Leipzig

e-mail: klimek at informatik.uni-leipzig.de
