[open-linguistics] How to represent LLOD diagram categories at datahub ?

John P. McCrae jmccrae at cit-ec.uni-bielefeld.de
Mon Nov 25 15:02:03 UTC 2013


Hi Bettina,

Thanks for your reply.

I find it odd to separate RDF conversions from other data of the same
category. For me an RDF lexicon is an RDF lexicon, whether it was
originally published in RDF or in XML and then converted to RDF.
Also, as RDF is the only data we currently include in the cloud, nearly all
resources have been converted into RDF.


ISOcat is really not a lexicon; it is a controlled vocabulary, perhaps not
as axiomatic as an ontology, but the purpose is the same. An ontology with
labels can easily be converted to a lexicon; the difference between
lingvoj, etc. and WordNet lies in the purpose for which the resource is
used, namely that they are meant for annotation (as is ISOcat). Perhaps
the first question needs to be revised somehow.
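
A minimal sketch of the kind of label-to-lexicon conversion I mean (the
lemon namespace is the standard one, but the URI scheme used here for the
generated entries, forms and senses is made up purely for illustration):

    # Turn the rdfs:label triples of an ontology into lemon-style lexical entries.
    from rdflib import RDF, RDFS, Graph, Literal, Namespace, URIRef

    LEMON = Namespace("http://lemon-model.net/lemon#")

    def labels_to_lexicon(ontology: Graph) -> Graph:
        lexicon = Graph()
        lexicon.bind("lemon", LEMON)
        for concept, label in ontology.subject_objects(RDFS.label):
            # Hypothetical URI scheme: one entry, form and sense per labelled concept
            entry = URIRef(str(concept) + "#lex-entry")
            form = URIRef(str(concept) + "#lex-form")
            sense = URIRef(str(concept) + "#lex-sense")
            lexicon.add((entry, RDF.type, LEMON.LexicalEntry))
            lexicon.add((entry, LEMON.canonicalForm, form))
            lexicon.add((form, LEMON.writtenRep, Literal(label)))
            # The sense ties the lexical entry back to the ontology concept
            lexicon.add((entry, LEMON.sense, sense))
            lexicon.add((sense, LEMON.reference, concept))
        return lexicon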

Regards,
John


On Mon, Nov 25, 2013 at 5:38 AM, Bettina Klimek <
klimek at informatik.uni-leipzig.de> wrote:

>  On 11/22/2013 04:56 PM, John P. McCrae wrote:
>
> Hi Bettina
>
>
> On Fri, Nov 22, 2013 at 5:54 AM, Bettina Klimek <
> klimek at informatik.uni-leipzig.de> wrote:
>
>> Dear all,
>>
>> at the last telco I agreed to categorize all the data sets in the
>> datahub. In the telco we set up the following three categories:
>>
>> Lexicon - based on data containing words
>> Corpus - based on data containing text
>> Linguistic Database - based on data containing languages
>>
>> After I looked up every single data set in the datahub several problems
>> occurred:
>>
>> 1) some data sets seem to be wrong/spam
>>
> Yep, that is the purpose of the "upgrade" of DataHub to remove some spam
>
>
> It seems like someone already deleted the spam data sets :)
>
>
>     2) the source URLs in some data sets do not work or are missing
>> 3) there seem to be fewer data sets at datahub than there are bubbles in
>> the cloud diagram
>>
> The cloud diagram is generated from datahub.io... it is not possible that
> there are more bubbles than are present in the system
>
>
> then the other data sets must be somewhere else on datahub; I only looked
> at the ones in the linguistics group
> (http://www.datahub.io/organization/linguistics) because I assumed I would
> find all data sets there
>
>      4) some data sets are not linked yet
>>
>> Concerning these issues I would like to know whether someone is managing
>> and maintaining the correctness and completeness of the data set entries at
>> datahub. I think this is a matter of quality that should not be ignored.
>>
> Well, we discussed this at the last telco... the conclusion is that we
> should look into the development of a new repository, which does things
> such as checking that URLs are valid and automatically computing the
> triple count and link count.
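>
> Roughly, I imagine something along these lines (requests and rdflib here
> are just one possible choice, and the URL below is a placeholder rather
> than a real dump):
>
>     # Hypothetical sketch: validate a dataset's RDF dump URL and count
>     # triples and outgoing links.
>     import requests
>     from rdflib import Graph
>
>     def check_dataset(rdf_url):
>         # 1) Is the source URL reachable at all?
>         reachable = requests.head(rdf_url, allow_redirects=True, timeout=10).ok
>         triples = links = 0
>         if reachable:
>             g = Graph()
>             g.parse(rdf_url)  # rdflib tries to guess the serialisation format
>             triples = len(g)  # total triple count
>             # 2) Outgoing links = triples whose object points to another host
>             host = rdf_url.split("/")[2]
>             links = sum(1 for s, p, o in g
>                         if str(o).startswith("http") and host not in str(o))
>         return reachable, triples, links
>
>     print(check_dataset("http://example.org/dataset.rdf"))  # placeholder URL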
>
>  Also, as has been noted, DataHub has recently been upgraded to "unusable",
> so a better service would be highly useful.
>
>>
>> 5) a categorization in terms of the three categories defined above does
>> not work for all the data sets
>>
>> While going through all the data sets, manifold questions concerning the
>> LLOD cloud as such occurred to me. The most important one is: what is
>> linguistic data? It is not defined anywhere on the website, and the data
>> sets reveal all kinds of data which differ in the information they provide,
>> their purpose of representation, the data types they supply to the user,
>> as well as in their linguistic content. To me it seemed as if the
>> categorization of the data sets takes the second step without the first.
>> The first step should be to set up a definition of what linguistic data is
>> and, as a result, which kinds of data sets are to be expected as belonging
>> in the cloud.
>>
>>
>> When thinking of the category labels I tried to find some that would be
>> broadly accepted by the users. However, there seems to be no ideal user. A
>> user could be a linguist interested in human languages, a computational
>> linguist, a computer scientist or someone interested in NLP (even more
>> kinds of users are thinkable). The data sets mirror this diversity of users
>> in the data they provide. I think it is impossible to find category
>> labels which can cover all the special research areas linguistic data could
>> be used for. That is why I tried to establish categories everybody
>> understands intuitively. Thus I have assigned the data sets to the
>> following 5-6 categories, defined as follows:
>>
>> LLOD data set categories:
>> 1) Data representing words = Lexicon
>> 2) Data representing texts = Corpus
>> 3) Data representing information about languages = Language Database
>> 4) Data derived from but conserving original data = Ontology
>>
> How is this an "ontology"? I would call it something like "statistical
> data"
>
>
> I used "ontology" here because this is what the data sets in this category
> contain. I couldn't come up with a better category label for all the data
> sets that are mainly interested in providing RDF or OWL versions of other
> data, and I thought that with the label "ontology" the user group looking
> only for such data could identify it more easily. Maybe the definition
> should then be changed into "data representing other data as semantic web
> technology", or we simply name it "derived data"? But in principle, to me
> it is just data which is derived from other data (which in most cases isn't
> even provided).
>
>
>     5) Data representing further knowledge about the data (such as
>> bibliographical references) = Meta Data
>> 6) maybe: category for further relevant data sets not directly concerned
>> with linguistic data
>>
>> As you can see I kept the first two categories because they are
>> straightforward and users intuitively know what a lexicon and a corpus are.
>> I established category #3 for those data sets dealing with certain
>> information on a large number of languages which cannot easily be
>> understood as lexicon or corpus. This holds for PHOIBLE and ASJP. The
>> former lists phonological segments that are below the word and text level
>> and which are provided for many languages. The latter is a language
>> classification based on lexical similarities (which are not given though).
>> Another case is Glottolog, which is also a language classification but
>> additionally provides relevant bibliographical sources. Since these
>> references are set up in a distinct application (Langdoc), Glottolog is one
>> special case covering two categories: Language Database and Meta Data.
>> Another case, even more diverse, is the Multext-East data set, which seems
>> to cover the categories Lexicon, Corpus as well as Ontology. We can debate
>> the assignment of a data set to two categories. Maybe it is better to
>> require everybody uploading data sets to select only one default category.
>>
>> Furthermore, the assignment of the data sets to the categories is based on
>> what is provided primarily! The ASIt, for instance, evokes the
>> impression that some kind of Italian corpora are the content of the data
>> set. However, it is rather interested in presenting the analysis of various
>> corpora (which are not provided as such!) in the form of RDF Schema. Data
>> sets like these are the reason why I set up category #4. These are
>> problematic because they are based on human language data but converted
>> into a machine-readable language, thereby totally changing the kind of user
>> interested in the data. The linguist would be interested in a user-friendly,
>> searchable language corpus, while the computational linguist/computer
>> scientist is only concerned with the RDF data, for instance. To solve this
>> problem I assigned data sets primarily providing human language data (even
>> if there is also an RDF version) to the categories #1 to #3, and data sets
>> concerned with machine language in any way (only using human language
>> material in order to do so) to category #4.
>>
>> I think it is essential to put these definitions into the cloud (maybe as
>> tooltips), because the Meta Data category for instance will be associated
>> with different information depending on whether a user is concerned with
>> human language or with computer languages.
>>
>> My classification of the data sets can be seen at:
>>
>> https://docs.google.com/spreadsheet/ccc?key=0AkVaxylrRsewdGwtckloTlYyZ25iY2duTkQwdlZBaHc&usp=sharing
>>
> There seem to be a lot of odd classifications here:
>
>  Leipzig Corpus Collection: Corpus, not lexicon
>
>
> thank you, corrected it
>
>    ISOcat: This is either metadata or language database (I am not sure of
> the distinction); it is certainly not a lexicon
>
>
> since the entries are words, it looked like a lexicon to me. I don't think
> it's a language database because it contains no languages. If others
> disagree on the lexicon category I would go for meta data (but I would also
> answer your category question "Could this resource be (directly) reduced
> to a list of terms? => Lexicon" with "yes" and therefore stay in the
> lexicon category)
>
>
>     Rosetta Project: This is more a lexicon than a language database
>
>
> I would answer your "Could this resource be (directly) reduced to a list
> of terms? => Lexicon" clearly with no. On rosetta.org I found texts and a
> language classification.
>
>
>    SIMPLE, lemonWordnet, lemonWiktionary, lemonUby: These are lexica and
> the URLs are clearly stated in the datahub source
>
>
> to me an RDF conversion of a lexicon is no lexicon anymore
>
> SIMPLE: no URL here: http://www.datahub.io/dataset/simple
>
> lemonWordnet: no URL here: http://www.datahub.io/dataset/lemonwordnet
>
> LemonWiktionary: no URL here: http://www.datahub.io/dataset/lemonwiktionary
>
> lemonUby: no URL here: http://www.datahub.io/dataset/lemonuby
>
> - this seems to be my fault, as I assumed the source URL would be in the
> "additional info" table in the "source" field at the bottom of the
> "main" datahub entry page, just as it is for most of the other data sets
> (I thought the source URL could be found consistently at the same place in
> that table...)
>
>
>     Lemon: This is a schema, not data (hence why it isn't in Datahub)
>
>
> I deleted it from the Google spreadsheet
>
>    WikiWord, WordNet 3.0, LODAC BDLS: These are also lexica, surely?
>
>
> WikiWord is only a tool to build a lexicon, not a lexicon as such
>
> WordNet 3.0 is again only an RDF conversion of a lexicon (the Princeton
> WordNet) and therefore no lexicon to me
>
> LODAC BDLS is a "linked data version" of the BDLS, so not a lexicon as such
> anymore
>
>
>    OLiA, lingvoj, LexInfo: These, much as ISOcat, provide categories for
> linguistic annotation; these resources should all be in the same category
>
>
> unlike ISOcat, these data sets contain not a single word entry; again, the
> user gets ontologies here, not lexicons (your first two questions for the
> lexicon category below have to be answered with "no" here)
>
>     Alpino, Semantic Quran, Multext-East: These are corpora, surely?
>
>
> I am still unsure about Multext-East, but Alpino and Semantic Quran are
> again only RDF versions, hence not corpora. The original underlying data
> might have been corpora, but an RDF version is no corpus (to me)
>
>
>
>> There you can also see some "unsure data sets" such as DBpedia. I think
>> these correspond to the "resources used to assist and augment
>> language processing applications, even if the nature of the resource is not
>> deeply entrenched in Linguistics, but only as long as the usefulness is
>> well motivated" as explained on:
>> http://www.semantic-web-journal.net/blog/call-multilingual-linked-open-data-mlod-2012-data-post-proceedings.
>> I couldn't come up with an appropriate category label for them yet (maybe
>> "indirectly relevant resources"?!). And I am not sure if they should be in
>> the cloud at all, because the data sets are already so heterogeneous that
>> adding not primarily linguistic data sets might cause even more confusion
>> for the user, and maybe also frustration when he won't find only linguistic
>> data. Keeping the LLOD cloud as a pure linguistic data cloud and providing
>> a possibility to link to the LOD cloud (which already contains the "unsure"
>> data sets) could be a practical option here.
>>
>> This is just another proposal for a possible classification. I think it
>> is efficient enough to also cover data sets which are to come in the
>> future. I know it is very broad and not very specific, but since I am of
>> the opinion that further sub-classifications need to be developed with
>> regard to the different linguistic user types, we can work on that more
>> carefully later. As a first impression of the cloud the 5-6 categories I
>> proposed here should be sufficient to get an overview of the main cloud
>> content.
>>
> I think it would be best to prefer very broad categories; things can
> get fuzzy quite quickly (for example, WordNet contains examples and
> definition sentences: is it therefore a corpus?)
>
>
> I think WordNet is not fuzzy; if the main aim is centred around words, it
> can't be a corpus. Giving some example sentences only serves to add to the
> meaning of the words and is not intended to provide a broad text resource
> (as a corpus would).
>
>    I think that perhaps it would be good to find some simple criteria to
> classify resources that would clarify the kind of data in certain
> circumstances, such as the following (a rough sketch of these checks as a
> decision procedure follows the list):
>
> Could I expect to find an entry (page) for a single lemma word (e.g.,
> "cat")? => Lexicon
>  Could this resource be (directly) reduced to a list of terms? => Lexicon
>  Does this resource consist primarily of sentences? => Corpus
>  Is there annotation directly on sentences? => Corpus
>  Does this resource define the set of parts of speech as part of its data
> (not schema)? => Language Database
>  Does this resource describe languages (instead of words in that
> language)? => Language Database
>  etc.
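>
> Sketched as a small decision procedure (the field names of the answers
> dictionary are invented purely for illustration):
>
>     # Hypothetical mapping of the questions above onto the three categories.
>     def classify(answers):
>         if answers.get("entry_per_lemma") or answers.get("reducible_to_term_list"):
>             return "Lexicon"
>         if answers.get("mostly_sentences") or answers.get("sentence_level_annotation"):
>             return "Corpus"
>         if answers.get("defines_pos_as_data") or answers.get("describes_languages"):
>             return "Language Database"
>         return "unclear / other"
>
>     # e.g. a WordNet-like resource has an entry per lemma => Lexicon
>     print(classify({"entry_per_lemma": True, "mostly_sentences": False}))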
>
>
> I agree with these test questions for the three categories lexicon, corpus
> and language database. In fact, the answer for some of the data sets is
> "no" for all three categories. These are mostly the data sets I put into my
> ontology category, because I think they differ fundamentally from the
> others.
>
>
>  Regards,
> John
>
>> Looking forward to your comments,
>> Bettina
>>
>>
>>
>>
>>
>
>
>
>
>
>
> --
> Bettina Klimek
>
> Universität Leipzig
> Institut für Informatik
> Augustusplatz 10, 04109 Leipzig
>
> e-mail: klimek at informatik.uni-leipzig.de
>
>
> _______________________________________________
> open-linguistics mailing list
> open-linguistics at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/open-linguistics
> Unsubscribe: http://lists.okfn.org/mailman/options/open-linguistics
>
>

