[open-linguistics] How to represent LLOD diagram categories at datahub ?

Christian Chiarcos christian.chiarcos at web.de
Mon Nov 25 16:46:01 UTC 2013


2013/11/25 Bettina Klimek <klimek at informatik.uni-leipzig.de>:
> On 11/22/2013 04:56 PM, John P. McCrae wrote:
>
> Hi Bettina
>
>
> On Fri, Nov 22, 2013 at 5:54 AM, Bettina Klimek
> <klimek at informatik.uni-leipzig.de> wrote:
>>
>> Dear all,
>>
>> at the last telco I agreed to categorize all the data sets in the datahub.
>> In the telco we set up the following three categories:
>>
>> Lexicon - based on data containing words
>> Corpus - based on data containing text
>> Linguistic Database - based on data containing languages
>>
>> After I looked up every single data set in the datahub several problems
>> occurred:
>>
>> 1) some data sets seem to be wrong/spam
>
> Yep, that is the purpose of the "upgrade" of DataHub to remove some spam
>
>
> It seems like someone already deleted the spam data sets :)
>
>
>
>> 2) the source URLs in some data sets do not work or are missing
>> 3) there seem to be fewer data sets at datahub than there are bubbles in
>> the cloud diagram
>
> The cloud diagram is generated from datahub.io... it is not possible that
> there are more bubbles than are present in the system
>
>
> then there must be all the other data sets somewhere else on datahub, I only
> looked at the ones in the linguistics group
> (http://www.datahub.io/organization/linguistics) because I assumed to find
> all data sets there
>
>> 4) some data sets are not linked yet
>>
>> Concerning these issues I would like to know if there is someone managing
>> and maintaining the correctness and completeness of the data set entries at
>> datahub? I think this is a matter of quality that should not be ignored.
>
> Well, we discussed this at the last telco... the conclusion is that we
> should look into the development of a new repository, which does things like
> check that URLs are valid and automatically counts things like the triple
> count and link count.
>
> Also as has been noted DataHub has been recently upgraded to "unusable", so
> a better service would be highly useful.
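
To make the checks mentioned above a bit more concrete, here is a minimal
sketch of what such a repository might run on each entry. Everything in it
is an assumption for illustration: the function names are hypothetical, the
URL check is purely syntactic (no request is sent), the data is assumed to
be available as simple N-Triples text, and a "link" is taken to mean a
triple whose object IRI is on a different host than its subject.

```python
import re
from urllib.parse import urlparse

# Matches a simple N-Triples statement with an IRI subject and predicate.
# Blank-node subjects are not handled; this is a deliberate simplification.
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.+?)\s*\.\s*$')


def looks_like_http_url(url):
    """Syntactic check only: http(s) scheme and a host. No network request."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def count_triples_and_links(ntriples_text):
    """Return (triple count, link count) for simple N-Triples text.

    A link is assumed to be a triple whose object is an IRI on a
    different host than the subject IRI.
    """
    triples = links = 0
    for line in ntriples_text.splitlines():
        match = TRIPLE.match(line.strip())
        if not match:
            continue  # blank lines, comments, unsupported syntax
        triples += 1
        subject, obj = match.group(1), match.group(3)
        if obj.startswith("<") and obj.endswith(">"):
            if urlparse(obj[1:-1]).netloc != urlparse(subject).netloc:
                links += 1
    return triples, links
```

A real service would also have to fetch each source URL and use a proper
RDF parser; the sketch only shows that the counts can be derived from the
data rather than entered by hand.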
>
>>
>>
>> 5) a categorization in terms of the three categories defined above does
>> not work for all the data sets
>>
>> While going through all the data sets manifold questions concerning the
>> LLOD cloud as such occurred to me. The most important one is: what is
>> linguistic data? It is not defined anywhere on the website and the data sets
>> reveal all kinds of data which differ in their information provided, purpose
>> of representation, data types they supply to the user as well as in the
>> linguistic content. To me it seemed as if the categorization of the data
>> sets is taking the second step without the first. The first step should be
>> to set up a definition of what linguistic data is and, as a result, which
>> kinds of data sets are to be expected as belonging in the cloud.
>>
>>
>> When thinking of the category labels I tried to find some that would be
>> broadly accepted by the users. However there seems to be no ideal user. A
>> user could be a linguist interested in human languages, a computer linguist,
>> a computer scientist or someone interested in NLP (even more different users
>> are thinkable). The data sets mirror the diversity of users within the data
>> they provide. I think it is impossible to find category labels which can
>> cover all the special research areas linguistic data could be used for. That
>> is why I tried to establish categories everybody understands intuitively.
>> Thus I have assigned the data sets to the following 5-6 categories, which
>> are defined as follows:
>>
>> LLOD data set categories:
>> 1) Data representing words = Lexicon
>> 2) Data representing texts = Corpus
>> 3) Data representing information about languages = Language Database
>> 4) Data derived from but conserving original data = Ontology
>
> How is this an "ontology"? I would call it something like "statistical
> data"
>
>
> I used “ontology“ here because this is what the data sets in this category
> contain. I couldn’t come up with a better category label for all the data
> sets that are mainly interested in providing rdf or owl versions of other
> data and I thought with the label “ontology“ the user group looking only for
> such data could identify it easier that way. Maybe the definition should be
> changed then into “data representing other data as semantic web technology“
> or simply name it "derived data"? But in principle to me it is just data
> which is derived from other data (which in most cases isn’t even provided).
>
>
>
>> 5) Data representing further knowledge about the data (such as
>> bibliographical references) = Meta Data
>> 6) maybe: category for further relevant data sets not directly concerned
>> with linguistic data
>>
>> As you can see I kept the first two categories because they are
>> straightforward and users intuitively know what a lexicon and a corpus are.
>> I established category #3 for those data sets dealing with certain
>> information on a large number of languages which cannot easily be understood
>> as lexicon or corpus. This holds for PHOIBLE and ASJP. The former lists
>> phonological segments that are below the word and text level and which are
>> provided for many languages. The latter is a language classification based
>> on lexical similarities (which are not given though). Another case is
>> glottolog being a language classification as well but additionally providing
>> relevant bibliographical sources. Since these references are set up in a
>> distinct application (Langdoc) glottolog is one special case covering two
>> categories: Language Database and Meta Data. Another case, even more
>> diverse, is the Multext-East data set which seems to cover the categories
>> Lexicon, Corpus as well as Ontology. We can debate about the assignment of a
>> data set to two categories. Maybe it is better to force everybody uploading
>> data sets to select for only one default category.
>>
>> Furthermore the assignment of the data sets to the categories is based on
>> what is to be provided primarily! The ASIt for instance evokes the
>> impression that some kind of Italian corpora are the content of the data
>> set. However it is rather interested in presenting the analysis of various
>> corpora (which are not provided as such!) in form of rdf-schema. Data sets
>> like these are the reason why I set up category #4. These are problematic
>> because they are based on human language data but converted into a
>> machine-readable language, which completely changes who is interested in
>> the data. The linguist would be interested in a user-friendly searchable
>> language corpus and the computer linguist/scientist is only concerned with
>> the rdf data for instance. To solve this problem I assigned data sets
>> primarily providing human language data (even if there is also an rdf
>> version) to the categories #1 to #3 and data sets being concerned with
>> machine language in any way (only using human language material in order to
>> do so) to category #4.
>>
>> I think it is essential to put these definitions into the cloud (maybe as
>> tooltips), because the Meta Data category for instance will be associated
>> with different information depending on whether a user is concerned with
>> human language or with computer languages.
>>
>> My classification of the data sets can be seen at:
>>
>> https://docs.google.com/spreadsheet/ccc?key=0AkVaxylrRsewdGwtckloTlYyZ25iY2duTkQwdlZBaHc&usp=sharing
>
> There seem to be a lot of odd classifications here:
>
> Leipzig Corpus Collection: Corpus not lexicon
>
>
> thank you, corrected it
>
> ISOcat: This is either metadata or language database (I am not sure of the
> distinction), it is certainly not a lexicon
>
>
> since the entries are words it looked like a lexicon to me. I don't think
> it's a language database because it contains no languages. If others
> disagree on the lexicon category I would go for meta data (but I would
> answer on your category question “Could this resource be (directly) reduced
> to a list of terms? => Lexicon“ with "yes" as well and therefore stay in the
> lexicon category)
>
>
>
> Rosetta Project: This is more a lexicon than a language database
>
>
> I would answer your “Could this resource be (directly) reduced to a list of
> terms? => Lexicon“ clearly with no. I found on rosetta.org texts and
> language classification
>
>
>
> SIMPLE, lemonWordnet, lemonWiktionary, lemonUby: These are lexica and the
> URLs are clearly stated in the datahub source
>
>
> to me an RDF conversion of a lexicon is no lexicon anymore
>
> SIMPLE: no URL here http://www.datahub.io/dataset/simple
>
> lemonWordnet: no URL here http://www.datahub.io/dataset/lemonwordnet
>
> LemonWiktionary: no URL here http://www.datahub.io/dataset/lemonwiktionary
>
> lemonUby: no URL here http://www.datahub.io/dataset/lemonuby
>
> - this seems to be my fault as I assumed the source URL would be in the
> “additional info”-table in the “source”-field at the bottom on the
> “main”-datahub entry page, just as they are for most of the other data sets
> (I thought the source URL could be found consistently at the same place in
> that table..)
>
>
>
> Lemon: This is a schema, not data (hence why it isn't in Datahub)
>
>
> I deleted it from the Google spreadsheet
>
> WikiWord, WordNet 3.0, LODAC BDLS: These are also lexica, surely?
>
>
> WikiWord is only a tool to build a lexicon but no lexicon as such
>
> WordNet 3.0 is again only an RDF conversion of a lexicon (the Princeton
> WordNet) and therefore no lexicon to me
>
> LODAC BDLS is a “linked data version“ of the BDLS, so no lexicon as such
> anymore
>
>
>
> OLiA, lingvoj, LexInfo: These, much as ISOcat, provide categories for
> linguistic annotation; these resources should all be in the same category
>
>
> unlike ISOcat these data sets do not contain a single word entry; again the
> user gets ontologies here and no lexicons (your first two questions for the
> lexicon category below have to be answered with "no" here)
>
>
> Alpino, Semantic Quran, Multext-East: These are corpora, surely?
>
>
>
> I am still unsure about Multext-East, but Alpino and Semantic Quran are again
> only RDF versions, hence no corpora. The original data underlying the RDF
> might have been corpora but an RDF version is no corpus (to me)
>
>
>
>>
>> There you can also see some "unsure data sets" such as DBpedia. I think
>> these correspond to the "resources used to assist and augment
>> language processing applications, even if the nature of the resource is not
>> deeply entrenched in Linguistics, but only as long as the usefulness is well
>> motivated" as explained on:
>> http://www.semantic-web-journal.net/blog/call-multilingual-linked-open-data-mlod-2012-data-post-proceedings.
>> I couldn't come up with an appropriate category label for them yet (maybe
>> "indirectly relevant resources"?!). And I am not sure if they should be in
>> the cloud at all because the data sets are already so heterogeneous that
>> adding not primarily linguistic data sets might cause even more confusion to
>> the user and maybe also frustration when he won't find only linguistic data.
>> Keeping the LLOD cloud as a pure linguistic data cloud and providing a
>> possibility to link to the LOD cloud (which already contains the "unsure"
>> data sets) could be a practical option here.
>>
>> This is just another proposal for a possible classification. I think it is
>> efficient enough to also cover data sets which are to come in the future. I
>> know it is very broad and not very specific, but since I am of the opinion
>> that further sub-classifications need to be developed with regard to the
>> different linguistic user types, we can work on that more carefully later.
>> As a first impression of the cloud the 5-6 categories I proposed here should
>> be sufficient to get an overview of the main cloud content.
>>
> I think it would be best to prefer very broad categories; things can get
> fuzzy quite quickly (for example, WordNet contains examples and definition
> sentences; is it therefore a corpus?)
>
>
>
> I think wordnet is not fuzzy, if the main aim is centred around words it
> can't be a corpus. Giving some example sentences only serves to add to the
> meaning of the words but is not intended to give a broad text resource (as a
> corpus would do).
>
> I think that perhaps it would be good to find some simple criteria to
> classify models that would clarify the kind of data in certain
> circumstances, such as:
>
> Could I expect to find an entry (page) for a single lemma word (e.g.,
> "cat")? => Lexicon
> Could this resource be (directly) reduced to a list of terms? => Lexicon
> Does this resource consist primarily of sentences? => Corpus
> Is there annotation directly on sentences? => Corpus
> Does this resource define the set of parts of speech as part of its data (not
> schema)? => Language Database
> Does this resource describe languages (instead of words in that language)?
> => Language Database
> etc.
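
The test questions above could be written down as a small lookup, so that
every data set is checked against the same list. This is only a sketch
under assumptions: the question keys and the function name are hypothetical
labels introduced here, and the yes/no answers still have to come from a
person looking at the data set.

```python
# Hypothetical encoding of the test questions quoted above.
# Each entry pairs a question key (an invented label) with its category.
CRITERIA = [
    ("has_entry_per_lemma", "Lexicon"),
    ("reducible_to_term_list", "Lexicon"),
    ("primarily_sentences", "Corpus"),
    ("annotation_on_sentences", "Corpus"),
    ("defines_pos_as_data", "Language Database"),
    ("describes_languages", "Language Database"),
]


def classify(answers):
    """Return the categories whose questions were answered "yes".

    answers maps a question key to True/False; missing keys count as "no".
    An empty result means none of the three categories applies.
    """
    categories = []
    for key, category in CRITERIA:
        if answers.get(key, False) and category not in categories:
            categories.append(category)
    return categories
```

For a made-up data set with the answers {"has_entry_per_lemma": True},
classify returns ["Lexicon"]. A data set with "no" for every question
yields an empty list, and one with "yes" answers in two groups gets two
categories, so the sketch leaves open whether a single default category
should be forced.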
>
>
> I agree with these test questions for the three categories lexicon, corpus
> and language database. In fact the answer for some of the data sets is “no“
> for all three categories. These concern mostly the data sets I put into my
> ontology category, because I think they differ fundamentally from the
> others.
>
>
> Regards,
> John
>
>>
>> Looking forward to your comments,
>> Bettina
>>
>>
>>
>>
>> ----------------------------------------------------------------
>> This message was sent using IMP, the Internet Messaging Program.
>>
>>
>> _______________________________________________
>> open-linguistics mailing list
>> open-linguistics at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/open-linguistics
>> Unsubscribe: http://lists.okfn.org/mailman/options/open-linguistics
>
>
>
>
> --
> Bettina Klimek
>
> Universität Leipzig
> Institut für Informatik
> Augustusplatz 10, 04109 Leipzig
>
> e-mail: klimek at informatik.uni-leipzig.de
>