[open-linguistics] How to represent LLOD diagram categories at datahub ?

Sebastian Hellmann hellmann at informatik.uni-leipzig.de
Tue Oct 8 12:33:19 UTC 2013


Hi all,
I think we need to close in on this from two sides:

1. What are our use cases?
I am sure that a dataset can be split up in many ways, e.g. by file,
by language, or by content. OLiA, for example, has many different layers
and files, but in the end we only want 1 OLiA bubble in the cloud and
not 170. Nevertheless, we would like to be able to automatically index
all resources belonging to OLiA. I really like the CKAN model
(http://datahub.io) with a dataset entry and resources. Data entry is
not optimal, though. I will ask on their list how they intend to manage
resources in bulk.
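
To make this use case a bit more concrete, here is a rough Python sketch
of one dataset entry that carries many resources. It does not use the
CKAN API; the class names, the fields and the example.org URLs are
invented for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Resource:
    """One file or endpoint belonging to a dataset (hypothetical fields)."""
    url: str
    language: str = ""


@dataclass
class Dataset:
    """One dataset entry, i.e. one bubble in the cloud diagram."""
    name: str
    resources: List[Resource] = field(default_factory=list)


def bubbles(datasets: List[Dataset]) -> List[Tuple[str, int]]:
    """Return one (name, resource count) pair per dataset, not per resource."""
    return [(d.name, len(d.resources)) for d in datasets]


def index_resources(datasets: List[Dataset]) -> List[Tuple[str, str]]:
    """Flatten all resources so each one can still be indexed on its own."""
    return [(d.name, r.url) for d in datasets for r in d.resources]


# Placeholder URLs, made up for this sketch.
olia = Dataset(
    "olia",
    [Resource(f"http://example.org/olia/file{i}.owl") for i in range(170)],
)
print(bubbles([olia]))
print(len(index_resources([olia])))
```

The point is only that the diagram is drawn from the dataset level,
while indexing walks over the resource level.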


2. Which definitions already exist?
I am aware that everybody always preaches reusing existing work. From my
experience, I would say that existing work is not really reusable in a
sustainable way. There are numerous reasons for this. Normally, it is
something like "insufficiently defined or documented definitions",
"missing license", "proprietary license", or "inappropriate definition
(too granular, too vague)".

As a solution, I would say that creating our own classification and
criteria is the most feasible option. Of course, this can be inspired by
other work, and we should reuse whatever we can to save effort. This will
also produce interlinking to existing definitions right from the start.
For example, the FaBiO model seems to be too complex where we don't need
it (e.g. the distinction between work and manifestation) and too vague
where we would need more granularity: fabio:ComputerFile (a frbr:Item).
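
As a rough sketch of what categories derived from criteria could look
like, here is a small Python example. It uses the "contains primary
data" criterion from earlier in this thread. The field names and the
three category labels are assumptions for illustration, not an agreed
vocabulary.

```python
from dataclasses import dataclass


@dataclass
class Criteria:
    """Yes/no criteria recorded per dataset (hypothetical field names)."""
    has_primary_data: bool
    annotations_target_primary_data: bool
    has_entries: bool  # main content is a list of entries, as in a dictionary


def categorize(c: Criteria) -> str:
    """Derive a category from criteria instead of assigning it by hand."""
    if c.has_entries:
        # Primary data, if present, only illustrates the entries.
        return "lexicon"
    if c.has_primary_data and c.annotations_target_primary_data:
        return "corpus"
    return "other"


# Fringe case from the thread: a dictionary citing newspaper sentences.
dictionary_with_examples = Criteria(
    has_primary_data=True,
    annotations_target_primary_data=False,
    has_entries=True,
)
annotated_recordings = Criteria(
    has_primary_data=True,
    annotations_target_primary_data=True,
    has_entries=False,
)
print(categorize(dictionary_with_examples))
print(categorize(annotated_recordings))
```

Because the category is computed from the criteria, a new coloring only
needs a new function over the same metadata.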


All the best,
Sebastian

On 07.10.2013 23:08, Hugh Paterson III wrote:
> Sebastian,
>
> towards the idea of resource classification: I did some digging
> around and some thinking on the idea of a corpus. I think it has been
> stated here that minimally a corpus is a two-part resource, though
> the parts need not be more than one file/object. Part 1 is an original
> data object and part 2 is an annotation. (Though a corpus may also
> include several of these data+annotation pairings, and possibly we
> need a new term for a set of associated pairings, or a new term for
> a single pairing instance.) I think the classic case from Language
> Documentation would be 10 audio recordings (and annotations) of short
> stories from language [xyz]. Do these constitute a single corpus, or
> is each of them a corpus? Examples of single pairings might include an
> audio file and a Praat annotation tier file, or a video file and an
> .eaf ELAN file, or the English text of Wikipedia and some sort of
> grammatical parsing output.
>
> It occurs to me that "we in linguistics" are probably not the only
> ones who have encountered data sets like corpora before. So, my
> question about these other domains is: "How have they described
> data+annotation units?" I think one thing to keep in focus in this
> discussion is form vs. function. Some categorization schemas are more
> function oriented, others are more form oriented, and truly both are
> important for building useful services off of linked data. Corpus, I
> feel, is one of those terms which in linguistics has multiple
> functional definitions, depending on whether the approach is NLP,
> Language Documentation, etc. So, I was looking for something more form
> oriented. I am not sure I have found anything, but I did find some
> interesting discussion in the ORE and FaBiO ontologies. The
> developers of FaBiO, the FRBR-aligned Bibliographic Ontology,
> discuss part of the point here in this post:
> http://opencitations.wordpress.com/2011/06/30/nomenclature-for-data-publications-and-citations/.
> Though I am sure that this data+annotation pairing has also been
> addressed in the DNA or medical areas of linked data and data
> storage. Does anyone know what patterns of description are being used
> in these cases?
>
> For what it's worth....
>
> - Hugh
>
>
>
> On Oct 6, 2013, at 4:37 AM, hellmann at informatik.uni-leipzig.de 
> <mailto:hellmann at informatik.uni-leipzig.de> wrote:
>
>> Let a thousand ontologies blossom!
>>
>> I am in favor of creating several different colorings with the 
>> potential to add your own.
>>
>> This can be modelled by:
>> A dimension/aspect of the coloring.
>> Then we need an assignment value->color.
>>
>> The reason for this is that I would like to make diagrams with 
>> different colors. One aspect could be for hosting/boasting purposes, 
>> e.g. which institute/company is hosting the data to give credit.
>>
>> This gives us pretty good features, e.g. we can make a heat map with
>> the number of described languages as the dimension.
>>
>> Furthermore, I am a big opponent of classifications and a great fan
>> of criteria. One clear criterion is whether the dataset contains
>> primary data. This would qualify it partially as a corpus in my
>> opinion. There are some fringe cases of course, e.g. dictionaries
>> citing sentences from newspapers as examples. So based on the
>> 'contains primary data' property, corpora could be defined as
>> 'datasets that have primary data and annotations relating to this
>> primary data'.
>>
>> Lexica 'may or may not contain primary data, but the primary data is
>> an annotation for the main content', i.e. the entries in a dictionary
>> are annotated by newspaper examples.
>>
>>
>> We probably should discuss it on this level, i.e. what kind of
>> differently colored clouds would we like to have, what dimensions or
>> aspects do we need, and what kind of metadata do we need to collect.
>>
>> Other than that, I would prefer prettiness as the main criterion for
>> the official LLOD cloud. Let's say 4-6 colors which are pleasing to
>> the eye ;) We probably do not have to make a science out of it and
>> can leave it fuzzy for now.
>>
>> @Hugh: we should aim at creating a consensual framework for resource
>> classification eventually...
>>
>> --Sebastian
>>
>>
>>
>>
>> Sebastian Nordhoff <sebastian_nordhoff at eva.mpg.de 
>> <mailto:sebastian_nordhoff at eva.mpg.de>> wrote:
>>
>>     On Sat, 05 Oct 2013 12:26:43 +0200, Christian Chiarcos
>>     <christian.chiarcos at web.de  <mailto:christian.chiarcos at web.de>> wrote:
>>
>>         Dear all,
>>
>>         earlier, we discussed categories for coloring the LLOD
>>         diagram. The diagram we prepared for LDL-2013 was based on
>>         something like the minimal consensus:
>>
>>         - lexicon (= LREMap lexicon, olac:lexicon)
>>         - corpus (= LREMap corpus, ~ olac:primary data)
>>         - language_description (basically everything else,
>>           ~ olac:language_description)
>>
>>         I guess the first two are unproblematic, but the third is
>>         very heterogeneous, it includes
>>
>>         - terminology repositories
>>         - typological databases
>>         - bibliographical databases
>>
>>         In a way, all of these "describe language" (information
>>         about languages, information about concepts relevant to the
>>         description of language, information about collections of
>>         language data), but honestly, I would prefer the label
>>         "other", because this is very different from what I think an
>>         olac:language_description is meant to be.
>>
>>
>>     As far as I can see, a language description would be a (sketch) grammar or
>>     a learner's manual or similar. I think we have none of those in the LLOD
>>     cloud (though we might in the future). olac:language_description does not
>>     seem to be a good choice there.
>>
>>     I agree with Christian that there is not a lot of internal coherence in
>>     group 3. What would be the reason against having 5 groups, rather than 3?
>>     The typological databases group nicely, and I intend to add some more
>>     typological databases over the next months.  Terminology repositories can
>>     also be grouped. This only leaves Glottolog as the odd one out, and we can
>>     call it "other".
>>
>>     I suppose we will have to have some labels for groups 3a and 3b, which
>>     should be dereferenceable. Is there not something like xyz:tabulardata for
>>     typological databases which we could subclass?
>>
>>     Best
>>     Sebastian
>>
>>
>>         Two questions
>>         - Is this general classification acceptable?
>>         - How shall we encode the categories? Using tags "lexicon",
>>           "corpus", etc.? Or using a custom field "LLOD category"?
>>
>>         Unless anyone protests, I would suggest to use tags for
>>         "lexicon" and "corpus" and classify everything without such a
>>         tag as "language_description".
>>
>>         Best,
>>         Christian
>>
>>
>>     ------------------------------------------------------------------------
>>
>>     open-linguistics mailing list
>>     open-linguistics at lists.okfn.org  <mailto:open-linguistics at lists.okfn.org>
>>     http://lists.okfn.org/mailman/listinfo/open-linguistics
>>     Unsubscribe:http://lists.okfn.org/mailman/options/open-linguistics
>>
>>


-- 
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Events:
* NLP & DBpedia 2013 (http://nlp-dbpedia2013.blogs.aksw.org, Extended 
Deadline: *July 18th*)
* LSWT 23/24 Sept, 2013 in Leipzig (http://aksw.org/lswt)
Come to Germany as a PhD: http://bis.informatik.uni-leipzig.de/csf
Projects: http://nlp2rdf.org , http://linguistics.okfn.org , 
http://dbpedia.org/Wiktionary , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org