[open-linguistics] How to represent LLOD diagram categories at datahub ?

Mon Oct 7 21:08:39 UTC 2013

Sebastian,

towards the idea of resource classification: I did some digging around and some thinking on the idea of a corpus. I think it has been stated here that a corpus is minimally a two-part resource, though the two parts need not be more than one file/object. Part 1 is an original data object and part 2 is an annotation. (A corpus may also include several of these data+annotation pairings, and possibly we need a new term for a set of associated pairings, or a new term for a single pairing instance.) I think the classic case from Language Documentation would be 10 audio recordings (and annotations) of short stories from language [xyz]. Do these constitute a single corpus, or is each one a corpus? Examples of single pairings might include an audio file and a Praat annotation tier file, or a video file and an .eaf ELAN file, or the English text of Wikipedia and some sort of grammatical parsing output.
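
To make the pairing idea concrete, here is a minimal sketch in Python. The names `Pairing` and `Corpus` are hypothetical and chosen only for illustration; they do not come from any existing ontology, and the file names are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Pairing:
    """One original data object plus one annotation of it (hypothetical name)."""
    data: str        # e.g. an audio or video file
    annotation: str  # e.g. a Praat TextGrid or an ELAN .eaf file


@dataclass
class Corpus:
    """A set of associated pairings (hypothetical name)."""
    label: str
    pairings: List[Pairing] = field(default_factory=list)

    def is_minimal(self) -> bool:
        # Minimal corpus in the sense above: exactly one data+annotation pairing.
        return len(self.pairings) == 1


# Ten recorded stories, each with its own annotation file (made-up file names).
stories = Corpus(
    label="stories-xyz",
    pairings=[
        Pairing(f"story{i:02d}.wav", f"story{i:02d}.TextGrid")
        for i in range(1, 11)
    ],
)
```

Under this reading the ten stories form one corpus of ten pairings; the other reading would simply be ten `Corpus` objects with one pairing each. The sketch does not settle which reading is the right one.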

It occurs to me that "we in linguistics" are probably not the only ones who have encountered data sets like corpora before. So my question to these other domains is: "How have they described data+annotation units?" I think one thing to keep in focus in this discussion is form vs. function. Some categorization schemas are more function oriented, others are more form oriented, and both are important for building useful services off of linked data. Corpus, I feel, is one of those terms which in linguistics has multiple functional definitions, depending on whether the approach is NLP, Language Documentation, etc. So I was looking for something more form oriented. I am not sure I have found anything, but I did find some interesting discussion in the ORE and FaBiO ontologies. The developers of FaBiO, the FRBR-aligned Bibliographic Ontology, discuss part of the point here in this post: http://opencitations.wordpress.com/2011/06/30/nomenclature-for-data-publications-and-citations/. I am sure that this data+annotation pairing has also been addressed in the DNA or medical areas of linked data and data storage. Does anyone know what patterns of description are being used in these cases?

For what it's worth...

- Hugh

On Oct 6, 2013, at 4:37 AM, hellmann at informatik.uni-leipzig.de wrote:

> Let a thousand ontologies blossom!
> 
> I am in favor of creating several different colorings with the potential to add your own. 
> 
> This can be modelled by:
> A dimension/aspect of the coloring.
> Then we need an assignment value->color. 
> 
> The reason for this is that I would like to make diagrams with different colors. One aspect could be for hosting/boasting purposes, e.g. which institute/company is hosting the data to give credit.
> 
> This gives us pretty good features, e.g. we can make a heat map with the number of described languages as the dimension.
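
The dimension plus value->color model described above can be sketched roughly as follows. This is only an illustration under assumptions: the function names, the flat metadata dictionary, the `host` key, the institute labels and the hex colors are all hypothetical, not an agreed LLOD convention.

```python
from typing import Callable, Dict

Dataset = Dict[str, object]  # flat metadata record (hypothetical shape)


def make_coloring(
    dimension: str,
    palette: Dict[object, str],
    default: str = "#cccccc",
) -> Callable[[Dataset], str]:
    """Build a coloring: look up a dataset's value on `dimension` in `palette`."""
    def color_of(dataset: Dataset) -> str:
        # Values missing from the palette (or from the dataset) get the default.
        return palette.get(dataset.get(dimension), default)
    return color_of


def heat_color(value: int, max_value: int) -> str:
    """Map a count onto a white-to-red scale by linear interpolation."""
    share = 0.0 if max_value <= 0 else min(max(value / max_value, 0.0), 1.0)
    level = round(255 * (1.0 - share))  # green/blue channels fade as value grows
    return f"#ff{level:02x}{level:02x}"


# One coloring per aspect, e.g. the hosting institution (made-up values).
by_host = make_coloring("host", {"institute-a": "#1f77b4", "institute-b": "#2ca02c"})
```

Adding your own coloring then means supplying another dimension name and palette; a heat map replaces the palette lookup with a numeric scale such as `heat_color`.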
> 
> Furthermore, I am a big opponent of classifications and a great fan of criteria. One clear criterion is whether the dataset contains primary data. This would qualify it partially as a corpus in my opinion. There are some fringe cases of course, e.g. dictionaries citing sentences from newspapers as examples. So based on the 'contains primary data' property, corpora could be defined as 'datasets that have primary data and annotations relating to this primary data'.
> 
> Lexica 'may or may not contain primary data, but the primary data is an annotation for the main content', i.e. the entries in a dictionary are annotated by newspaper examples.
> 
> 
> We should probably discuss it on this level, i.e. what kind of differently colored clouds we would like to have, what dimensions or aspects we need, and what kind of metadata we need to collect.
> 
> Other than that I would prefer prettiness as the main criterion for the official LLOD cloud. Let's say 4-6 colors which are pleasing to the eye ;) We probably do not have to make a science out of it and can leave it fuzzy for now.
> 
> @Hugh: we should aim at creating a consensual framework for resource classification eventually...
> 
> --Sebastian
> 
> 
> 
> 
> Sebastian Nordhoff <sebastian_nordhoff at eva.mpg.de> wrote:
> On Sat, 05 Oct 2013 12:26:43 +0200, Christian Chiarcos  
> <christian.chiarcos at web.de> wrote:
> 
> Dear all,
> 
> earlier, we discussed categories for coloring the LLOD diagram. The
> diagram we prepared for LDL-2013 was based on something like a
> minimal consensus:
> 
> - lexicon (= LREMap lexicon, olac:lexicon)
> - corpus (= LREMap corpus, ~ olac:primary data)
> - language_description (basically everything else, ~  
> olac:language_description)
> 
> I guess the first two are unproblematic, but the third is very  
> heterogeneous, it includes
> - terminology repositories
> - typological databases
> - bibliographical databases
> In a way, all of these "describe language" (information about languages,
> information about concepts relevant to the description of language,
> information about collections of language data), but honestly, I would
> prefer the label "other", because this is very different from what I
> think an olac:language_description is meant to be.
> 
> As far as I can see, a language description would be a (sketch) grammar or  
> a learner's manual or similar. I think we have none of those in the LLOD  
> cloud (though we might in the future). olac:language_description does not  
> seem to be a good choice there.
> 
> I agree with Christian that there is not a lot of internal coherence in  
> group 3. What would be the reason against having 5 groups, rather than 3?  
> The typological databases group nicely, and I intend to add some more  
> typological databases over the next months.  Terminology repositories can  
> also be grouped. This only leaves Glottolog as the odd one out, and we can  
> call it "other".
> 
> I suppose we will have to have some labels for groups 3a and 3b, which
> should be dereferenceable. Is there not something like xyz:tabulardata for
> typological databases which we could subclass?
> 
> Best
> Sebastian
> 
> 
> Two questions:
> - Is this general classification acceptable?
> - How shall we encode the categories? Using tags "lexicon", "corpus",
> etc.? Or using a custom field "LLOD category"? Unless anyone protests,
> I would suggest using tags for "lexicon" and "corpus" and classifying
> everything without such a tag as "language_description".
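
The tagging rule proposed here can be sketched as a small function. This is a sketch under assumptions: the function name is hypothetical, tags are assumed to arrive as plain strings, and the precedence for a dataset carrying both tags (lexicon wins) is an arbitrary choice that the proposal does not specify.

```python
from typing import Iterable


def llod_category(tags: Iterable[str]) -> str:
    """Derive the diagram category from a dataset's tags."""
    normalized = {tag.strip().lower() for tag in tags}
    if "lexicon" in normalized:
        return "lexicon"  # assumption: lexicon takes precedence over corpus
    if "corpus" in normalized:
        return "corpus"
    # Everything without one of the two tags falls into the third group.
    return "language_description"
```

With a custom "LLOD category" field instead of tags, the function would reduce to reading that field, at the cost of maintaining a second piece of metadata per dataset.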
> 
> Best,
> Christian
> 
> 
> open-linguistics mailing list
> open-linguistics at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/open-linguistics
> Unsubscribe: http://lists.okfn.org/mailman/options/open-linguistics
> 
> 
> -- 
> This message was sent from my Android mobile phone with K-9 Mail.
