[open-linguistics] Creation of a joint linguistic LOD cloud

Thu Nov 3 19:32:00 UTC 2011

Of course, I will contribute my ontologies for linguistic annotations
that formalize a number of annotation schemes and link them to ISOcat and  
GOLD: http://purl.org/olia. (Available online, will be published under  
CC-BY as soon as the reference publication has appeared.)

Christian

On Thu, 03 Nov 2011 20:22:47 +0100, Christian Chiarcos  
<christian.chiarcos at web.de> wrote:

> On Thu, 03 Nov 2011 16:28:53 +0100, Nancy Ide <ide at cs.vassar.edu> wrote:
>
>> For those of us who were not at the meeting, which type of datasets do  
>> you want?
>
> We actually discussed this briefly at the meeting as well (although I
> missed a lot, participating via Skype only). The general idea was to
> accept everything that can be reasonably linked to other linguistic
> resources such as corpora, dictionaries, thesauri, plain word lists,
> collocation data, etc. In the end, the contributors will decide on the
> actual definition; we wouldn't rule out anything.
>
> The crucial point is whether the data can be assumed to be usefully  
> linked with other people's data, which is certainly true for corpora and
> lexical-semantic resources, but possibly not for results from  
> psycholinguistic experiments, which are tied to a particular setup and  
> stimuli (unless someone objects).
>
> As for myself, I am particularly interested in modeling linguistic  
> corpora, and I can provide a corpus in RDF, with OWL/DL-defined data  
> types. I also thought about converting MASC for this purpose. Other  
> possibilities would be (parts of) the Open Parallel corpus  
> (http://opus.lingfil.uu.se) or the Copenhagen Dependency Treebank  
> (http://code.google.com/p/copenhagen-dependency-treebank).
> @Nancy: Is the RDF representation of MASC already available online?
> If so, I would focus on one of the latter corpora.
>
> A second question is how large the datasets have to be. Again, we
> wouldn't prescribe anything, so providers have to decide for themselves
> whether the amount of data they provide represents a reasonable
> starting point. For richly annotated corpora, for example, even small
> samples could be of interest, as the community still has to work out
> schemes to properly represent linguistic annotations (say, in parallel
> or coreference-annotated corpora) in RDF and RDF-based formalisms.
>
> Best,
> Christian