[open-linguistics] Creation of a joint linguistic LOD cloud

Thu Nov 3 19:22:47 UTC 2011

On Thu, 03 Nov 2011 16:28:53 +0100, Nancy Ide <ide at cs.vassar.edu> wrote:

> For those of us who were not at the meeting, which type of datasets do  
> you want?

We actually had this discussion briefly at the meeting as well (although
I missed a lot, participating via Skype only). The general idea was to
accept everything that can reasonably be linked to other linguistic
resources, such as corpora, dictionaries, thesauri, plain word lists,
collocation data, etc. In the end, the contributors will decide on the
actual definition; we wouldn't rule out anything.

The crucial point is whether the data can be assumed to be usefully linked  
with other people's data, which is certainly true for corpora and
lexical-semantic resources, but possibly not for results from  
psycholinguistic experiments, which are tied to a particular setup and  
stimuli (unless someone objects).

As for myself, I am particularly interested in modeling linguistic  
corpora, and I can provide a corpus in RDF, with OWL/DL-defined data  
types. I also thought about converting MASC for this purpose. Other  
possibilities would be (parts of) the Open Parallel corpus  
(http://opus.lingfil.uu.se) or the Copenhagen Dependency Treebank  
(http://code.google.com/p/copenhagen-dependency-treebank).
@Nancy: Is the RDF representation of MASC already available online?
If so, I would focus on one of the latter corpora.

A second question is how large the datasets have to be. Again, we wouldn't
prescribe anything, so providers have to decide for themselves whether the
amount of data they provide is a reasonable starting point. For
richly annotated corpora, for example, even small samples could be of
interest, as the community still has to work out schemes to properly
represent linguistic annotations (say, for parallel corpora or
coreference-annotated corpora) in RDF and RDF-based formalisms.

Best,
Christian