[open-linguistics] Collection of resources

Nancy Ide ide at cs.vassar.edu
Sat Jan 15 18:57:58 UTC 2011


Hi everyone,

First, I am very open to the idea that I may be misinterpreting something, but if so, it is true of many of my colleagues in the field as well. If it turns out that we are wrong about copyleft data, then that would in fact be a great advantage for us!

So let me outline the case in computational linguistics and see what you all think. There are several kinds of linguistic data that could be considered, but the major ones are annotated corpora and lexical resources (e.g., WordNet). We use annotated corpora for building language models, so we need the whole corpus (not just the small "fair use" bits of it that concordancing services provide) to determine patterns. So one of the first concerns is whether or not it is "legal" for a researcher who has done some work using a particular annotated corpus to re-distribute the corpus in its entirety for use by anyone else. That would, for example, enable replication of the researcher's results, but more importantly, it would save others the enormously high cost of re-annotating it for their own use. As far as we believe at the moment, it is not possible to put a corpus up on your website, or otherwise distribute it freely and without restricting it to non-commercial use, unless the data in it are copyright-free or possibly distributed originally under some sort of attribution-only license. If you re-distribute it under a copyleft license in accordance with the original license, then commercial users may not be able to use it to develop commercial products, especially if the data are in some way directly incorporated into the product and thus re-distributed under different terms. This fact (or belief, as the case may be!) has led to a situation where the only corpora that are available for widespread use and re-use in computational linguistic research are severely domain-limited corpora such as the Wall Street Journal corpus (a situation the OANC is trying to remedy).

Commercial use comes more clearly into the discussion when we consider lexical and similar resources, like WordNet. Because WordNet is really open, you can put it in your product and then distribute your product under any terms you want. If the license were copyleft, however, the assumption is that a commercial user cannot put it in a product and then sell it. A similar case is extracted data, for example, named entities (names of people, places, organizations, etc.). If these were extracted from, say, Wikipedia, which is distributed under a copyleft license, two questions arise: (1) can you distribute the derived data without restriction, or should it also be distributed under copyleft? and (2) if it is distributed under copyleft, can you use the extracted data directly in a commercial product and then sell it?

To make this all more concrete, let me give our own case as an example: what we want to do in particular is go out on the web, find data distributed under copyleft, and make it a part of the OANC, which is distributed free of any restriction on its use or re-distribution. Can I do this, or do I have to re-distribute it with a copyleft license? If so, does this limit commercial use, or otherwise limit its use in any way? Since we don't know for sure, we stay away from copyleft data, but it would certainly be a big help if that data could be used.

Sebastian made a very good point:

> So it is more a policy issue whether you want to make it more difficult for people to lock away open data, as with a "share-alike" license they would have to ask the copyright holder.
> I'm not sure about which type of "commercial users" you are talking about. If they create traditional lock-in products then the "share-alike" is not good for business and means de-facto non-commercial.

I.e., copyleft is intended to ensure that such data are not re-distributed under a *more* restricted license, which could in fact prevent some commercial use. But it also precludes the data being distributed under a *less* restricted license. Therefore, we feel we cannot re-distribute copyleft data as part of the OANC, because we do not put any restriction on the use of the corpus (except attribution).

I am happy to hear where all the holes in my argument and thinking are! I look forward to your comments.

Best,
Nancy