[open-linguistics] CC licenses

Sebastian Nordhoff sebastian_nordhoff at eva.mpg.de
Thu May 31 21:27:12 UTC 2012

On Thu, 31 May 2012 15:33:08 +0300, Brian MacWhinney <macw at cmu.edu> wrote:

> Dear Open-Linguistics,
>     I have just now subscribed to this list, based on urging from  
> Sebastian Hellman.  I was interested in the idea of incorporating the  
> CHILDES and TalkBank corpora for spoken language into LLOD and Sebastian  
> asked me why we were relying on the CC-NC license, rather than the  
> CC-BY-SA license.  I told him that the basic motivation involved the  
> feelings of the people who had contributed data to the corpus.  Our data  
> include audio and video and transcripts  from children, students,  
> aphasics, etc. across many languages.  What we would like to avoid is  
> the possibility that someone would find that a company was "making  
> money" from the audio or video of their children or parents without  
> properly asking them.  We are not interested in any commercial interests  
> ourselves.  Isn't CC-NC the right choice in this case? Is this a problem  
> for the goals of LOD?  In general, we try to make our data as freely  
> available to researchers as possible without any sort of license.  A  
> small fraction of the corpora (3%) are password protected, but the  
> others are not.

Dear Brian,
thanks for raising this issue. CC-NC has a very attractive name, and many  
people feel that what they are doing is noncommercial in nature. As a  
consequence, they choose CC-NC. The scenario in language documentation  
(where I come from) is always that Disney takes some folk dance and makes  
a cartoon out of it. This is what researchers want to avoid.

The problem is that CC-NC rules out not only Disney, but also:
- people who use the data on their website and have ads on it, generating  
(minimal) revenue. This is commercial.
- people who use the data to train their software and then sell that  
software. This is commercial.
- people who use the data to refine their search engines. This is  
commercial if the search engine makes money.
- companies who provide an app for free which uses the data. The company  
is a commercial entity, so the usage is commercial (I believe).

Furthermore, the vision of Linked Open Data is to aggregate data from very  
many sources. If only a tiny bit of NC is in the aggregate, the whole  
thing basically becomes NC.
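
To make the aggregation point concrete, here is a minimal sketch in Python.
It assumes, as a deliberate simplification and not as legal advice, that an
aggregate inherits the union of the restrictions of its sources. The
restriction labels and both function names are hypothetical and chosen only
for illustration.

```python
# Simplified model (an assumption): each license maps to a set of
# restrictions, and an aggregate carries the union of all of them.
RESTRICTIONS = {
    "CC-BY": {"attribution"},
    "CC-BY-SA": {"attribution", "share-alike"},
    "CC-BY-NC": {"attribution", "noncommercial"},
}


def aggregate_restrictions(licenses):
    """Return the union of restrictions over all source licenses."""
    combined = set()
    for name in licenses:
        combined |= RESTRICTIONS[name]
    return combined


def commercial_use_allowed(licenses):
    """Commercial use is possible only if no source is noncommercial."""
    return "noncommercial" not in aggregate_restrictions(licenses)


# 99 CC-BY sources plus a single CC-BY-NC source.
sources = ["CC-BY"] * 99 + ["CC-BY-NC"]
print(commercial_use_allowed(["CC-BY"] * 99))  # True
print(commercial_use_allowed(sources))         # False
```

Under this model a single NC source out of a hundred is enough to block
commercial use of the whole aggregate.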

I share your feelings about video, and to a certain extent audio. These  
are very personal data, and privacy seems to be the main issue here, rather  
than the possibility of generating revenue. I would probably only release  
those data as CC-No Derivatives, but YMMV.

Textual data, especially of the banal nature of everyday speech, is much  
less sensitive in this regard. For most textual data, I find it hard to  
imagine a scenario where a stretch of conversation could end up in a  
commercial product and cause offense. To put it bluntly: Disney is not  
interested in transcripts of Mummy and Baby talking about bananas.

The take-home message for me is that CC-NC has a very appealing name, but  
it misleads people with respect to what it actually covers. CC-NC should  
never be the default choice, but there are circumstances where it is the  
right license. In the case of Glottolog, for instance, we have to use CC-NC  
because otherwise we could not use Australian data, owing to some  
cultural sensitivities there. For CHILDES, the question would be: are  
there any reasons not to use simple CC-BY for textual data, audio, and  
video? What are these reasons, and is there a way to achieve the same goal  
without using CC-NC?

The discussion about the right licenses for data involving human subjects  
has only just started, and the issue is far from settled.

Best wishes

> -- Brian MacWhinney
