[open-linguistics] CC licenses

Brian MacWhinney macw at cmu.edu
Thu May 31 21:58:48 UTC 2012


Sebastian,
    This is not just about worrying that someone might make some money.  It is more about worrying about how non-academics will represent data.  I think that many academics, including myself, do not trust companies to follow reasonable ethical guidelines.  In part, this arises from cases in which academics have gone over to the "dark side" by setting up businesses that sell various types of snake oil, based on supposed interpretations of academic facts.  This has happened in about five instances in the area of child language and it is also a major issue for second language research.  If possible, it might be best not to mention the specific people and companies involved, partly because that would distract from the basic issue.
    If there were a way to verify that companies were treating data in a responsible academic way, then I am sure that the people who have contributed their corpora would have zero problem with CC-BY-SA.  For example, it is fine that Adam Kilgarriff has included CHILDES and TalkBank data in his SketchEngine concordance system.  Sure, he makes some money, but there is no possible misrepresentation of the data in this framework.
   Of course there are additional issues about audio and video, but those are privacy issues that we have to deal with in CHILDES and TalkBank in the first place through password control, anonymization, etc.

-- Brian MacWhinney

On May 31, 2012, at 11:27 PM, Sebastian Nordhoff wrote:

> On Thu, 31 May 2012 15:33:08 +0300, Brian MacWhinney <macw at cmu.edu> wrote:
> 
>> Dear Open-Linguistics,
>>    I have just now subscribed to this list, based on urging from Sebastian Hellmann.  I was interested in the idea of incorporating the CHILDES and TalkBank corpora for spoken language into LLOD and Sebastian asked me why we were relying on the CC-NC license, rather than the CC-BY-SA license.  I told him that the basic motivation involved the feelings of the people who had contributed data to the corpus.  Our data include audio and video and transcripts from children, students, aphasics, etc. across many languages.  What we would like to avoid is the possibility that someone would find that a company was "making money" from the audio or video of their children or parents without properly asking them.  We are not interested in any commercial interests ourselves.  Isn't CC-NC the right choice in this case? Is this a problem for the goals of LOD?  In general, we try to make our data as freely available to researchers as possible without any sort of license.  A small fraction of the corpora (3%) are password protected, but the others are not.
> 
> Dear Brian,
> thanks for raising this issue. CC-NC has a very attractive name, and many people feel that what they are doing is noncommercial in nature. As a consequence, they choose CC-NC. The scenario in language documentation (where I come from) is always that Disney takes some folk dance and makes a cartoon out of it. This is what researchers want to avoid.
> 
> The problem is that CC-NC does not only rule out Disney, but also
> - people who use the data on their website and have ads on it, generating (minimal) revenue. This is commercial
> - people who use the data to train their software and then sell that software. This is commercial
> - people who use the data to refine their search engines. This is commercial if the search engine makes money
> - companies who provide an app for free which uses the data. The company is a commercial entity, so the usage is commercial (I believe)
> 
> Furthermore, the vision of Linked Open Data is to aggregate data from very many sources. If only a tiny bit of NC is in the aggregate, the whole thing basically becomes NC.
> 
> I share your feelings about video, and to a certain extent audio. These are very personal data, and privacy seems to be the main issue here, rather than the possibility to generate revenue. I would probably only release those data as CC-No Derivatives, but YMMV.
> 
> Textual data, especially of the banal nature of everyday speech, is much less sensitive in this regard. For most textual data, I find it hard to imagine a scenario where a stretch of conversation could end up in a commercial product and cause offense. To put it bluntly: Disney is not interested in transcripts of Mummy and Baby talking about bananas.
> 
> The take-home message for me is that CC-NC has a very appealing name, but it misleads people with respect to what it actually covers. CC-NC should never be the default choice, but there are circumstances where it is the right license. In the case of Glottolog for instance, we have to use CC-NC because otherwise we could not use Australian data because of some cultural sensitivities there. For CHILDES, the question would be: are there any reasons not to use simple CC-BY for textual data, audio, and video? What are these reasons, and is there a way to achieve the same goal without using CC-NC?
> 
> The discussion about the right licenses for data involving human subjects has only just started, and the issue is far from settled.
> 
> Best wishes
> Sebastian
> 
>> 
>> -- Brian MacWhinney
>> 
>> 
>> 
>> _______________________________________________
>> open-linguistics mailing list
>> open-linguistics at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/open-linguistics
>> 
>> 




