[open-linguistics] CC licenses
Sebastian Nordhoff
sebastian_nordhoff at eva.mpg.de
Thu May 31 21:27:12 UTC 2012
On Thu, 31 May 2012 15:33:08 +0300, Brian MacWhinney <macw at cmu.edu> wrote:
> Dear Open-Linguistics,
> I have just now subscribed to this list, based on urging from
> Sebastian Hellman. I was interested in the idea of incorporating the
> CHILDES and TalkBank corpora for spoken language into LLOD and Sebastian
> asked me why we were relying on the CC-NC license, rather than the
> CC-BY-SA license. I told him that the basic motivation involved the
> feelings of the people who had contributed data to the corpus. Our data
> include audio and video and transcripts from children, students,
> aphasics, etc. across many languages. What we would like to avoid is
> the possibility that someone would find that a company was "making
> money" from the audio or video of their children or parents without
> properly asking them. We are not interested in any commercial interests
> ourselves. Isn't CC-NC the right choice in this case? Is this a problem
> for the goals of LOD? In general, we try to make our data as freely
> available to researchers as possible without any sort of license. A
> small fraction of the corpora (3%) are password protected, but the
> others are not.
Dear Brian,
thanks for raising this issue. CC-NC has a very attractive name, and many
people feel that what they are doing is noncommercial in nature. As a
consequence, they choose CC-NC. The scenario in language documentation
(where I come from) is always that Disney takes some folk dance and makes
a cartoon out of it. This is what researchers want to avoid.
The problem is that CC-NC rules out not only Disney, but also
- people who use the data on their website and have ads on it, generating
(minimal) revenue. This is commercial.
- people who use the data to train their software and then sell that
software. This is commercial.
- people who use the data to refine their search engines. This is
commercial if the search engine makes money.
- companies who provide an app for free which uses the data. The company
is a commercial entity, so the usage is commercial (I believe).
Furthermore, the vision of Linked Open Data is to aggregate data from very
many sources. If even a tiny part of the aggregate is NC, the whole
aggregate effectively becomes NC.
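
To make the aggregation point concrete, here is a rough sketch in Python. It is a deliberately simplified model and not legal advice: it assumes an aggregate can be used commercially only if every single source allows it, and it detects NC purely from the license label. The names Source, allows_commercial_use, aggregate_allows_commercial_use, and blocking_sources, as well as the example sources, are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Source:
    """A hypothetical data source with a license label such as 'CC-BY-NC'."""
    name: str
    license: str


def allows_commercial_use(license_id: str) -> bool:
    # Simplifying assumption: a license is non-commercial exactly when
    # "NC" is one of its hyphen-separated parts (e.g. "CC-BY-NC-SA").
    return "NC" not in license_id.upper().split("-")


def aggregate_allows_commercial_use(sources: List[Source]) -> bool:
    # Under this model the aggregate is only as permissive as its most
    # restrictive member: one NC source makes the whole aggregate NC.
    return all(allows_commercial_use(s.license) for s in sources)


def blocking_sources(sources: List[Source]) -> List[str]:
    # Names of the sources that prevent commercial use of the aggregate.
    return [s.name for s in sources if not allows_commercial_use(s.license)]


# Made-up example: three sources, only one of them NC.
example = [
    Source("source_a", "CC-BY"),
    Source("source_b", "CC-BY-SA"),
    Source("source_c", "CC-BY-NC"),
]
```

With these three made-up sources, aggregate_allows_commercial_use(example) is False and blocking_sources(example) names only source_c, even though the other two sources would permit commercial use on their own.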
I share your feelings about video, and to a certain extent audio. These
are very personal data, and privacy seems to be the main issue here, rather
than the possibility of generating revenue. I would probably only release
those data as CC-No Derivatives, but YMMV.
Textual data, especially of the banal nature of everyday speech, is much
less sensitive in this regard. For most textual data, I find it hard to
imagine a scenario where a stretch of conversation could end up in a
commercial product and cause offense. To put it bluntly: Disney is not
interested in transcripts of Mummy and Baby talking about bananas.
The take-home message for me is that CC-NC has a very appealing name, but
it misleads people with respect to what it actually covers. CC-NC should
never be the default choice, but there are circumstances where it is the
right license. In the case of Glottolog, for instance, we have to use CC-NC:
otherwise we could not use the Australian data, because of some
cultural sensitivities there. For CHILDES, the question would be: are
there any reasons not to use plain CC-BY for textual data, audio, and
video? What are these reasons, and is there a way to achieve the same goal
without using CC-NC?
The discussion about the right licenses for data involving human subjects
has only just started, and the issue is far from settled.
Best wishes
Sebastian
>
> -- Brian MacWhinney