[open-linguistics] META-NET Data Liberation Campaign

Wed Dec 5 21:01:07 UTC 2012

2012/12/5 Nancy Ide <ide at cs.vassar.edu>:
> I would like to raise a concern here that calling for "open for research"
> licensing is potentially damaging to our interests, in the sense that it
> promotes a practice that is counter to what I assume (hope) is the overall
> goal: fully open data, restricted to no one for any purpose and thereby
> supportive of collaborative development across both nonprofit and commercial
> organizations. Given the EU's promotion of collaboration between research
> and industry in their funding model, it would seem that this would be in the
> interest of the official language bodies in Europe as well.

Probably. We had this discussion with Kimmo Rossi in October, and he
was very sympathetic to the idea, but if I recall correctly also
somewhat skeptical as to whether this is realistic. One of the
side-effects of the strong research-industry ties required for EU
projects is that participating companies may have claims on any
resulting products, including corpora, lexicons, etc. One
justification is that the projects are (obligatorily) co-funded by
these companies. Of course, researchers and EU officials would like to
see publicly funded research lead to resources that are open, but
there is no way to legally enforce it. At best, one may appeal to the
funding agencies to include shareability of resulting resources in
their call descriptions, and to consider explicit statements on
openness, etc. as an evaluation criterion, but that would be basically
it.

With corpora, we have an additional problem with copyright issues, and
this is probably the reason why even academic projects to create
reference corpora are hesitant to release their data under open
licenses. For some of the reference corpora for German, for example,
copyright agreements allow only small parts of the corpus to be
released to the public (for example, less than 10% of the 1.8
billion-token corpus compiled at the Berlin-Brandenburg Academy -- only
the Kernkorpus can be queried online,
http://www.dwds.de/ressourcen/korpora/#part_4), and they are also
restrictive as to the mode of presentation (online querying only,
limited context view) -- otherwise, no agreement with the publishers
would have been possible at all. The situation for the "German Reference
Corpus" (DEREKO) is similar, see diagram under
http://www.ids-mannheim.de/kl/projekte/korpora/archiv.html#Umfang.
This is not meant as criticism of the colleagues in Berlin or Mannheim;
they're doing a great job, and I guess this was all that could be
achieved (and -- maybe -- what seemed necessary) at the time when
these agreements were negotiated.

But unless these texts become public domain (which should take
decades), they will *never* be published under an open license, as that
would run counter to the business model of, say, newspaper archives, and
thus cause economic harm. (And that's where our politicians and the
lobby behind them would be extremely sensitive. As an anecdote on how
sensitive, one may think of last year's regulations that forced the
German state-owned television channels to limit the availability of
news items in their online archives to no more than 3 years -- this
was a conscious decision to allow private broadcasters to establish
commercial news archives, even though it meant crippling public services
without any need, and actually caused additional effort -- and costs
-- to adjust the infrastructure accordingly.)

It might be possible, however, to persuade the copyright holders to
agree to an academic license, or to adjust current copyright laws to
establish more flexible guidelines for academic use of these resources.
I think the latter is one of the goals behind the initiative.

So, for this particular field, an academic license is all we can
realistically hope for. I also see the danger of establishing a practice
of restrictive research licenses, though. The dilemma is somewhat
unresolvable. The only thing we can do is to provide guidelines and to
formulate desiderata on which license to use under which circumstances.

All the best,
Christian

> A look at the impact of the promotion of GNU (copyleft) and "share-alike"
> licenses makes my point: promotion of these licenses as the "good citizen's
> license" has had a subtle but pervasive impact on software and data
> licensing, in that these licenses are at this point the de facto licenses of
> choice. Unfortunately, such licenses are often not suitable for commercial
> use because of the requirement to distribute results under the same terms.
> So what we have is a grass roots effort to be open that in fact has had the
> result of obstructing full openness. I fear that the promotion of the even
> stronger research-only restriction will have a similar, and even more
> damaging, effect.
>
> I would recommend promotion of something like the Apache 2.0 license
> (http://www.apache.org/licenses/LICENSE-2.0), even if (as pointed out in the
> note below) it is not likely that such a license would be acceptable in this
> instance. That would send the message that a fully open license is what the
> community feels is the good citizen's choice, in that it supports
> collaborative development among both nonprofit and commercial organizations.
> If there cannot be agreement to adopt this type of licensing, so be it, but
> the message from the community should be clear about what we see as the
> ideal.
>
> Nancy Ide
>
> =======================================================
> Nancy Ide
> Professor of Computer Science
>
> Department of Computer Science
> Vassar College
> Poughkeepsie, New York 12604-0520
> USA
>
> tel: (+1 845) 437 5988
> fax: (+1 845) 437 7498
> email: ide at cs.vassar.edu
> http://www.cs.vassar.edu/~ide
> =======================================================
>
> On Dec 3, 2012, at 7:25 PM, Christian Chiarcos <christian.chiarcos at web.de>
> wrote:
>
> Dear all,
>
> as most of us probably know, a number of "reference corpora" for major (and
> minor) languages of Europe have been produced in recent decades, but
> many of them are not fully available to the public (not even under a
> restrictive license), or are available in a snippet view on the web only (and
> hence unusable for NLP or advanced statistical analyses) -- not to mention
> open licenses.
>
> To address this issue, META-NET have prepared an open letter to all the
> official language bodies in Europe and to those holding onto the various
> corpora calling on them to consider trying to make this important language
> data available for research purposes.  If you feel that there is a huge
> benefit to liberating these corpora and making them available for research
> then please contact your local language body and let them know that you are
> in favour of the META-NET proposal.
>
> More on this can be found on our blog, in a recent post by John Judge,
> META-NET Ireland
> (http://linguistics.okfn.org/2012/11/19/meta-net-data-liberation-campaign/,
> from where the last paragraph was quoted). I'd like to thank John for
> replicating his original post there and hope this initiative receives some
> support from the OWLG.
>
> Certainly, making these resources available under a research license would
> not be sufficient in the eyes of many on the list, but it would definitely
> be an important (and more easily achievable) step towards the further
> liberation of linguistic data.
>
> Thank you,
> Christian
> --
> Christian Chiarcos
> Information Sciences Institute
> University of Southern California
> 4676 Admiralty Way #1001
> Marina del Rey, CA 90292
> tel: +1-310-448-9391
> fax: +1-310-448-8599
> http://purl.org/chiarcos/home
> chiarcos at isi.edu
>
> _______________________________________________
> open-linguistics mailing list
> open-linguistics at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/open-linguistics
> Unsubscribe: http://lists.okfn.org/mailman/options/open-linguistics