[open-linguistics] META-NET Data Liberation Campaign

Dave Lewis dave.lewis at cs.tcd.ie
Wed Dec 5 22:40:30 UTC 2012


A very interesting discussion - thanks.

I don't think there is anything inherent in the nature of EU-funded 
industry-academia cooperation that means the industrial partners must 
have rights over language resources produced. This is a matter for 
negotiation in forming the consortium agreement. I've seen projects in 
other thematic areas where the PO is very insistent that models, data 
and even software are made openly available royalty free.

For the language resources community, it may be more the _type_ of 
industrial partner that causes the imposition of restrictions. Companies 
(or public bodies) who are in the business of content publishing or 
content curation rightly see ownership over content as a key business 
resource. Now I guess it is these companies historically who are also 
most actively interested in language resources and language technology 
research and hence EU projects. However, there is a much wider body of 
organisations (commercial, public, social) that create, curate, 
annotate, translate etc content not as their core businesses, but as a 
means of communicating effectively with their 
customers/clients/citizens/members etc. They therefore don't have a 
vested interest in retaining rights over their content and may be 
persuaded that making it open enables value to be added to their (and 
others') content that may ultimately benefit them. This could be, for 
example, in terms of faster development of language technology in their 
domain/language, or wider access or increased usability of their content 
in novel third-party applications, e.g. using named entity recognition in 
combination with SMT or personalised cross-lingual search.

As content becomes increasingly perishable due to the accelerating 
lifecycles of the products and services it supports, organisations may 
be more easily persuaded that residual value is more likely to come from 
making it open and accessible sooner, as a resource for innovation, than 
from hoarding it privately.

Just my two cents,

Dave Lewis
CNGL

On 05/12/2012 21:01, Christian Chiarcos wrote:
> 2012/12/5 Nancy Ide <ide at cs.vassar.edu>:
>> I would like to raise a concern here that calling for "open for research"
>> licensing is potentially damaging to our interests, in the sense that it
>> promotes a practice that is counter to what I assume (hope) is the overall
>> goal: fully open data, restricted to no one for any purpose and thereby
>> supportive of collaborative development across both nonprofit and commercial
>> organizations. Given the EU's promotion of collaboration between research
>> and industry in their funding model, it would seem that this would be in the
>> interest of the official language bodies in Europe as well.
> Probably. We had this discussion with Kimmo Rossi in October, and he
> was very sympathetic to the idea, but if I recall correctly also
> somewhat skeptical as to whether this is realistic. One of the
> side-effects of the strong research-industry ties required for EU
> projects is that participating companies may have claims on any
> resulting products, including corpora, lexicons, etc. One
> justification is that the projects are (obligatorily) co-funded by
> these companies. Of course, researchers and EU officials would like to
> see publicly funded research to lead to resources that are open, but
> there is no way to legally enforce it. At best, one may appeal to the
> funding agencies to include shareability of resulting resources in
> their call descriptions, and to consider explicit statements on
> openness, etc. as an evaluation criterion, but that would be basically
> it.
>
> With corpora, we have an additional problem with copyright issues, and
> this is probably the reason why even academic projects to create
> reference corpora are hesitant to release their data under open
> licenses. For some of the reference corpora for German, for example,
> copyright agreements allow only small parts of the corpus to be
> released to the public (for example, of the 1.8 billion-token corpus
> compiled at the Berlin-Brandenburg Academy, less than 10% -- only the
> Kernkorpus can be queried online,
> http://www.dwds.de/ressourcen/korpora/#part_4), and they are also
> restrictive as to the mode of presentation (online querying only,
> limited context view) -- otherwise, no agreement with the publishers
> would have been possible at all.  The situation for the "German Reference
> Corpus" (DEREKO) is similar, see diagram under
> http://www.ids-mannheim.de/kl/projekte/korpora/archiv.html#Umfang.
> This is not meant as criticism of the colleagues in Berlin or Mannheim;
> they're doing a great job, and I guess this was all that could be
> achieved (and -- maybe -- what seemed necessary) at the time when
> these agreements were negotiated.
>
> But unless these texts become public domain (which should take
> decades), they will *never* be published under an open license, as it
> would run counter to the business model of, say, newspaper archives, and
> thus cause economic harm. (And that's where our politicians and the
> lobby behind would be extremely sensitive. As an anecdote on how
> sensitive, one may think of last year's regulations that forced the
> German state-owned television channels to limit the availability of
> news messages in their online archives to no more than 3 years -- this
> was a conscious decision to allow private broadcasters to establish
> commercial news archives, even though it meant crippling public services
> without any need, and actually causing additional efforts -- and costs
> -- to adjust the infrastructures accordingly.)
>
> It might be possible, however, to persuade the copyright holders to agree
> to an academic license, or to adjust current copyright laws to establish
> more flexible guidelines for academic use of these resources. I think
> the latter is one of the goals behind the initiative.
>
> So, for this particular field, an academic license is all we can
> realistically hope for. I also see the danger of establishing a practice
> of restrictive research licenses, though. The dilemma is somewhat
> unresolvable. The only thing we can do is to provide guidelines and to
> formulate desiderata which license to use under which circumstances.
>
> All the best,
> Christian
>
>> A look at the impact of the promotion of GNU (copyleft) and "share-alike"
>> licenses makes my point: promotion of these licenses as the "good citizen's
>> license" has had a subtle but pervasive impact on software and data
>> licensing, in that these licenses are at this point the de facto licenses of
>> choice. Unfortunately, such licenses are often not suitable for commercial
>> use because of the requirement to distribute results under the same terms.
>> So what we have is a grass roots effort to be open that in fact has had the
>> result of obstructing full openness. I fear that the promotion of the even
>> stronger research-only restriction will have a similar, and even more
>> damaging, effect.
>>
>> I would recommend promotion of something like the Apache 2.0 license
>> (http://www.apache.org/licenses/LICENSE-2.0), even if (as pointed out in the
>> note below) it is not likely that such a license would be acceptable in this
>> instance. That would send the message that a fully open license is what the
>> community feels is the good citizen's choice, in that it supports
>> collaborative development among both nonprofit and commercial organizations.
>> If there cannot be agreement to adopt this type of licensing, so be it, but
>> the message from the community should be clear about what we see as the
>> ideal.
>>
>> Nancy Ide
>>
>> =======================================================
>> Nancy Ide
>> Professor of Computer Science
>>
>> Department of Computer Science
>> Vassar College
>> Poughkeepsie, New York 12604-0520
>> USA
>>
>> tel: (+1 845) 437 5988
>> fax: (+1 845) 437 7498
>> email: ide at cs.vassar.edu
>> http://www.cs.vassar.edu/~ide
>> =======================================================
>>
>> On Dec 3, 2012, at 7:25 PM, Christian Chiarcos <christian.chiarcos at web.de>
>> wrote:
>>
>> Dear all,
>>
>> as most of us probably know, a number of "reference corpora" for major (and
>> minor) languages of Europe have been produced in the last decades, but
>> many of them are not fully available to the public (not even under a
>> restrictive license), or available in a snippet view on the web only (and
>> hence unusable for NLP or advanced statistical analyses), -- not to mention
>> open licenses.
>>
>> To address this issue, META-NET have prepared an open letter to all the
>> official language bodies in Europe and to those holding onto the various
>> corpora calling on them to consider trying to make this important language
>> data available for research purposes.  If you feel that there is a huge
>> benefit to liberating these corpora and making them available for research
>> then please contact your local language body and let them know that you are
>> in favour of the META-NET proposal.
>>
>> More on this can be found on our blog, in a recent post by John Judge,
>> META-NET Ireland
>> (http://linguistics.okfn.org/2012/11/19/meta-net-data-liberation-campaign/,
>> from where the last paragraph was quoted). I'd like to thank John for
>> replicating his original post there and hope this initiative receives some
>> support from the OWLG.
>>
>> Certainly, making these resources available under a research license would
>> not be sufficient in the eyes of many on the list, but it would definitely
>> be an important (and more easily achievable) step towards the further
>> liberation of linguistic data.
>>
>> Thank you,
>> Christian
>> --
>> Christian Chiarcos
>> Information Sciences Institute
>> University of Southern California
>> 4676 Admiralty Way #1001
>> Marina del Rey, CA 90292
>> tel: +1-310-448-9391
>> fax: +1-310-448-8599
>> http://purl.org/chiarcos/home
>> chiarcos at isi.edu
>>
>> _______________________________________________
>> open-linguistics mailing list
>> open-linguistics at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/open-linguistics
>> Unsubscribe: http://lists.okfn.org/mailman/options/open-linguistics




