[open-linguistics] Defining "Openness" for Linguistic Linked Open Data

Tue Jan 16 12:32:20 UTC 2018

Dear all,

when we first began developing the Linguistic Linked Open Data cloud  
diagram, we followed a highly permissive approach on criteria for  
inclusion, with the idea to move it from an abstract vision to a set of  
actually usable resources -- in fact the first versions of the diagram  
(before the MLODE workshop in September 2012) are explicitly referred to  
as "drafts" because we included resources whose conversion to LOD had only  
been *promised* at the time.

However, the quality criteria have been continuously enforced since then.  
This includes availability, size, number of links, and an explicit  
definition of linguistic relevance as an entry criterion, so that these  
are now roughly equivalent to the LOD criteria.

Along with that, we did *not* enforce an Open Definition-conformant  
license (http://opendefinition.org/licenses/). In particular, arguments  
have been brought forward to include non-commercial resources. One of the  
reasons is that many classical resources developed during the 1990s and  
early 2000s are released under "academic" licenses and that even today,  
entire sub-communities in linguistics tend to be very protective about  
their data. Encouraging noncommercial licenses is a viable compromise to  
reach out to these communities without abandoning the idea of openness  
altogether. We did have discussions about this from the very  
beginning, and there are good arguments for either view, but we did *not*  
manage to establish a consensus to exclude, in particular, NC-licensed  
data.

For the moment, openness is (implicitly) defined as being in line with the  
LOD diagram, i.e., we inherit its view that "we take a liberal view of  
what we consider “open”. If the data is openly accessible from a network  
point of view – that is, it's not behind an authorization check or  
paywall" (http://lod-cloud.net/). This approach can be criticized for good  
reasons, but it is an established and transparent practice that goes back  
to the original LOD diagram by Cyganiak and Jentzsch, and that has also  
been documented since then.

Part of this documentation is that under  
http://linguistic-lod.org/llod-cloud, users can get an alternative  
visualization of the diagram with respect to licenses, and as can be  
easily seen, about half of the LLOD bubbles are non-commercial, three have  
no explicit license (which means a restrictive license, in Germany, at  
least), and three more are labeled as "closed" (which may in fact mean  
that different sub-resources have different licenses, e.g.,  
Multext-East [http://nl.ijs.si/ME/V4/], which includes CC-BY-SA and  
CC-BY-NC lexica as well as corpus data under a restricted/non-commercial  
license).

However, this can be a problem for data providers who find their NC data  
in the (L)LOD diagram although it is not "Open" according to the Open  
Definition, as users of this data may get a wrong impression about their  
usage rights -- despite warnings such as "Before using any data, you  
should always check the publisher's website for the terms and conditions"  
(http://lod-cloud.net/).

The question now is what to do about this situation. Personally, I would  
prefer to roughly stay with the current practice for the LOD and LLOD  
diagrams for the moment, but to provide an explicit statement that *our*  
definition of openness extends beyond the Open Definition by including  
non-commercial/"academic" resources, because this is an explicit need in  
(parts of) our community. At the same time, given such a statement,  
resources with unclear (= restrictive) licenses should be removed from the  
diagram. As these are quantitatively marginal anyway, this should not  
affect the usability of LLOD resources and the diagram in comparison to  
its current state.
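
To make the proposed rule concrete, here is a minimal sketch in Python. The  
category names, the function stays_in_diagram, and the sample resources are  
all made up for illustration; they do not correspond to the actual metadata  
behind the diagram.

```python
from enum import Enum


class LicenseCategory(Enum):
    OPEN = "open"                      # conformant with the Open Definition
    NON_COMMERCIAL = "non-commercial"  # NC or "academic" licenses
    UNCLEAR = "unclear"                # no explicit license, treated as restrictive


# Categories that would remain in the diagram under the proposal above.
ACCEPTED = {LicenseCategory.OPEN, LicenseCategory.NON_COMMERCIAL}


def stays_in_diagram(category: LicenseCategory) -> bool:
    """Return True if a resource with this license category would be kept."""
    return category in ACCEPTED


# Hypothetical resources, for illustration only.
resources = {
    "resource-a": LicenseCategory.OPEN,
    "resource-b": LicenseCategory.NON_COMMERCIAL,
    "resource-c": LicenseCategory.UNCLEAR,
}

kept = sorted(name for name, cat in resources.items() if stays_in_diagram(cat))
removed = sorted(name for name, cat in resources.items() if not stays_in_diagram(cat))
print("kept:", kept)
print("removed:", removed)
```

Resources labeled "closed" are left out of this sketch on purpose, since that  
label may hide a mix of licenses and would need a case-by-case check.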

In any case, this is for the immediate future only. At some point in the  
future, after intense lobbying among our peers and (hopefully) growing  
importance of OpenDefinition-compliant licenses, we should certainly adopt  
a stricter definition, but for the moment, growing the number of resources,  
demonstrating their use and developing applications of (L)LOD should --  
IMHO -- take priority over ideological purity until (L)LOD is established  
as a conventional approach for (certain kinds of) linguistic data.

This may be controversial, though, so, what do others think?

Best,
Christian
-- 
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 10, #401b
mail: chiarcos at informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931