[open-linguistics] Defining "Openness" for Linguistic Linked Open Data
chiarcos at informatik.uni-frankfurt.de
Tue Jan 16 12:32:20 UTC 2018
When we first began developing the Linguistic Linked Open Data cloud
diagram, we followed a highly permissive approach on criteria for
inclusion, with the idea to move it from an abstract vision to a set of
actually usable resources -- in fact the first versions of the diagram
(before the MLODE workshop in September 2012) are explicitly referred to
as "drafts" because we included resources whose conversion to LOD had only
been *promised* at the time.
However, the quality criteria have been continuously enforced since then.
This includes availability, size, number of links, and an explicit
definition of linguistic relevance as an entry criterion, so that these
are now roughly equivalent to the LOD criteria.
Along with that, we did *not* enforce an Open Definition-conformant
license (http://opendefinition.org/licenses/). In particular, arguments
have been brought forward to include non-commercial resources. One of the
reasons is that many classical resources developed during the 1990s and
early 2000s are released under "academic" licenses and that even today,
entire sub-communities in linguistics tend to be very protective about
their data. Accepting non-commercial licenses is a viable way to
reach out to these communities without abandoning the idea of
openness altogether. We did have discussions about this from the very
beginning, and there are good arguments for either view, but we did *not*
manage to establish a consensus to exclude, in particular, NC-licensed
resources.
For the moment, openness is (implicitly) defined as being in line with the
LOD diagram, i.e., we inherit its view that "we take a liberal view of
what we consider “open”. If the data is openly accessible from a network
point of view – that is, it's not behind an authorization check or
paywall" (http://lod-cloud.net/). This approach can be criticized for good
reasons, but it is an established and transparent practice that goes back
to the original LOD diagram by Cyganiak and Jentzsch, and that has also
been documented since then.
Part of this documentation is that under
http://linguistic-lod.org/llod-cloud, users can get an alternative
visualization of the diagram with respect to licenses, and as can be
easily seen, about half of the LLOD bubbles are non-commercial, three have
no explicit license (which means a restrictive license, in Germany, at
least), and three more are labeled as "closed" (which may in fact mean
that different sub-resources have different licenses, e.g.,
Multext-East [http://nl.ijs.si/ME/V4/], which includes CC-BY-SA and
CC-BY-NC lexica as well as corpus data under a restricted/non-commercial
license).
However, this can be a problem for data providers who find their NC data
in the (L)LOD diagram although it is not "Open" according to the Open
Definition, as users of this data may get a wrong impression about their
usage rights -- despite warnings such as "Before using any data, you
should always check the publisher's website for the terms and conditions".
The question now is what to do about this situation. Personally, I would
prefer to roughly stay with the current practice for the LOD and LLOD
diagrams for the moment, but to provide an explicit statement that *our*
definition of openness goes beyond the Open Definition by including
non-commercial/"academic" resources, because this is an explicit need in
(parts of) our community. At the same time, given such a statement,
resources with unclear (= restrictive) licenses should be removed from the
diagram. As these are quantitatively marginal anyway, this should not
affect the usability of LLOD resources and the diagram in comparison to
its current state.
In any case, this is for the immediate future only. At some point in the
future, after intense lobbying among our peers and (hopefully) growing
importance of Open Definition-compliant licenses, we should certainly adopt
a stricter definition, but for the moment, growing the number of resources,
demonstrating their use and developing applications of (L)LOD should --
IMHO -- take priority over ideological purity until (L)LOD is established
as a conventional approach for (certain kinds of) linguistic data.
This may be controversial, though, so, what do others think?
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany
office: Robert-Mayer-Str. 10, #401b
mail: chiarcos at informatik.uni-frankfurt.de