[open-linguistics] Inclusion of 'non-open' resources in the LLOD cloud diagram
Christian Chiarcos
chiarcos at informatik.uni-frankfurt.de
Wed Sep 9 09:06:26 UTC 2015
Dear all,
first of all thanks to John for summarizing our discussion. Some of these
suggestions are controversial, indeed, so we should have a discussion as
broad as possible. For me personally, the question is not so much whether
proprietary resources should be removed from the diagram, but when. This
does not have to be now, but the diagram is populated to a degree that it
might benefit from some pruning (just in case we cannot convince *every*
data provider to conform to the enforced inclusion criteria -- which is
what we are hoping for, of course).
> 1. Up until now we have allowed a small number of 'closed' resources
> into the LLOD cloud diagram although they do not have an open license.
The rationale behind this proposal is to enforce quality criteria. We have
done so in the past for the diagram: we started with a very low entry
barrier (for the LREC-2012 "draft" -- not a proper LLOD diagram yet -- it
was just "promise to provide a LOD edition at some point"!), but the goal
has been to create something substantial along the lines of the LOD
diagram, and raising the entry barrier as soon as we had a critical mass
of higher-quality resources brought us pretty far in this direction.
In the context of the Open Knowledge Foundation (and that's where the "O"
in OWLG and LLOD comes from), "open" is defined by the Open Definition.
For the needs of the language resource community, a broader definition may
be more applicable (see 2), but the general goal remains to promote open
resources in that sense. The idea is not so much to remove resources, but
rather to provide an impetus for data providers to consider open licenses
for current LLOD resources.
This does not mean that, eventually, all resources in the LLOD diagram
need to be compliant with the Open Definition. We are free to come up with
our own, community-specific definition of openness, but this requires a
consensus (which requires a discussion). I'm perfectly happy to follow
Victor's proposal to define open as "using open standards" if that's the
emerging consensus. But I agree that this means watering down an
established definition and we'll probably not get a lot of support for
doing so.
At the same time, I would also like to see proprietary resources staying
linked to the LLOD diagram (or a version of it), but marked in a special
way and such that the truly open resources in the LLOD diagram can be
visualized on their own without leaving immense gaps between open bubbles
(and without re-positioning). The current visualization is somewhat
misleading in this regard -- the primary colors are for types of language
resources. Even though Victor's work on licenses is provided in an
appealing alternative visualization, we cannot currently visualize both
types of information at the same time.
IMHO, the ideal visualization would be to have a LLOD diagram proper, and
an extended L(O)LD diagram which takes the (visually unaltered) LLOD
diagram at its core ("the inner circle" of language resources), but groups
proprietary resources around it ("the outer circle" of language
resources). I think I've seen something like that on one of Victor's
slides at some point. Do you think we could extend the script to produce
something like this?
> 1.a Currently, there are four resources that are included with a closed
> license, and as they are not referred to by other datasets I propose we
> remove them.
I tend to agree, but this seems quite ad hoc. Another way of having an
extended LLOD cloud (with some proprietary resources included) would be to
include proprietary resources if they are pointed to from open (in the
sense of the Open Definition) resources. This would allow BabelNet to
remain in. But if we do, we should state this as an additional rule for
inclusion. However, there is potential for abusing such a rule.
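To make the rule concrete, here is a minimal sketch of how a diagram script
could sort resources into an inner circle (open), an outer circle
(proprietary, but pointed to from an open resource) and excluded ones. This
is not the existing script: the Dataset class, the classify function, the
license identifiers and the dataset names are all hypothetical and only
serve as an illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

# Assumption: illustrative identifiers for licenses that conform to the
# Open Definition; a real script would take these from its own metadata.
OPEN_DEFINITION_LICENSES = {"cc0", "cc-by", "cc-by-sa"}


@dataclass
class Dataset:
    """Hypothetical record for one bubble in the diagram."""
    name: str
    license: Optional[str]          # None means no license in the metadata
    links_to: Set[str] = field(default_factory=set)


def classify(datasets: List[Dataset]) -> Dict[str, str]:
    """Map each dataset name to 'inner', 'outer' or 'excluded'."""
    # Inner circle: resources under an Open Definition license.
    inner = {d.name for d in datasets if d.license in OPEN_DEFINITION_LICENSES}

    # Everything that at least one inner-circle resource points to.
    linked_from_open: Set[str] = set()
    for d in datasets:
        if d.name in inner:
            linked_from_open |= d.links_to

    result = {}
    for d in datasets:
        if d.name in inner:
            result[d.name] = "inner"
        elif d.name in linked_from_open:
            result[d.name] = "outer"    # non-open, but linked from open data
        else:
            result[d.name] = "excluded"
    return result


# Hypothetical example data, not actual LLOD metadata.
example = [
    Dataset("OpenLexicon", "cc-by-sa", {"ClosedNet"}),
    Dataset("ClosedNet", "proprietary"),
    Dataset("IsolatedCorpus", "proprietary"),
]
labels = classify(example)
```

In this sketch a single link from any open dataset is enough to move a
resource into the outer circle, which is exactly where such a rule could be
abused.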
> 2. There are a number of resources with restrictive licenses like
> CC-BY-NC and CC-BY-SA. In the interest of showing a wide acceptance of
> resources in the LLOD community, I move we continue to accept any CC
> (or equivalent) licensed resource as 'open.' Should we add a particular
> visual indicator to these resources in future version of the cloud?
As much as I support removing proprietary resources (just because
it's misleading to have them in an LLOD diagram), I'd like to keep
CC-BY-NC and -SA included (for the moment). A while ago we had some
discussion about "academic" licenses, and back then, many people seemed to
be in favor of admitting these. CC-BY-NC can be seen as a rough
approximation of the more informal "academic" and we can "sell" it as such
to data providers. In addition, CC-BY-NC is actually a way for current
proprietary resources to remain in the diagram: By publishing (a subset?)
of their resource under an NC license, they prevent interference with
their commercial interests but can remain in the diagram. At some later
point in time, however, we may discuss whether an even stricter definition
of openness should be adopted, and this might include moving from CC-BY-NC
or -SA to only CC-BY-SA and maybe even further. It depends on the input we
get from the community. Ideas? Impressions?
> 3. Some resources do not have a license in the metadata. From the next
> version of the diagram I propose they are not included in the diagram,
> this affects: PDEV-lemon, GOLD, CLLD-GLOTTOLOG, Zhishi.me, DBpedia-de
> and xLiD-Lexica. (Affected parties please add a license to the record on
> http://datahub.io)
Again, this is not so much about removing resources as it is about
motivating data providers to include meaningful metadata. I would fully
support that, but we need to learn about possible reasons for data
providers not doing
so. There may be bureaucratic reasons for not giving resources an explicit
license declaration but only informal statements such as "we support open
licenses" given in personal communication or as a footnote in the
documentation. The latter seems to be the case for GOLD and ISOcat -- both
extraordinarily valuable for linguists and NLP people alike.
> 4. A particular point of interest was raised about BabelNet in the last
> telco, in particular to recent changes, such as no longer offering
> direct download of the whole up-to-date dataset. The criteria for
> inclusion in the LLOD cloud are given here and as BabelNet offers SPARQL
> access it still qualifies for inclusion in the LLOD cloud. Assuming we
> are not changing the guidelines I see no reason to remove BabelNet from
> the LLOD cloud.
I didn't check the endpoint, but is this the same data? If not, it should
get a version number to avoid confusion. But I'm curious what others think
about BabelNet. Does the endpoint have limits on the number of queries or
the size of the result set? If so, how shall we deal with such technically
semi-restricted resources?
Just my 2 cents (well, maybe 0.50€ ;)
Best,
Christian