[open-linguistics] Inclusion of 'non-open' resources in the LLOD cloud diagram

Wed Sep 9 09:06:26 UTC 2015

Dear all,

first of all thanks to John for summarizing our discussion. Some of these  
suggestions are controversial, indeed, so we should have a discussion as  
broad as possible. For me personally, it is not so much a question whether  
proprietary resources should be removed from the diagram, but when. This  
does not have to be now, but the diagram is populated to a degree that it  
might benefit from some pruning (just in case we cannot convince *every*  
data provider to conform to the enforced inclusion criteria -- which is  
what we are hoping for, of course).

> 1. Up until now we have allowed a small number of 'closed' resources
> into the LLOD cloud diagram although they do not have an open license.

The rationale behind this proposal is to enforce quality criteria. We have
done so in the past for the diagram: we started with a very low entry
barrier (for the LREC-2012 "draft" -- not a proper LLOD diagram yet -- it  
was just "promise to provide a LOD edition at some point" !), but the goal  
has been to create something substantial along the lines of the LOD  
diagram, and raising the entry barrier as soon as we had a critical mass  
of higher-quality resources brought us pretty far in this direction.
In the context of the Open Knowledge Foundation (and that's where the "O"  
in OWLG and LLOD comes from), "open" is defined by the Open Definition.  
For the needs of the language resource community, a broader definition may  
be more applicable (see 2), but the general goal remains to promote open  
resources in that sense. The idea is not so much to remove resources, but  
rather to provide an impetus for data providers to consider open licenses  
for current LLOD resources.
This does not mean that, eventually, all resources in the LLOD diagram
need to be compliant with the Open Definition. We are free to come up with
our own, community-specific definition of openness, but this requires a
consensus (which requires a discussion). I'm perfectly happy to follow
Victor's proposal to define open as "using open standards" if that's the
emerging consensus. But I agree that this means watering down an
established definition, and we'll probably not get a lot of support for
doing so.

At the same time, I would also like to see proprietary resources staying  
linked to the LLOD diagram (or a version of it), but marked in a special  
way and such that the truly open resources in the LLOD diagram can be  
visualized on their own without leaving immense gaps between open bubbles
(and without re-positioning). The current visualization is somewhat  
misleading in this regard -- the primary colors are for types of language  
resources. Even though Victor's work on licenses is provided in an  
appealing alternative visualization, we cannot currently visualize both  
types of information at the same time.

IMHO, the ideal visualization would be to have a LLOD diagram proper, and  
an extended L(O)LD diagram which takes the (visually unaltered) LLOD  
diagram at its core ("the inner circle" of language resources), but groups  
proprietary resources around it ("the outer circle" of language  
resources). I think I've seen something like that on one of Victor's  
slides at some point. Do you think we could extend the script to produce  
something like this?

> 1.a Currently, there are four resources that are included with a closed
> license, and as they are not referred to by other datasets I propose we
> remove them.

I tend to agree, but this seems quite ad hoc. Another way of having an  
extended LLOD cloud (with some proprietary resources included) would be to  
include proprietary resources if they are pointed to from open (in the
sense of the Open Definition) resources. This would allow BabelNet to
remain in. But if we do, we should state this as an additional rule for  
inclusion. However, there is potential for abusing such a rule.
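A minimal sketch of how such a rule could be evaluated, assuming each
dataset record carries a flag for Open Definition compliance and a list of
the datasets it links to. The class, the function and the dataset names are
hypothetical illustrations, not part of the existing diagram script or the
actual metadata:

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    is_open: bool  # assumption: True if the license conforms to the Open Definition
    links_to: set = field(default_factory=set)  # names of datasets this one points to


def split_circles(datasets):
    """Return (inner, outer).

    inner: names of all open datasets.
    outer: names of non-open datasets that at least one open dataset links to.
    Non-open datasets without such an incoming link end up in neither set.
    """
    inner = {d.name for d in datasets if d.is_open}
    referenced_by_open = set()
    for d in datasets:
        if d.is_open:
            referenced_by_open |= d.links_to
    outer = {
        d.name
        for d in datasets
        if not d.is_open and d.name in referenced_by_open
    }
    return inner, outer


# Hypothetical toy data, not the real LLOD metadata.
example = [
    Dataset("open-lexicon", True, {"closed-net"}),
    Dataset("closed-net", False, {"open-lexicon"}),
    Dataset("closed-orphan", False),
]
inner, outer = split_circles(example)
```

The sketch also shows where the rule can be abused: a single incoming link
from any open dataset is enough to admit a proprietary resource, so a
stricter variant might require links from several independent providers.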

> 2. There are a number of resources with restrictive licenses like
> CC-BY-NC and CC-BY-SA. In the interest of showing a wide acceptance of
> resources in the LLOD community, I move we continue to accept any CC
> (or equivalent) licensed resource as 'open.' Should we add a particular
> visual indicator to these resources in future version of the cloud?

As much as I support removing proprietary resources (just because
it's misleading to have them in an LLOD diagram), I'd like to keep
CC-BY-NC and -SA included (for the moment). A while ago we had some  
discussion about "academic" licenses, and back then, many people seemed to  
be in favor of admitting these. CC-BY-NC can be seen as a rough  
approximation of the more informal "academic" and we can "sell" it as such  
to data providers. In addition, CC-BY-NC is actually a way for current  
proprietary resources to remain in the diagram: By publishing (a subset?)  
of their resource under an NC license, they prevent interference with  
their commercial interests but can remain in the diagram. At some later  
point in time, however, we may discuss whether an even stricter definition  
of openness should be adopted, and this might include moving from CC-BY-NC  
or -SA to only CC-BY-SA and maybe even further. It depends on the input we  
get from the community. Ideas? Impressions?

> 3. Some resources do not have a license in the metadata. From the next
> version of the diagram I propose they are not included in the diagram,
> this affects: PDEV-lemon, GOLD, CLLD-GLOTTOLOG, Zhishi.me, DBpedia-de
> and xLiD-Lexica. (Affected parties please add a license to the record on
> http://datahub.io)

Again, this is not so much about removing resources as it is about
motivating data providers to include meaningful metadata. I would fully
support that, but we need to learn about possible reasons for data
providers not doing
so. There may be bureaucratic reasons for not giving resources an explicit  
license declaration but only informal statements such as "we support open  
licenses" given in personal communication or as a footnote in the  
documentation. The latter seems to be the case for GOLD and ISOcat -- both  
extraordinarily valuable for linguists and NLP people alike.

> 4. A particular point of interest was raised about BabelNet in the last
> telco, in particular to recent changes, such as no longer offering
> direct download of the whole up-to-date dataset. The criteria for
> inclusion in the LLOD cloud are given here and as BabelNet offers SPARQL
> access it still qualifies for inclusion in the LLOD cloud. Assuming we
> are not changing the guidelines I see no reason to remove BabelNet from
> the LLOD cloud.

I didn't check the endpoint, but is this the same data? If not, it should
get a version number to avoid confusion. But I'm curious what others think
about BabelNet. Does the endpoint have limits on the number of queries or
the size of the result set? If so, how shall we deal with such technically
semi-restricted resources?

Just my 2 cents (well, maybe 0.50€ ;)

Best,
Christian