Hi everybody,

We have been thinking about this in connection with the
http://gorilla.linguistlist.org/about/ project, as well as GOLD,
MultiTree and LL-Map. We decided to go with CC-BY-SA for all data that
we host (or will host) on GORILLA (basically speech corpora, parallel
corpora and other resources for low-resourced languages, but also for
all other languages, and related speech and language technologies). One
of the reasons to drop the "no commercial use" component completely has
to do with the speaker communities of low-resourced languages and the
organizations that help them build resources like educational material,
technologies, etc. that help the communities, potentially even
economically. Many colleagues and research groups, though, do not want
to drop this requirement and explicitly exclude commercial use.

Now, taking this route and going with CC-BY-SA for all data requires
careful planning upfront about the data collection process, resources
used, and so on. It is only possible if carefully planned as part of the
entire project process. It might be difficult or in fact impossible to
release data as CC-BY-SA later if various things have not been taken
care of upfront. I might write this up one day and share my experience
with that (here in the US).

Needless to say, attribution and share-alike are fair elements that
should acknowledge the effort that the creators invested (sometimes
significant portions of their life for creating a speech corpus of 10
hours only). Independent of academic questions, this should also
become an ethical principle in the private sector and among private
individuals, which I do not think it is right now (in particular when we
agree to drop the 'no commercial use' restriction).

As Sebastian mentioned, I would not say that publicly funded projects
that create language data should be forced to release the data as CC-BY.
The reality is always more complicated, and besides all kinds of privacy
or security issues, there are also other reasons why researchers and
others do not want to release some data.

Usually research projects are funded with some research goals in mind.
In the past some research funds were used to create resources. This is
not necessarily the case anymore. Since resources (data, corpora) are
usually secondary byproducts and the real research questions are
primary, we cannot expect researchers to invest time and resources in
the preparation of material for public and open dissemination. And then,
if you have ever collected data in the field, you might know how messy
this can be. If you have to invest maybe 100 hours per recorded hour of
spoken language data to clean it up and prepare it for publication, the
question is, who will pay for that and who will have the time to do this.

The funding agencies are us. We are either involved in decision
processes or actively participating in the approval process, etc. So, there
is no need to lobby funding agencies to support open data dissemination.
We are all aware of the need to make more data available, and in fact
there are existing guidelines. As mentioned, everything comes with a
price-tag and other complications. What we want to, and maybe should,
push for is an awareness of the potential in sharing data: it could
maximize synergies between disciplines, boost tech development, maybe
even improve the private sector, and benefit communities and many other
fields and disciplines (the data creators are in fact rather unaware of
the large interest in their data from other disciplines, even across
internal divisions in funding agencies). There should be some sort of
platform
that allows people to express interest in data-sets so that the
awareness might grow.

> GOLD, on the other hand, is CC BY 3.0
> <http://linguistics-ontology.org/info/about>, so there is no problem
> in including it (someone should update the metadata).

We are working on a meta-data infrastructure and DOI-assignment for all
resources within the LINGUIST List cloud of projects. The different
resources that we can and want to share (basically all created over the
last two years or so, and from now on, see first paragraph above) will
be marked up better and linked with existing infrastructure. If you want
to help us, please let us know. In the coming months we want to have the
GORILLA resources (basically right now some collection of speech and
parallel corpora, some speech technology tools) and the GOLD data-sets
linked and made openly available online etc. There is a lot of back-end
development at LINGUIST and with the infrastructure of all these
resources. It will take a while, still. Please be patient with us.

Best wishes


Damir Cavar
Department of Linguistics
Indiana University
Co-director and Moderator of The LINGUIST List

