[open-linguistics] Linked Open Data and Endangered Languages

Damir Cavar dcavar at me.com
Tue Oct 20 14:28:49 UTC 2015


Hi Chris and everybody,

On 10/20/2015 07:06 AM, Christian Chiarcos wrote:

> GORILLA also comes at a critical point in time where existing European
> infrastructures for endangered languages are in transition, e.g., the
> TLA tools at the MPI Nijmegen. Considering, for example, the lexicon
> portal lexus (https://tla.mpi.nl/tools/tla-tools/older-tools/lexus/), it
> has been impossible to register new users to the system for about a
> year now, and accordingly, it is no longer possible to provide access to
> certain resources. For these (or their providers), GORILLA could
> represent a great alternative.

Many projects and institutions have to deal with the question of
sustainability and continued funding, not only for the development of
new technologies and tools, but also for the maintenance of existing
resources. This is a serious issue. We need to come up with models to
keep this important existing infrastructure alive.

> Please let us know whether any input from the side of this community
> would be helpful for future GORILLA development, e.g., regarding
> preferred formats and tool support, whether you need assistance for
> developing converters or an (L)LOD interface, or recommendations with
> respect to other technological aspects.

We welcome collaborators, volunteers, and helpers!

We have to work out documents that describe the recommendations or
requirements for the different parts of a data set (e.g.
meta-information, formats, RDF, etc.). For some aspects of language
documentation and technologies we maintain the EMELD
(http://emeld.org/) pages, but these are outdated and urgently need
updating. We might consider applying for funding to do exactly that,
i.e. to develop new documents that describe best practices in different
areas of technology for language documentation. Joint grant
applications with, e.g., European and US institutions are possible,
and as far as we at The LINGUIST List are concerned, they are desired;
in fact, inclusion of the global community would be ideal, but I know
little about joint grant opportunities with regions other than Europe.

While we are working on the back-end for managing all the digital data
and storage issues, we also want to develop web-based front-ends for
viewing, commenting, searching, and so on. This includes formats like
interlinear glossed text, time alignments over audio and video as
produced by ELAN or Praat, constituent and dependency trees in
treebanks, mappings of tags to some ontology or taxonomy for
terminology, etc. This means that we will try to integrate existing
open web-based tools, but also develop our own. The code of the
GORILLA site uses Python 3.x and Django 1.8 and is stored in a
Bitbucket repository. We might be able to add a certain number of
co-developers; we should talk about that.
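
To make that concrete, here is a minimal sketch of what a Django 1.8
model for deposited resources could look like. This is not the actual
GORILLA code; all class, field, and format names are hypothetical
illustrations:

    # Minimal sketch, not the actual GORILLA code; all names are
    # illustrative. Django 1.8, Python 3.
    from django.db import models

    class Language(models.Model):
        # ISO 639-3 code, e.g. "ydd" for Eastern Yiddish
        iso_code = models.CharField(max_length=3, unique=True)
        name = models.CharField(max_length=100)

        def __str__(self):
            return self.name

    class CorpusItem(models.Model):
        # One deposited resource: a recording, ELAN file, TextGrid, ...
        FORMAT_CHOICES = (
            ("eaf", "ELAN annotation file"),
            ("textgrid", "Praat TextGrid"),
            ("tei", "TEI XML"),
            ("wav", "audio recording"),
        )
        language = models.ForeignKey(Language)  # on_delete not required in 1.8
        title = models.CharField(max_length=200)
        data_format = models.CharField(max_length=20, choices=FORMAT_CHOICES)
        doi = models.CharField(max_length=100, blank=True)
        uploaded = models.DateTimeField(auto_now_add=True)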

> Of course, the question of recommended formats is quite an urgent one in
> this context, and if I understood correctly, you do not want to create
> yet another generic formalism. But when focusing on established formats,
> the choice depends on what you want to achieve. For merely depositing
> language resources, a restricted choice of formats with offline tool
> support is a great solution. To mention one format not on your list
> below, Toolbox is perfectly fine for this purpose.

We have a broader goal here with GORILLA. It is not just about storing
language documentation data. We are converting this data to common
corpus formats: the speech recordings and transcriptions we convert to
common speech corpus formats, and from those we extract data and models
to train different types of forced alignment tools or basic ASR
systems. We are developing computational morphologies for some of the
languages, basic CFGs, and even feature grammars that can be useful for
LFG parsers, etc. We will provide the corpora, data, and technologies
for the different languages on GORILLA, e.g. a first Yiddish speech
corpus of several hours with two functioning forced aligners and a
basic set of models for ASR development, and the same for Chatino,
Burmese, etc. The data is being PoS-tagged and translated word by word
and utterance by utterance (to English, and partially to German too).
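
To illustrate one conversion step: a forced aligner is trained on
(start, end, text) triples, which can be pulled straight out of an
ELAN .eaf file. Here is a minimal sketch in Python; the file name and
tier name are hypothetical:

    # Minimal sketch: extract time-aligned utterances from an ELAN .eaf
    # file. Tier and file names below are illustrative.
    import xml.etree.ElementTree as ET

    def read_eaf_tier(eaf_path, tier_id):
        """Return (start_ms, end_ms, text) triples for one ELAN tier."""
        root = ET.parse(eaf_path).getroot()
        # TIME_SLOT elements map symbolic slot IDs to millisecond offsets.
        slots = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
                 for ts in root.iter("TIME_SLOT")
                 if ts.get("TIME_VALUE") is not None}
        utterances = []
        for tier in root.iter("TIER"):
            if tier.get("TIER_ID") != tier_id:
                continue
            for ann in tier.iter("ALIGNABLE_ANNOTATION"):
                start = slots[ann.get("TIME_SLOT_REF1")]
                end = slots[ann.get("TIME_SLOT_REF2")]
                text = ann.findtext("ANNOTATION_VALUE", default="")
                utterances.append((start, end, text))
        return utterances

    # Usage, with a hypothetical file and tier name:
    # for start, end, text in read_eaf_tier("yi_001.eaf", "transcription"):
    #     print(start, end, text)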

The goals are, among others:

- to archive digital language data, and to develop an archive
infrastructure that allows for global and interconnected repositories
that can mirror each other, sync annotations and extensions, etc.;

- to bring the language data into a format that allows for
cross-linguistic and cross-level analyses and studies (which requires
some level of compatibility of data encoding and interoperable
annotation; see the sketch after this list), our larger research goal
being a snapshot of all languages that grows over time;

- to bring the data into various common formats for speech and language
technology tools and environments (enabling it as a training/testing
corpus, bootstrapping technologies for low-resourced or endangered
languages, and offering resources for common languages that are so far
inaccessible to economically challenged members of the global
community, researchers, or corpus- or NLP-interested people).
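
To make the interoperable-annotation point concrete, here is a minimal
sketch of exposing one PoS-tagged token as RDF, with the tag mapped to
a GOLD concept. The GOLD namespace is real; the data URIs are made up,
and the code assumes the rdflib package:

    # Minimal sketch with rdflib: map a corpus PoS tag to a GOLD concept.
    # All resource URIs under DATA are illustrative.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, RDFS

    GOLD = Namespace("http://purl.org/linguistics/gold/")
    DATA = Namespace("http://gorilla.linguistlist.org/data/")  # made up

    g = Graph()
    g.bind("gold", GOLD)

    token = DATA["yiddish/utt1/token3"]
    g.add((token, RDF.type, GOLD.Noun))   # corpus tag "N" -> gold:Noun
    g.add((token, RDFS.label, Literal("hunt", lang="yi")))  # 'dog'

    # serialize() returns a string in recent rdflib versions
    print(g.serialize(format="turtle"))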

The story is complicated and long. Hopefully we will get to a paper on
all of that, and maybe we will have an opportunity to present it some
time somewhere.

Right now we have started launching the initial infrastructure,
negotiating various necessary components, creating the first corpora
and data sets that we can present online, and working out documents and
legal arrangements at our hosting institution, etc.

> If you want to go beyond this in the longer perspective, e.g., by
> encouraging an (L)LOD conversion for selected resources, or by providing
> online search in these language resources, (...) have interoperability
> issues, and we need to discuss and document advantages and disadvantages
> of using one or the other before promoting one of these to yet another
> de-facto standard.
> 
> What I would like to suggest is to initiate a discussion with respect to
> which formats to recommend (not necessarily now, but as soon as first
> data sets are available, and certainly not exclusively to this mailing
> list), and a good way to do so would be to present the platform and its
> goals as soon as a minimal level of maturity is achieved, on our
> blog/website or at one of our workshops (e.g., LDL-2016). Please let me
> know what you think ;)

We are submitting some abstracts to the next LREC; we could have a
spontaneous workshop at the beach in Portoroz, should at least one of
them be accepted... :-) Or we could organize some alternative location
and environment for an LREC chat on that topic. We would be happy to
participate in LDL and discuss some of this.

We would also be happy to include some of you in a proposal, or to be
included in yours, applying for funding to support the development of
an infrastructure, resources, and a network of intercontinental mirrors
and networked systems for language data, etc.

Thanks!

All the best

Damir




-- 
Damir Cavar
Dept. of Linguistics, Indiana University
Co-director and Moderator of The LINGUIST List
https://linguistlist.org/people/damir_cavar.html



> On .10.2015 at 16:43, Damir Cavar <dcavar at me.com> wrote:
> 
>> Hi there,
>>
>> we are right now setting up the environment and tools to launch
>> something like that, a linked repository for language data and corpora,
>> covering among others also endangered language data (processed using
>> e.g. ELAN, Praat, or FLEx). On GORILLA
>>
>> http://gorilla.linguistlist.org/about/
>>
>> we will set up our own data for a number of languages, i.e.
>> time-aligned, tagged, and translated ELAN files (and corresponding
>> audio recordings), transcoded and augmented Praat files, and plain
>> TIMIT-type speech corpus files, etc., with acoustic and language
>> models for training forced aligners or common ASR systems. This aims
>> at corpora as well as speech and language technologies for endangered
>> and low-resourced languages, and it will also serve as a repository
>> for language documentation data in general. We created two corpora
>> from endangered languages this summer, and a bunch more from
>> low-resourced ones. Since we are busy setting up some of the
>> technology, it might still take us a month or two to come up with the
>> first data and a catalog.
>>
>> There are various possibilities. We could host people's data (maybe
>> for the time being, until you establish a mirror), attach CMDI
>> meta-information to it, assign it DOIs, and put it up in a linked
>> catalog that also connects to the CLARIN (and OAI) infrastructure (we
>> have not set this up yet; we just solved the DOI issue), and then we
>> can generate the RDF and so on.
>>
>> We will come up with more tutorials and material on our site that
>> might help you set up your own repository, and maybe mirror ours, so
>> that we have cross-continental redundancy and backups, etc.
>>
>> Our requirement is that the data is at least CC BY-SA (or CC BY, but
>> we would not accept an NC clause) if we are to invest in the process
>> of meta-information and linking described above. We can take any kind
>> of data into the archive, though. Our GORILLA goal is to document as
>> many languages as possible using audio or video recordings,
>> transcription, phonetic or phonemic transcription, PoS tagging, and
>> translation. We are using a uniform standard (or two, e.g. the ELAN
>> annotation format, Praat TextGrid, TEI XML) and interoperable
>> linguistic tags or annotations (linking to GOLD and/or ISOcat).
>>
>> Ritesh, I do not know whether I could make it to Agra in February,
>> but I will be somewhere closer to there next spring. We could talk
>> about that over a private line.
>>
>> Best wishes
>>
>> Damir


