[open-linguistics] Linked Open Data and Endangered Languages

Tue Oct 20 11:06:38 UTC 2015

Hi Damir,

thank you very much for this brief overview of GORILLA. It is certainly a
highly valuable piece in the language resource landscape, and with the
possibility of hosting open data, it also provides a great stimulus for
publishing language resources under non-NC licenses.

GORILLA also comes at a critical point in time, when existing European
infrastructures for endangered languages are in transition, e.g., the TLA
tools at the MPI Nijmegen. Consider, for example, the lexicon portal
lexus (https://tla.mpi.nl/tools/tla-tools/older-tools/lexus/): it has been
impossible to register new users with the system for about a year, and
accordingly, it is no longer possible to provide access to certain
resources. For these (or their providers), GORILLA could represent a great
alternative.

Please let us know whether any input from the side of this community would  
be helpful for future GORILLA development, e.g., regarding preferred  
formats and tool support, whether you need assistance for developing  
converters or an (L)LOD interface, or recommendations with respect to  
other technological aspects.

Of course, the question of recommended formats is quite an urgent one in
this context, and if I understand correctly, you do not want to create
yet another generic formalism. But when focusing on established formats,
the choice depends on what you want to achieve. For merely depositing  
language resources, a restricted choice of formats with offline tool  
support is a great solution. To mention one format not on your list below,  
Toolbox is perfectly fine for this purpose.

If you want to go beyond this in the longer term, however, e.g., by
encouraging an (L)LOD conversion for selected resources, or by providing
online search in these language resources, it is important to
focus on formats which can be converted seamlessly with little to no
manual intervention. Here, Toolbox is problematic, as it is known for
encoding issues in its output format and for the difficulty of aligning
morphological units on different layers of interlinear glosses. Depending
on the type of data, similar issues with interpreting the structure of a
format can arise with ELAN (judging from our experiences with converting
the Old German section of the German Reference Corpus,
http://www.deutschdiachrondigital.de/). TEI has problems, too, most
notably complexity, ambiguity and redundancy, so, if convertibility is an
issue, it would be advisable to focus on or to design a specific TEI
sublanguage. In fact, most current formats do have interoperability
issues, and we need to discuss and document the advantages and disadvantages
of using one or the other before promoting one of these to yet another
de-facto standard.
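
To make the Toolbox alignment problem a bit more concrete, here is a
minimal sketch in Python. It assumes that interlinear layers are aligned
purely by column position, and that the columns may have been padded by
counting bytes rather than characters. The two lines and all function
names are invented for illustration; this is not Toolbox's actual export
logic, just a toy model of why a converter can attach a gloss to the wrong
morpheme.

```python
import re

def tokens_with_offsets(line, unit="chars"):
    """Return (start_column, token) pairs for whitespace-separated tokens.

    unit="chars" counts columns in Unicode characters,
    unit="bytes" counts columns in UTF-8 bytes.
    """
    if unit == "bytes":
        data = line.encode("utf-8")
        return [(m.start(), m.group().decode("utf-8"))
                for m in re.finditer(rb"\S+", data)]
    return [(m.start(), m.group()) for m in re.finditer(r"\S+", line)]

def align_by_columns(upper, lower, unit="chars"):
    """Pair each token of `lower` with the last token of `upper` that
    starts at or before the same column."""
    upper_toks = tokens_with_offsets(upper, unit)
    pairs = []
    for start, tok in tokens_with_offsets(lower, unit):
        candidates = [u for u in upper_toks if u[0] <= start]
        head = candidates[-1][1] if candidates else None
        pairs.append((head, tok))
    return pairs

# Invented toy layers (hypothetical data, not from a real corpus).
# The lower line was padded so that "X" sits under "cd" when columns
# are counted in bytes ("é" occupies two bytes in UTF-8).
upper = "ab   cd"
lower = "éé X"

print(align_by_columns(upper, lower, unit="bytes"))
print(align_by_columns(upper, lower, unit="chars"))
```

Read with byte columns, "X" is attached to "cd"; read with character
columns, it ends up under "ab". A converter has to guess which convention
a given file follows, which is the kind of manual intervention I would
like to avoid.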

What I would like to suggest is to initiate a discussion about
which formats to recommend (not necessarily now, but as soon as the first data
sets are available, and certainly not exclusively on this mailing list).
A good way to do so would be to present the platform and its goals, as
soon as a minimal level of maturity is achieved, on our blog/website or at
one of our workshops (e.g., LDL-2016). Please let me know what you think ;)

Thanks a lot,
Christian

On .10.2015, at 16:43, Damir Cavar <dcavar at me.com> wrote:

> Hi there,
>
> we are right now setting up the environment and tools to launch
> something like that, a linked repository for language data and corpora,
> covering, among others, endangered language data (processed using, e.g.,
> ELAN, Praat, or FLEx). On GORILLA
>
> http://gorilla.linguistlist.org/about/
>
> we will set up our own data for a number of languages, i.e.
> time-aligned, tagged and translated ELAN files (and corresponding audio
> recordings), transcoded and augmented Praat files, plain TIMIT-type
> speech corpus files, etc., with acoustic and language models for
> training forced aligners or common ASR systems. This is aimed at corpora
> as well as speech and language technologies for endangered and
> low-resourced languages, and it will also serve as a repository for
> language documentation data in general. Two of the corpora that we
> created this summer are from endangered languages, and a bunch more are
> from low-resourced ones. Since we are busy setting up some of the technology,
> it might still take us a month or two to come up with the first data and
> catalog.
>
> There are various possibilities. We could host people's data (maybe for
> the time being, till you establish a mirror), attach CMDI
> meta-information to it, assign it DOIs, and put it up in a linked
> catalog that also connects to the CLARIN (and OAI) infrastructure (we
> did not set this up yet, just solved the DOI issue right now), and then
> we can generate the RDF and so on.
>
> We will come up with more tutorials and material on our site that might
> be helpful for you to set up your own repository, maybe mirror ours, so
> that we have cross-continental redundancy and backups etc.
>
> Our requirement is that the data is at least CC BY-SA (or CC BY, but
> we would not accept an NC clause), if we are to invest in the
> above-described process of meta-information and linking. We can take any kind
> of data into the archive, though. Our GORILLA goal is to document as
> many languages as possible using audio or video recordings,
> transcription, phonetic or phonemic transcription, PoS-tagging,
> translation. We are using a uniform standard (or two, e.g. ELAN
> annotation format, Praat TextGrid, TEI XML) and interoperable linguistic
> tags or annotations (linking to GOLD and/or ISOcat).
>
> Ritesh, I do not know whether I can make it to Agra in February, but
> I will be somewhere closer to there in spring next year. We could talk about
> that over a private line.
>
> Best wishes
>
> Damir
-- 
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 10, #401b
mail: chiarcos at informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931