[open-linguistics] Linked Open Data and Endangered Languages
Christian Chiarcos
chiarcos at informatik.uni-frankfurt.de
Tue Oct 20 11:06:38 UTC 2015
Hi Damir,
thank you very much for this brief overview of GORILLA. It is certainly a
highly valuable piece in the language resource landscape, and with the
possibility of hosting open data, it also provides a great stimulus for
publishing language resources under non-NC licenses.
GORILLA also comes at a critical point in time, as existing European
infrastructures for endangered languages are in transition, e.g., the TLA
tools at the MPI Nijmegen. Consider, for example, the lexicon portal
lexus (https://tla.mpi.nl/tools/tla-tools/older-tools/lexus/): it has
been impossible to register new users in the system for about a year, and
accordingly, it is no longer possible to provide access to certain
resources. For these (or their providers), GORILLA could represent a great
alternative.
Please let us know whether any input from the side of this community would
be helpful for future GORILLA development, e.g., regarding preferred
formats and tool support, whether you need assistance for developing
converters or an (L)LOD interface, or recommendations with respect to
other technological aspects.
Of course, the question of recommended formats is quite an urgent one in
this context, and if I understand correctly, you do not want to create
yet another generic formalism. But when focusing on established formats,
the choice depends on what you want to achieve. For merely depositing
language resources, a restricted choice of formats with offline tool
support is a great solution. To mention one format not on your list below,
Toolbox is perfectly fine for this purpose.
If you want to go beyond this in the longer term, however, e.g., by
encouraging an (L)LOD conversion for selected resources, or by providing
online search in these language resources, it is important to
focus on formats which can be converted seamlessly with little to no
manual intervention. Here, Toolbox is problematic, as it is known for
encoding issues in its output format and for the difficulty of aligning
morphological units on different layers of interlinear glosses. Depending
on the type of data, similar issues with interpreting the structure of a
format can arise with ELAN (judging from our experiences with converting
the Old German section of the German Reference Corpus,
http://www.deutschdiachrondigital.de/). TEI has problems, too, most
notably complexity, ambiguity and redundancy, so, if convertibility is an
issue, it would be advisable to focus on or to design a specific TEI
sublanguage. In fact, most current formats do have interoperability
issues, and we need to discuss and document advantages and disadvantages
of using one or the other before promoting one of these to yet another
de-facto standard.
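To make the alignment problem with interlinear glosses a bit more concrete,
here is a minimal sketch in Python. It is not how Toolbox or any converter
actually works; it assumes a simplified, hypothetical convention in which
words are separated by whitespace and morphemes within a word by hyphens,
and the function name align_layers and the sample data are purely
illustrative. It only shows the kind of consistency check a converter needs
before morphemes and glosses can be paired automatically.

```python
def align_layers(morph_line, gloss_line):
    """Pair each morpheme with its gloss.

    Assumes (hypothetically) whitespace-separated words and
    hyphen-separated morphemes on both layers. Returns a list of
    words, each a list of (morpheme, gloss) pairs, or None if the
    two layers cannot be aligned one-to-one.
    """
    morph_words = morph_line.split()
    gloss_words = gloss_line.split()
    # The layers must contain the same number of words.
    if len(morph_words) != len(gloss_words):
        return None
    aligned = []
    for morph_word, gloss_word in zip(morph_words, gloss_words):
        morphs = morph_word.split("-")
        glosses = gloss_word.split("-")
        # Each word must contain the same number of units on both layers.
        if len(morphs) != len(glosses):
            return None
        aligned.append(list(zip(morphs, glosses)))
    return aligned


# Illustrative data only: a well-formed pair of layers ...
good = align_layers("kitap-lar-ım ev-de", "book-PL-1SG.POSS house-LOC")
# ... and one where a gloss is missing, so no automatic alignment is possible.
bad = align_layers("kitap-lar-ım ev-de", "book-PL house-LOC")
```

Whenever such a check fails, the resource needs manual correction before
conversion, which is exactly the kind of effort one would like to avoid.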
What I would like to suggest is to initiate a discussion with respect to
which formats to recommend (not necessarily now, but as soon as first data
sets are available, and certainly not exclusively to this mailing list),
and a good way to do so would be to present the platform and its goals as
soon as a minimal level of maturity is achieved, on our blog/website or at
one of our workshops (e.g., LDL-2016). Please let me know what you think ;)
Thanks a lot,
Christian
On .10.2015, at 16:43, Damir Cavar <dcavar at me.com> wrote:
> Hi there,
>
> we are right now setting up the environment and tools to launch
> something like that, a linked repository for language data and corpora,
> covering, among others, endangered language data (processed using, e.g.,
> ELAN, Praat, or Flex). On GORILLA
>
> http://gorilla.linguistlist.org/about/
>
> we will set up our own data for a number of languages, i.e.
> time-aligned, tagged and translated ELAN files (and corresponding audio
> recordings), transcoded and augmented Praat files, and plain TIMIT-type
> of speech corpus files etc. with acoustic and language models for
> training of Forced Aligners or common ASRs. This is aiming at corpora as
> well as speech and language technologies for endangered and
> low-resourced languages, and it will serve also as a repository for
> language documentation data in general. Two of the corpora that we created
> this summer are from endangered languages, a bunch more are from
> low-resourced ones. Since we are busy setting up some of the technology,
> it might still take us a month or two to come up with the first data and
> catalog.
>
> There are various possibilities. We could host people's data (maybe for
> the time being, till you establish a mirror), attach CMDI
> meta-information to it, assign it DOIs, and put it up in a linked
> catalog that also connects to the CLARIN (and OAI) infrastructure (we
> did not set this up yet, just solved the DOI issue right now), and then
> we can generate the RDF and so on.
>
> We will come up with more tutorials and material on our site that might
> be helpful for you to set up your own repository, maybe mirror ours, so
> that we have cross-continental redundancy and backups etc.
>
> Our requirements are that the data is at least CC BY-SA (or CC BY, but
> we would not accept an NC clause), if we are to invest in the above
> described process of meta-information and linking. We can take any kind
> of data into the archive, though. Our GORILLA goal is to document as
> many languages as possible using audio or video recordings,
> transcription, phonetic or phonemic transcription, PoS-tagging,
> translation. We are using a uniform standard (or two, e.g. ELAN
> annotation format, Praat TextGrid, TEI XML) and interoperable linguistic
> tags or annotations (linking to GOLD and/or ISOCat).
>
> Ritesh, I do not know whether I can make it to Agra in February, but
> I will be somewhere closer to there next spring. We could talk about
> that over a private line.
>
> Best wishes
>
> Damir
--
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany
office: Robert-Mayer-Str. 10, #401b
mail: chiarcos at informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931