[open-linguistics] Linked Open Data and Endangered Languages

Dafydd Gibbon gibbon at uni-bielefeld.de
Tue Oct 20 16:57:43 UTC 2015


Hi Damir,

Thanks for putting me in the loop. Sounds like a useful way of 
addressing some worrying sustainability and interoperability issues. As 
you say, you do need to get stuff out there at LREC and in the LREV 
online journal to make the initiative more widely known and accepted.

Btw, wrt forced alignment, do you know Brigitte Bigi's tool SPPAS for 
annotation generation? Here (check the 'Libre Software' button):
http://www.lpl-aix.fr/~bigi/
It uses the Julius tool. It would be a very good idea to get her on 
board. Among other things, she has excellent contacts with China. 
Brigitte is located at the phonetics lab in Aix-en-Provence, which has a 
large sustainable repository and is connected with the European CLARIN 
network, with which I am sure you will already have connections.

Being retired I have no material infrastructure to offer, but am very 
willing to contribute in other ways. I am planning to be at LREC and am 
looking forward to that beach meeting!

Cheers,

	Dafydd


Am 20.10.2015 um 16:28 schrieb Damir Cavar:
> Hi Chris and everybody,
>
> On 10/20/2015 07:06 AM, Christian Chiarcos wrote:
>
>> GORILLA also comes at a critical point in time where existing European
>> infrastructures for endangered languages are in transition, e.g., the
>> TLA tools at the MPI Nijmegen. Considering, for example, the lexicon
>> portal lexus (https://tla.mpi.nl/tools/tla-tools/older-tools/lexus/), it
>> has been impossible to register new users to the system for about a
>> year, and accordingly, it is no longer possible to provide access to
>> certain resources. For these (or their providers), GORILLA could
>> represent a great alternative.
>
> Many projects and institutions have to deal with the question of
> sustainability and continued funding not only for the development of new
> technologies and tools, but also for maintenance of existing resources.
> This is a serious issue. We need to come up with models to keep the
> important and existing infrastructure alive.
>
>> Please let us know whether any input from the side of this community
>> would be helpful for future GORILLA development, e.g., regarding
>> preferred formats and tool support, whether you need assistance for
>> developing converters or an (L)LOD interface, or recommendations with
>> respect to other technological aspects.
>
> We welcome collaborators, volunteers, and helpers!
>
> We have to work out documents that describe the recommendations or
> requirements for different parts of a data-set (e.g. meta-information,
> formats, RDF etc.). As for some aspects of language documentation and
> technologies we maintain the EMELD (http://emeld.org/) pages. But these
> are outdated and urgently need updating. We might consider applying
> for funding to do exactly that, i.e. to develop new documents that
> describe best practice in different areas of technology for language
> documentation. Joint grant applications with e.g. European and
> US-institutions are possible, and as far as we at LINGUIST List are
> concerned, they are desired. In fact, inclusion of the global community
> would be ideal, but I know little of joint grant opportunities with
> regions other than Europe.
>
> While we are working on the back-end for managing all the digital data
> and storage issues, we also want to develop web-based front-ends
> for viewing, commenting, searching and so on. This includes formats like
> interlinear glossed text, time-alignments over audio and video as
> produced by ELAN or Praat, constituent and dependency trees in
> treebanks, mappings of tags to some ontology or taxonomy for
> terminology, etc. This means that we will try to integrate existing open
> web-based tools, but also develop our own. The code of the GORILLA site
> uses Python 3.x and Django 1.8 and is stored in a Bitbucket
> repository. We might be able to add a certain number of co-developers.
> We should talk about that.
>
>> Of course, the question of recommended formats is quite an urgent one in
>> this context, and if understood correctly, you do not want to create
>> yet-another generic formalism. But when focusing on established formats,
>> the choice depends on what you want to achieve. For merely depositing
>> language resources, a restricted choice of formats with offline tool
>> support is a great solution. To mention one format not on your list
>> below, Toolbox is perfectly fine for this purpose.
>
> We have a broader goal here with GORILLA. It is not just about storing
> language documentation data. We are converting this data to common
> corpus formats. So the speech recordings and transcriptions we convert
> to common speech corpus formats and extract data and models from those
> to be able to train different types of Forced Alignment tools or basic
> ASRs. We develop computational morphologies for some of the languages,
> basic CFGs, even Feature Grammars that can be useful for LFG parsers,
> etc. We will provide the corpora, data and technologies for the
> different languages on GORILLA, e.g. a first Yiddish speech corpus of
> several hours with two functioning Forced Aligners and a basic set of
> models for ASR development, same for Chatino, Burmese, etc. The data is
> being PoS-tagged and word-by-word and utterance-translated (to English,
> partially to German too).
>
> The goals are, among others:
>
> - to archive digital language data, and develop an archive
> infrastructure that allows for global and interconnected repos that can
> mirror each other, sync annotations and extensions, etc.
>
> - bring the language data into some format that allows for
> cross-linguistic analyses and studies across linguistic levels (some
> level of compatibility of data encoding and interoperable annotation
> required), our large research goal being a snapshot of all languages
> that grows over time
>
> - bring the data into various common formats for speech and language
> technology tools and environments (enable it as a training/testing
> corpus, bootstrap technologies for low-resourced or endangered
> languages, as well as offer resources for common languages that so far
> are inaccessible to economically challenged members of the global
> community, researchers, or corpus or NLP-interested people)
>
> The story is complicated and long. Hopefully we will get to a paper on
> that, and maybe we will have an opportunity to present this some time
> somewhere.
>
> Right now we have started launching the initial infrastructure,
> negotiating various necessary components, creating first corpora and
> data-sets that we can present online, working out documents, legal
> arrangements at our hosting institution, etc.
>
>> If you want to go beyond this in the longer perspective, e.g., by
>> encouraging an (L)LOD conversion for selected resources, or by providing
>> online search in these language resources, (...) have interoperability
>> issues, and we need to discuss and document advantages and disadvantages
>> of using one or the other before promoting one of these to yet another
>> de-facto standard.
>>
>> What I would like to suggest is to initiate a discussion with respect to
>> which formats to recommend (not necessarily now, but as soon as first
>> data sets are available, and certainly not exclusively to this mailing
>> list), and a good way to do so would be to present the platform and its
>> goals as soon as a minimal level of maturity is achieved, on our
>> blog/website or at one of our workshops (e.g., LDL-2016). Please let me
>> know what you think ;)
>
> We are submitting some abstracts to the next LREC; we could have a
> spontaneous workshop at the beach in Portoroz, should at least one of
> those be accepted... :-) Or, we could organize some alternative location
> and environment for an LREC chat on that topic. We would be happy to
> participate at LDL and discuss some of that.
>
> We would also be happy to include some of you in a proposal, or get
> included in yours, applying for funding to support the development of an
> infrastructure, resources, network of intercontinental mirrors and
> networked systems for language data, etc.
>
> Thanks!
>
> All the best
>
> Damir
>
>
>
>

-- 
wwwhomes.uni-bielefeld.de/gibbon


