[open-linguistics] Linked Open Data and Endangered Languages

Mon Oct 19 14:43:20 UTC 2015

Hi there,

We are currently setting up the environment and tools to launch
something like that: a linked repository for language data and corpora
that also covers endangered-language data (processed using, e.g.,
ELAN, Praat, or FLEx). On GORILLA

http://gorilla.linguistlist.org/about/

we will provide our own data for a number of languages, i.e.
time-aligned, tagged, and translated ELAN files (with the corresponding
audio recordings), transcoded and augmented Praat files, and plain
TIMIT-style speech corpus files, together with acoustic and language
models for training forced aligners or common ASR systems. The aim is to
offer corpora as well as speech and language technologies for endangered
and low-resourced languages, and the site will also serve as a
repository for language documentation data in general. Two of the
corpora we created this summer are from endangered languages, and a
bunch more are from low-resourced ones. Since we are busy setting up
some of the technology, it might still take us a month or two to come up
with the first data and catalog.

There are various possibilities. We could host people's data (perhaps
only for the time being, until you establish a mirror), attach CMDI
meta-information to it, assign it DOIs, and put it in a linked
catalog that also connects to the CLARIN (and OAI) infrastructure (we
have not set this up yet; we only just solved the DOI issue), and then
we can generate the RDF and so on.
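
To make the last step a bit more concrete, here is a minimal sketch of
what "generate the RDF" could look like for one catalog entry. It is not
our actual pipeline. It assumes a dataset that already has a DOI and
writes three Dublin Core statements as N-Triples using only the Python
standard library. The function names, the DOI, the title, and the
language code are placeholders invented for the illustration, and the
use of Lexvo URIs for language codes is just one possible choice.

```python
# Sketch only: the DOI, title, and language code used below are made-up placeholders.
DCT = "http://purl.org/dc/terms/"


def literal(text):
    """Escape a Python string as an N-Triples string literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def catalog_entry_to_ntriples(doi, title, iso639_3, license_url):
    """Return three N-Triples lines describing one dataset identified by a DOI."""
    subject = f"<https://doi.org/{doi}>"
    triples = [
        (subject, f"<{DCT}title>", literal(title)),
        # Assumption: the language is identified by a Lexvo ISO 639-3 URI.
        (subject, f"<{DCT}language>", f"<http://lexvo.org/id/iso639-3/{iso639_3}>"),
        (subject, f"<{DCT}license>", f"<{license_url}>"),
    ]
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


print(catalog_entry_to_ntriples(
    "10.5072/example",               # placeholder DOI, not a real dataset
    "Example speech corpus",         # placeholder title
    "hin",                           # placeholder ISO 639-3 code
    "https://creativecommons.org/licenses/by-sa/4.0/",
))
```

A real setup would more likely derive these statements from the CMDI
record with an RDF library, but the output has the same shape: one
subject (the DOI) with typed links to a language and a license.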

We will come up with more tutorials and material on our site that might
be helpful for you to set up your own repository, maybe mirror ours, so
that we have cross-continental redundancy and backups etc.

Our requirement is that the data be licensed under CC BY-SA or CC BY (we
would not accept an NC clause) if we are to invest in the process of
meta-information and linking described above. We can take any kind
of data into the archive, though. Our GORILLA goal is to document as
many languages as possible using audio or video recordings,
transcription, phonetic or phonemic transcription, PoS-tagging, and
translation. We are using a uniform standard (or two, e.g. ELAN
annotation format, Praat TextGrid, TEI XML) and interoperable linguistic
tags or annotations (linking to GOLD and/or ISOcat).
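
As a rough illustration of what such linking can mean in practice, the
sketch below reads one tier from an ELAN file and attaches a GOLD
concept URI to each tag. The embedded XML is a tiny hand-written
stand-in for a real .eaf file, the tier name "pos" and the tag set are
hypothetical, and the two GOLD URIs are assumptions that should be
checked against the ontology before use.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written stand-in for an ELAN .eaf file (real files carry more metadata).
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="850"/>
  </TIME_ORDER>
  <TIER TIER_ID="pos">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>N</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

# Hypothetical project tag set mapped to GOLD concept URIs (assumed, verify against GOLD).
GOLD = "http://purl.org/linguistics/gold/"
TAG_TO_GOLD = {"N": GOLD + "Noun", "V": GOLD + "Verb"}


def read_tier(eaf_xml, tier_id):
    """Return (start_ms, end_ms, tag, gold_uri) tuples for one time-aligned tier."""
    root = ET.fromstring(eaf_xml)
    # Map time-slot ids to millisecond offsets; unaligned slots have no TIME_VALUE.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
        if ts.get("TIME_VALUE") is not None
    }
    rows = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            value = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((
                slots.get(ann.get("TIME_SLOT_REF1")),
                slots.get(ann.get("TIME_SLOT_REF2")),
                value,
                TAG_TO_GOLD.get(value),  # None when the tag has no mapping yet
            ))
    return rows


print(read_tier(SAMPLE_EAF, "pos"))
```

Tags without an entry in the mapping come back with None, which makes it
easy to see which parts of a project-specific tag set still need to be
linked.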

Ritesh, I do not know whether I can make it to Agra in February, but
I will be somewhere nearer to there in spring next year. We could talk
about that over a private line.

Best wishes

Damir

-- 
Damir Cavar
Dept. of Linguistics, Indiana University
Co-director and Moderator, The LINGUIST List
https://linguistlist.org/people/damir_cavar.html

On 10/13/2015 04:22 AM, ANTONIO PAREJA LORA wrote:
> Dear Ritesh,
>         We addressed quite similar problems in the LLOD at LSA 2015
> workshop (Development of Linguistic Linked Open Data (LLOD) Resources
> for Collaborative Data-Intensive Research in the Language Sciences - LSA
> Summer Institute 2015, held in Chicago in July). You might want to have
> a look at some of the slides published on the workshop
> website: http://quijote.fdi.ucm.es:8084/LLOD-LSASummerWorkshop2015/Program.html.
> I think they can be a nice starting point for you.
>         Best,
>                          Antonio.
> 
> 2015-10-12 16:25 GMT+02:00 Ritesh <riteshkrjnu at gmail.com
> <mailto:riteshkrjnu at gmail.com>>:
> 
>     Dear members,
> 
>     I have been trying to read about and learn how to convert language
>     data that is either in XML format or in plain text format into
>     Linked Open Data. What I have understood is that it needs to be
>     represented in RDF format, made accessible over the web, and linked
>     to other similar resources, if possible. However, there has been
>     too much information and too many frameworks to deal with, and I
>     must admit I am pretty lost.
> 
>     What some of my colleagues and I are trying to do is very simple - we
>     have a large amount of audio (and also video) recordings, with
>     interlinear glossing, of quite a few critically endangered languages
>     of India. These data were (and still are being) collected as part
>     of a documentation project to preserve as much as possible of these
>     dying languages (and maybe to use the language data to revitalise
>     them). Most of the data is in XML format created by two of the most
>     widely used software tools in field linguistics and language
>     documentation - SIL FieldWorks and the ELAN Video Annotation Tool.
>     So the data is structured but is not Linked Open Data. We are
>     trying to export and publish this data as Linked Open Data.
> 
>     Now the problem is - none of us really understands RDF or Linked
>     Open Data that well. Theoretically we understand that RDF maintains
>     the semantics of any document, thereby making interoperability
>     possible, but that is pretty much all we know. We are still not able
>     to figure out how exactly this could be done. Any pointers towards
>     what exactly Linked Open Data is and how we could convert data into
>     Linked Open Data would be very helpful. Of course, there are a large
>     number of resources available on the web but they are a bit too much
>     - most of the time we end up more confused than ever. So we would
>     appreciate something which gives an overview of this and maybe also
>     some indications / guidelines as to how we could approach this.
> 
>     In addition to this we were also wondering if somebody in this group
>     would be interested in delivering a 'revelation' talk or maybe
>     giving some kind of tutorial / workshop / training on how exactly
>     this could be done. We are organising a conference on Language
>     Technologies for Endangered Languages from 25 - 27 February, 2016 in
>     Agra, India (Conference Website <http://elkl4.kmiagra.in/>)
>     and we would like it to be done there so that as many people as
>     possible could benefit. Please let me know and we could talk
>     about this by personal email (without spamming the inboxes of the
>     subscribers to this list).
> 
>     Thanks & Best regards,
> 
>     -- 
>     Ritesh Kumar, Ph.D.
>     Assistant Professor
>     Department of Linguistics
>     Dr. B.R. Ambedkar University
>     Agra, India
>