[open-linguistics] CfP: Grammar Data Mining (GDM): Extracting Linguistic Features From Grammatical Descriptions, September 5-6, 2019 - Varna, Bulgaria
Sebastian Nordhoff
sebastian.nordhoff at glottotopia.de
Fri Apr 5 06:59:31 UTC 2019
Call for papers
Grammar Data Mining (GDM): Extracting Linguistic Features From
Grammatical Descriptions
September 5-6, 2019 - Varna, Bulgaria
Submission deadline: 30 June 2019
Link: https://spraakbanken.gu.se/lsi/sharedtask/
Description
-----------
The present Workshop/Shared Task seeks to transform a large set of
digitized publications describing the grammars of the languages of the
world into structured databases that will enable comparison of
different languages at an unprecedented breadth and depth.
There are some 6 500 languages in the world and information about
their grammatical characteristics is available in book-form for over
4 000 of them. Until recently, information has been extracted from
grammars exclusively by hand. This procedure is bounded by the limits
of human capacity, and as such can cover only a relatively small
number of languages and characteristics, at a substantial investment
of time.
We are now entering a phase where it is practical to use NLP tools for
a number of similar tasks. At a minimum, a computer may infer some
characteristics of the language described simply by counting words
used in a grammatical description, e.g., a high frequency of the term
‘suffix’ likely indicates that the language being described uses a lot
of suffixes. Further, there are less straightforward or more detailed
characteristics traditionally of interest to linguists, such as where
the verb is placed in the sentence (beginning, middle, end), the
existence and use of participles, possessive constructions,
evidentiality and so on. Any techniques from the NLP toolbox, such as
tf-idf weighting, tagging, parsing and vector spaces, may be used in
combination and as input to more sophisticated Machine Learning
approaches.
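The word-counting idea above can be sketched as follows. This is a
minimal illustration, not part of the shared task materials: the
function name term_rates, the prefix-matching rule and the sample
sentence are all assumptions made for the example.

```python
import re
from collections import Counter


def term_rates(text, terms):
    """Return, for each term, its occurrences per 1,000 word tokens.

    Hypothetical helper: a token counts as a hit if it starts with the
    term, so "suffix" also matches "suffixes" and "suffixation".
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    rates = {}
    for term in terms:
        hits = sum(n for token, n in counts.items() if token.startswith(term))
        rates[term] = 1000.0 * hits / total
    return rates


# Made-up sample text standing in for an OCRed grammar.
sample = "The verb takes a suffix. Most suffixes mark tense; one prefix marks negation."
print(term_rates(sample, ["suffix", "prefix"]))
```

Normalizing by document length keeps long and short grammars
comparable; raw counts alone would favor longer descriptions.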
In this shared task we provide a subset of the World Atlas of Language
Structures (WALS, http://wals.info) along with the digitized sources
from which the features were drawn. Sources are provided in raw text
form. The task is to infer WALS datapoints from the raw text data of
the digitized grammatical descriptions.
Training Data
-------------
10 000 datapoints spanning 191 languages and 100 features, along with their
values and source(s), are given as training data in the following form:
Language  ISO 639-3  Feature                                         Value                Source
--------  ---------  ----------------------------------------------  -------------------  ------
Macushi   mbc        31A Sex-based and Non-sex-based Gender Systems  Non-sex-based        Abbott-1991[105-106]
Macushi   mbc        57A Position of Pronominal Possessive Affixes   Possessive prefixes  Abbott-1991[85,101]; Williams-1932[61]; Carson-1982[104-106]
E. Oromo  hae        118A Predicative Adjectives                     Mixed                Owens-1985
E. Oromo  hae        9A The Velar Nasal                              No velar nasal       Owens-1985[10]
...       ...        ...                                             ...                  ...
Features and values are defined as per WALS
(http://wals.info). Sources are semicolon-separated and optionally
indicate a page range in square brackets. Each source maps uniquely to
an entry with bibliographical details in a BibTeX file and to the
full text of the source in question. The full text is an OCR of a scan
of the original source (of varying quality) and contains no
formatting. OCR errors are present, especially for IPA or
non-ASCII-script text in a vernacular. A total of 443 source
texts are supplied.
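As an illustration of the Source field format described above, the
following sketch splits a field into source keys and optional page
specifications. The function name parse_sources is hypothetical, and
the code assumes that no semicolon occurs inside the square brackets;
the actual files in the training archive may need additional handling.

```python
import re

# A source key, optionally followed by a page specification in brackets.
SOURCE_RE = re.compile(r"^\s*(?P<key>[^\[\];]+?)\s*(?:\[(?P<pages>[^\]]*)\])?\s*$")


def parse_sources(field):
    """Split a Source field into (key, pages) pairs.

    pages is the raw text between the brackets, or None if the source
    carries no page specification.
    """
    parsed = []
    for chunk in field.split(";"):
        if not chunk.strip():
            continue  # tolerate a trailing semicolon
        match = SOURCE_RE.match(chunk)
        if match is None:
            raise ValueError("unrecognized source entry: %r" % chunk)
        parsed.append((match.group("key"), match.group("pages")))
    return parsed


print(parse_sources("Abbott-1991[85,101]; Williams-1932[61]; Carson-1982[104-106]"))
```

The page specification is kept as a string because it may be a single
page, a list or a range.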
The training data can be downloaded at
http://stp.lingfil.uu.se/~harald/grammar-data-mining.zip
Task
----
The task is to provide the Value for an unseen Language-Feature-Source
triple.
No language-specific data source external to the training data
(such as the classification of a language, other sources for a language
etc.) may be used. However, other open generic linguistic data sources
may be utilized (such as the raw text of the corresponding WALS
chapter, a list of linguistic terms etc.).
Not every possible value for every feature is attested in the training
data set, but systems should nevertheless be able, in principle, to
output any of the possible values for a feature as defined in WALS. It
is not obligatory that the training set values are utilized at all.
Submission Instructions
-----------------------
Authors should submit a paper of up to 8 pages conforming to the RANLP
style guidelines (see http://lml.bas.bg/ranlp2019/submissions.php)
describing their technical solution to the specific task. The
submission should contain a link to a runnable version (e.g. on
github.com) of the authors’ solution. When run, it should output a
Value (and nothing else): given a language code, the feature of
interest, and the source document(s), the system should output the
feature value as exemplified below:
$ python grammar-data-mining.py "hae" "118A Predicative Adjectives" "Owens-1985; Heine 1981"
Mixed
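A possible skeleton for such a command-line entry point is sketched
below. It is an assumption about how a submission might be organized,
not a required layout: the names split_sources, predict_value, run and
main are hypothetical, and predict_value is a placeholder that returns
a constant instead of reading the source texts.

```python
import sys


def split_sources(source_field):
    """Split the semicolon-separated source argument into a list of keys."""
    return [part.strip() for part in source_field.split(";") if part.strip()]


def predict_value(iso_code, feature, sources):
    """Placeholder only: a real system would analyze the source texts here."""
    return "Mixed"


def run(argv):
    """Return the predicted Value for [iso_code, feature, source_field]."""
    if len(argv) != 3:
        raise ValueError("expected three arguments: ISO code, feature, source(s)")
    iso_code, feature, source_field = argv
    return predict_value(iso_code, feature, split_sources(source_field))


def main(argv):
    try:
        value = run(argv)
    except ValueError as error:
        sys.stderr.write("usage error: %s\n" % error)  # keep stdout clean
        return 2
    print(value)  # the Value and nothing else goes to stdout
    return 0


# Run only when arguments are supplied, so that importing or executing
# this file without arguments has no side effects.
if __name__ == "__main__" and sys.argv[1:]:
    sys.exit(main(sys.argv[1:]))
```

Diagnostics go to standard error so that standard output carries only
the Value.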
Submission is electronic, using the Softconf submission system for the
Grammar Data Mining Workshop at https://www.softconf.com/ranlp2019/GDM/
Papers must be written in English.
Submitted papers will be peer-reviewed by three experts from a related
field.
At least one author of each accepted paper is required to register for
the RANLP 2019 conference, attend the workshop, and present the paper.
Important Dates
---------------
Workshop paper submission deadline: 30 June 2019
Workshop paper acceptance notification: 28 July 2019
Workshop paper camera-ready version: 20 August 2019
Workshop: 5-6 September 2019
Evaluation
----------
Each submission will be evaluated against a test set of 1000 random
datapoints drawn from the same origin as the training data set. The
test set will not be made available until after submission. Aspects
other than accuracy (such as running time) will not be evaluated.
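For self-evaluation on held-out training datapoints, accuracy can be
computed as in the sketch below. It assumes that a prediction counts
as correct only if it matches the gold Value exactly, apart from
surrounding whitespace; the call does not specify the matching rule,
and the function name accuracy is hypothetical.

```python
def accuracy(predicted, gold):
    """Return the fraction of predictions that equal the gold Value."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p.strip() == g.strip() for p, g in zip(predicted, gold))
    return correct / len(gold)


# Gold values taken from the example rows of the training data format;
# the predictions are made up, with one deliberate error.
gold = ["Non-sex-based", "Possessive prefixes", "Mixed", "No velar nasal"]
predicted = ["Non-sex-based", "Possessive suffixes", "Mixed", "No velar nasal"]
print(accuracy(predicted, gold))  # 3 of 4 correct, prints 0.75
```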
Programme Committee
-------------------
Guillaume Segerer (CNRS, LLACAN, France)
Harald Hammarström (Department of Linguistics and Philology, Uppsala
University, Sweden)
Markus Forsberg (Språkbanken, University of Gothenburg, Sweden)
Søren Wichmann (Leiden University Centre for Linguistics, Netherlands)
Shafqat Mumtaz Virk (Språkbanken, University of Gothenburg, Sweden)
Zeljko Agic (IT University of Copenhagen, Denmark)
Erich Round (University of Queensland, Australia)
Sebastian Nordhoff (LangSci Press, Germany)
Venue
-----
The workshop will be co-located with RANLP (http://lml.bas.bg/ranlp2019)
in Bulgaria and take place in Hotel "Cherno More", Varna, the main
RANLP-2019 conference venue.