[open-linguistics] Question: replacing language codes in a SPARQL BIND statement?
Christian Chiarcos
chiarcos at informatik.uni-frankfurt.de
Thu Mar 17 11:20:44 UTC 2016
Hi Felix,
thanks for correcting me. I was oversimplifying with a hypothetical
example, and it was actually wrong. In fact, BCP 47 states that
"When languages have both an ISO 639-1 two-character code and a three-
character code (assigned by ISO 639-2, ISO 639-3, or ISO 639-5), only
the ISO 639-1 two-character code is defined in the IANA registry."
> xml:lang allows only for BCP 47 language tags, and here the options you
> describe (e.g. ISO-639-3 vs. ISO-639-2) are not available. So if you use
> a language tag validator you can at least detect that an xml:lang value
> is not valid.
The conversion issue, however, remains with BCP 47, as soon as extended
language subtags are involved:
"Extended language subtags are used to identify certain specially selected
languages that, for various historical and compatibility reasons, are
closely identified with or tagged using an existing primary language
subtag. Extended language subtags are always used with their enclosing
primary language subtag (indicated with a 'Prefix' field in the registry)
when used to form the language tag. ...
For example, the macrolanguage Chinese ('zh') encompasses a number of
languages. For compatibility reasons, each of these languages has both a
primary and extended language subtag in the registry. A few selected
examples of these include Gan Chinese ('gan'), Cantonese Chinese ('yue'),
and Mandarin Chinese ('cmn'). Each is encompassed by the macrolanguage
'zh' (Chinese). Therefore, they each have the prefix "zh" in their
registry records. Thus, Gan Chinese is represented with tags beginning
"zh-gan" or "gan", Cantonese with tags beginning either "yue" or "zh-yue",
and Mandarin Chinese with "zh-cmn" or "cmn"."
Both quotes are from http://www.rfc-editor.org/rfc/bcp/bcp47.txt (also
available as https://tools.ietf.org/html/rfc5646).
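To make the two spellings in the quoted passage concrete, here is a minimal Python sketch that rewrites the "zh-gan" style to the shorter "gan" style. The table holds only the three examples from the quote, not the full IANA registry, and the names EXTLANG_PREFIX and drop_extlang_prefix are made up for illustration.

```python
# Extended language subtags and their 'Prefix' field, limited to the three
# examples in the quoted passage (an assumption; not the full IANA registry).
EXTLANG_PREFIX = {"gan": "zh", "yue": "zh", "cmn": "zh"}


def drop_extlang_prefix(tag: str) -> str:
    """Rewrite 'zh-gan' style tags to 'gan'; leave all other tags unchanged."""
    subtags = tag.split("-")
    if len(subtags) >= 2:
        primary, second = subtags[0].lower(), subtags[1].lower()
        if EXTLANG_PREFIX.get(second) == primary:
            # Drop the enclosing primary subtag and keep whatever follows.
            return "-".join([second] + subtags[2:])
    return tag


for t in ["zh-gan", "zh-yue-HK", "cmn", "zh-Hant"]:
    print(t, "->", drop_extlang_prefix(t))
```

A real implementation would read the prefixes from the IANA Language Subtag Registry rather than hard-coding them.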
> https://validator.w3.org/#validate_by_input
The validator actually complains about "zh-gan": "Potentially bad value
zh-gan for attribute lang on element html: The language tag zh-gan is
deprecated. Use gan instead." (This might be incorrect, since the very
text I quoted above presents "zh-gan" as a possible form.)
In any case, a 3-letter-to-2-letter conversion is still required if we
want to treat lexical forms from sub-varieties of Chinese (such as gan)
like "ordinary" zh.
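As a rough illustration of that conversion, the following Python sketch maps a language tag to its macrolanguage subtag. The lookup table covers only the varieties mentioned in this thread, and MACROLANGUAGE and to_macrolanguage are hypothetical names, not part of any library.

```python
# Macrolanguage for selected primary language subtags. Only the examples
# from this thread are listed (an assumption; a full table would come from
# the IANA registry or ISO 639-3).
MACROLANGUAGE = {"gan": "zh", "yue": "zh", "cmn": "zh"}


def to_macrolanguage(tag: str) -> str:
    """Return the macrolanguage subtag for a tag, or its primary subtag."""
    # Only the first subtag matters here; script, region and extlang
    # subtags are discarded, so "zh-gan" and "yue-HK" both yield "zh".
    primary = tag.lower().split("-")[0]
    return MACROLANGUAGE.get(primary, primary)


for t in ["gan", "zh-gan", "yue-HK", "zh", "en-US"]:
    print(t, "->", to_macrolanguage(t))
```

The resulting value could then be used wherever lexical forms tagged "gan", "yue" or "cmn" should be grouped with those tagged "zh".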
> But the underlying library
> https://about.validator.nu/
> has a class to validate language tags on its own.
That will certainly help. Thanks to all for responding, I have a much
clearer picture of language tags now.
Thanks a lot,
Christian
--
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany
office: Robert-Mayer-Str. 10, #401b
mail: chiarcos at informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931