[open-linguistics] Question: replacing language codes in a SPARQL BIND statement?

Christian Chiarcos chiarcos at informatik.uni-frankfurt.de
Thu Mar 17 11:20:44 UTC 2016


Hi Felix,

thanks for correcting me. I was oversimplifying with a hypothetical
example, and got it wrong. In fact, BCP 47 states that

  "When languages have both an ISO 639-1 two-character code and a three-
character code (assigned by ISO 639-2, ISO 639-3, or ISO 639-5), only
the ISO 639-1 two-character code is defined in the IANA registry."

> xml:lang allows only for BCP 47 language tags, and here the options you
> describe (e.g. ISO 639-3 vs. ISO 639-2) are not available. So if you use
> a language tag validator you can at least detect that an xml:lang value
> is not valid.

The conversion issue, however, remains with BCP 47, as soon as extended  
language subtags are involved:

"Extended language subtags are used to identify certain specially selected  
languages that, for various historical and compatibility reasons, are  
closely identified with or tagged using an existing primary language  
subtag. Extended language subtags are always used with their enclosing  
primary language subtag (indicated with a 'Prefix' field in the registry)  
when used to form the language tag. ...
For example, the macrolanguage Chinese ('zh') encompasses a number of  
languages. For compatibility reasons, each of these languages has both a  
primary and extended language subtag in the registry. A few selected  
examples of these include Gan Chinese ('gan'), Cantonese Chinese ('yue'),  
and Mandarin Chinese ('cmn'). Each is encompassed by the macrolanguage  
'zh' (Chinese). Therefore, they each have the prefix "zh" in their  
registry records. Thus, Gan Chinese is represented with tags beginning  
"zh-gan" or "gan", Cantonese with tags beginning either "yue" or "zh-yue",  
and Mandarin Chinese with "zh-cmn" or "cmn"."

Both quotes are from RFC 5646, which is part of BCP 47:
http://www.rfc-editor.org/rfc/bcp/bcp47.txt (or
https://tools.ietf.org/html/rfc5646).
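
To make the two spellings concrete, here is a minimal Python sketch that
rewrites the "zh-gan" style into the "gan" style. The table covers only the
three varieties named in the quoted passage, not the full IANA registry, and
the names EXTLANG_PREFIX and to_primary_form are made up for illustration.

```python
# Extended language subtags and their registry 'Prefix', limited to the
# three examples from the quoted passage (an assumption-laden excerpt,
# not the full IANA registry).
EXTLANG_PREFIX = {"gan": "zh", "yue": "zh", "cmn": "zh"}


def to_primary_form(tag: str) -> str:
    """Rewrite 'zh-gan'-style tags to the 'gan'-style spelling."""
    subtags = tag.split("-")
    if len(subtags) >= 2:
        prefix, ext = subtags[0].lower(), subtags[1].lower()
        if EXTLANG_PREFIX.get(ext) == prefix:
            # Drop the enclosing primary subtag; keep any remaining subtags.
            return "-".join([ext] + subtags[2:])
    # Anything else is returned unchanged.
    return tag


for example in ("zh-gan", "zh-yue-HK", "cmn", "en-GB"):
    print(example, "->", to_primary_form(example))
```

Tags that do not start with a known prefix/extlang pair pass through
untouched, so the function can be applied to mixed data.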

> https://validator.w3.org/#validate_by_input

The validator actually complains about "zh-gan": "Potentially bad value
zh-gan for attribute lang on element html: The language tag zh-gan is
deprecated. Use gan instead." (This might be incorrect, as the validator
refers to the very specification from which I took "zh-gan", see above.)

But anyway, a 3-letter-to-2-letter conversion is still required if we
want to treat lexical forms from sub-varieties of Chinese (like gan) like
"ordinary" zh.

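A rough Python sketch of that conversion, assuming a hand-written lookup
table with only the three varieties from the BCP 47 passage (a real mapping
would have to be derived from the IANA registry; MACROLANGUAGE and
macrolanguage_tag are hypothetical names):

```python
# Primary language subtag -> enclosing macrolanguage. Only the three
# varieties named in the quoted BCP 47 passage; an illustrative excerpt.
MACROLANGUAGE = {"gan": "zh", "yue": "zh", "cmn": "zh"}


def macrolanguage_tag(tag: str) -> str:
    """Collapse a variety-specific tag to its macrolanguage tag."""
    subtags = tag.split("-")
    primary = subtags[0].lower()
    rest = subtags[1:]
    # 'zh-gan' style: drop the extended language subtag.
    if rest and MACROLANGUAGE.get(rest[0].lower()) == primary:
        rest = rest[1:]
    # 'gan' style: swap the primary subtag for its macrolanguage.
    primary = MACROLANGUAGE.get(primary, primary)
    return "-".join([primary] + rest)


for example in ("gan", "zh-gan", "yue-HK", "en-GB"):
    print(example, "->", macrolanguage_tag(example))
```

Both spellings of a variety end up as plain "zh" (with any region or script
subtags kept), which is the tag one would then attach to the lexical form.
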
> But the underlying library
> https://about.validator.nu/
> has a class to validate language tags on its own.

That will certainly help. Thanks to all for responding, I have a much  
clearer picture of language tags now.

Thanks a lot,
Christian
-- 
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 10, #401b
mail: chiarcos at informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931