[ckan-dev] Problem with harvested ISO19139 records containing non-English characters

Brian Bonnlander bonnland at ucar.edu
Fri Jun 9 22:49:23 UTC 2017


Hi CKAN developers,

I'm running into a problem with harvesting non-English characters from 
ISO 19139 documents.

We are using version 2.5.5 of CKAN with recently pulled versions of 
ckanext-harvest and ckanext-spatial.   We're using the WAF harvester 
from ckanext-harvest.

What's interesting is that when I directly enter the non-English text 
into the CKAN dataset entry form for the Title and Notes, the resulting 
dataset view shows the characters correctly.   Yet the dataset view for 
the harvested version displays the characters incorrectly.

I've checked our Postgres database settings.  The client_encoding is set 
to 'UTF8' and the server_encoding is set to 'UNICODE'.   Table encodings 
are UTF-8, as the install instructions indicate.


When I make API requests for the two different versions of the record, I 
get back slightly different encodings for the non-English characters.


The title string for the directly entered record contains: "Normales 
climatol\u00f3gicas".   The encoded character displays in CKAN as an "o" 
with an accent over it, which is what we are aiming for.

The title string for the harvested record contains:  "Normales 
climatol\u00c3\u00b3gicas".   The encoded characters display in CKAN as 
an "A" with a tilde over it, followed by a superscript "3".
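
To make the difference concrete, here is a small Python 3 sketch that decodes the two title strings as they appear in the API responses and lists their non-ASCII characters. It only illustrates the symptom and is not taken from CKAN or the harvester code; the names `entered`, `harvested`, and `describe` are made up for this example.

```python
import json
import unicodedata

# Title as returned by the API for the directly entered record.
entered = json.loads('"Normales climatol\\u00f3gicas"')

# Title as returned by the API for the harvested record.
harvested = json.loads('"Normales climatol\\u00c3\\u00b3gicas"')


def describe(text):
    """Return (code point, Unicode name) for each non-ASCII character."""
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch))
        for ch in text
        if ord(ch) > 127
    ]


print(describe(entered))
# [('U+00F3', 'LATIN SMALL LETTER O WITH ACUTE')]
print(describe(harvested))
# [('U+00C3', 'LATIN CAPITAL LETTER A WITH TILDE'), ('U+00B3', 'SUPERSCRIPT THREE')]
```

The harvested title is one character longer than the entered one, because a single accented letter has turned into two separate characters.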


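I could be wrong, but it looks like the two encoded characters in the 
harvested record are the two UTF-8 bytes of "o" with an accent (0xC3 
0xB3), each read as a separate Latin-1 character.   That would mean the 
document is being decoded with the wrong character set somewhere during 
harvesting.   This leads me to think that we are not specifying some 
configuration setting correctly, but that is a total guess.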
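
A guess like this can be checked directly by comparing the candidate encodings of the accented "o" with the two characters seen in the harvested title. The following is a minimal Python 3 sketch; `repair` is a hypothetical helper written for this illustration, not part of CKAN, ckanext-harvest, or ckanext-spatial, and it assumes the whole string was mis-decoded in the same way.

```python
accented_o = "\u00f3"            # the intended character
harvested_pair = "\u00c3\u00b3"  # what the harvested title contains

# Candidate byte representations of the intended character.
utf8_bytes = accented_o.encode("utf-8")       # b'\xc3\xb3'
utf16_bytes = accented_o.encode("utf-16-be")  # b'\x00\xf3'

# Reading the UTF-8 bytes one byte per character (Latin-1) reproduces
# the pair seen in the harvested record; the UTF-16 bytes do not.
mis_decoded = utf8_bytes.decode("latin-1")
print(mis_decoded == harvested_pair)                    # True
print(utf16_bytes.decode("latin-1") == harvested_pair)  # False


def repair(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up (hypothetical helper).

    Raises UnicodeError if the text was not mis-decoded this way.
    """
    return text.encode("latin-1").decode("utf-8")


print(repair("Normales climatol\u00c3\u00b3gicas") == "Normales climatol\u00f3gicas")  # True
```

A helper like `repair` only confirms the diagnosis after the fact; the better fix is to find where the harvested document is decoded with the wrong character set.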


I've looked all over for people running into similar issues, but I 
haven't found anything that discusses this kind of problem.

I hope someone is able to harvest the attached record and give insights 
into possible configuration problems that we may have, or have other 
insights.


Thank you!

--Brian

-------------- next part --------------
A non-text attachment was scrubbed...
Name: utf8_test_iso19139.xml
Type: text/xml
Size: 8676 bytes
Desc: not available
URL: <http://lists.okfn.org/pipermail/ckan-dev/attachments/20170609/ebf2bc13/attachment-0002.xml>

