[ckan-dev] Problem with harvested ISO19139 records containing non-English characters
Brian Bonnlander
bonnland at ucar.edu
Fri Jun 9 22:49:23 UTC 2017
Hi CKAN developers,
I'm running into a problem with harvesting non-English characters from
ISO 19139 documents.
We are using version 2.5.5 of CKAN with recently pulled versions of
ckanext-harvest and ckanext-spatial. We're using the WAF harvester
from ckanext-harvest.
What's interesting is that when I directly enter the non-English text
into the CKAN dataset entry form for the Title and Notes, the resulting
dataset view shows the characters correctly. Yet the dataset view for
the harvested version displays the characters incorrectly.
I've checked our Postgres database settings. The client_encoding is set
to 'UTF8' and the server_encoding is set to 'UNICODE'. Table encodings
are UTF-8, as the install instructions indicate.
When I make API requests for the two different versions of the record, I
get back slightly different encodings for the non-English characters.
The title string for the directly entered record contains: "Normales
climatol\u00f3gicas". The encoded character displays in CKAN as an "o"
with an accent over it, which is what we are aiming for.
The title string for the harvested record contains: "Normales
climatol\u00c3\u00b3gicas". The encoded characters display in CKAN as
an "A" with a tilde over it, followed by a superscript "3".
I could be wrong, but it seems possible that for the harvested record,
the two encoded characters are a UTF-16 representation of "o" with an
accent over it. This leads me to think that we are not specifying some
configuration setting correctly, but that is a total guess.
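One way to test a guess like this is to reproduce the two title strings
in Python and compare encodings. The sketch below is only an
illustration: the variable names and the repair() helper are
hypothetical and not part of CKAN or its extensions. It checks whether
encoding the correct title as UTF-8 and then decoding those bytes as
Latin-1 (ISO-8859-1) gives the harvested title. If it does, that would
point to UTF-8 bytes being read as Latin-1 somewhere along the way,
though it does not show which component is responsible.

```python
# -*- coding: utf-8 -*-
# Minimal sketch; runs on Python 2.7 and Python 3.

# Title as returned by the API for the directly entered record.
correct = u"Normales climatol\u00f3gicas"
# Title as returned by the API for the harvested record.
harvested = u"Normales climatol\u00c3\u00b3gicas"

# The accented "o" is two bytes in UTF-8 (0xC3 0xB3) ...
utf8_bytes = u"\u00f3".encode("utf-8")
# ... and a single 16-bit unit in UTF-16 (0x00F3, big-endian here).
utf16_bytes = u"\u00f3".encode("utf-16-be")

# Simulate the suspected mix-up: UTF-8 bytes decoded as Latin-1.
simulated = correct.encode("utf-8").decode("latin-1")


def repair(text):
    """Hypothetical helper: undo one round of UTF-8-read-as-Latin-1.

    Raises UnicodeError if the text was not damaged in this way, so it
    should not be applied blindly to strings that are already correct.
    """
    return text.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    print(simulated == harvested)        # does the simulation match?
    print(repair(harvested) == correct)  # does the repair round-trip?
```

The code points U+00C3 and U+00B3 in the harvested title have the same
numeric values as the two UTF-8 bytes of the accented "o", which is why
the comparison is worth making. Repairing strings after the fact is a
workaround at best; finding where the bytes are decoded with the wrong
encoding is the more durable fix.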
I've looked all over for people running into similar issues, but I
haven't found anything that discusses this kind of problem.
I hope someone is able to harvest the attached record and give insights
into possible configuration problems that we may have, or have other
insights.
Thank you!
--Brian
-------------- next part --------------
A non-text attachment was scrubbed...
Name: utf8_test_iso19139.xml
Type: text/xml
Size: 8676 bytes
Desc: not available
URL: <http://lists.okfn.org/pipermail/ckan-dev/attachments/20170609/ebf2bc13/attachment-0002.xml>