[ckan-dev] Problem with harvested ISO19139 records containing non-English characters
Brian Bonnlander
bonnland at ucar.edu
Fri Jun 16 22:36:03 UTC 2017
UPDATE:
I think there is another possible explanation: the string is
double-encoded. The logs show that the title string is stored in
"extras" this way: 'title': u'Normales climatol\xc3\xb3gicas'. Here
\xc3\xb3 is the two-byte UTF-8 encoding of "ó", but inside a unicode
literal each byte is treated as a separate character. If the u prefix
at the start of the string literal weren't there, then the string
prints out correctly:
>>> x = u'Normales climatol\xc3\xb3gicas'
>>> print(x)
Normales climatolÃ³gicas
>>> x = 'Normales climatol\xc3\xb3gicas'
>>> print(x)
Normales climatológicas
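
To make the suspected mechanism concrete, here is a minimal sketch. It assumes the UTF-8 bytes of the title were decoded as Latin-1 (ISO-8859-1) somewhere along the way; I have not confirmed where, or whether, the harvester does this. The variable names are only for illustration.

```python
# -*- coding: utf-8 -*-
import json

correct = u'Normales climatol\xf3gicas'   # U+00F3 is "o" with an acute accent
utf8_bytes = correct.encode('utf-8')       # the accent becomes bytes 0xC3 0xB3

# Assumption: the bytes are decoded with the wrong codec (Latin-1), so each
# byte turns into its own character.
garbled = utf8_bytes.decode('latin-1')
assert garbled == u'Normales climatol\xc3\xb3gicas'

# The JSON escapes match what the API returns for the two records.
api_direct = json.dumps(correct)       # contains \u00f3
api_harvested = json.dumps(garbled)    # contains \u00c3\u00b3

# Reversing the wrong decode recovers the intended title.
repaired = garbled.encode('latin-1').decode('utf-8')
assert repaired == correct
```

If this is what is happening, the stored value is one wrong decode away from the correct one, and no data has been lost.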
Is there a way I can change the spatial harvester behavior so that it
stores the title string without a double-encoding?
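
Short of changing the harvester itself, one workaround I could imagine is repairing the value before it is stored. The helper below is hypothetical and is not part of ckanext-spatial or ckanext-harvest. It only changes a string when the Latin-1 to UTF-8 round trip succeeds, so already-correct titles pass through untouched.

```python
def undo_double_encoding(value):
    """Reverse a Latin-1/UTF-8 mix-up in a unicode string, if one is detected.

    Hypothetical helper for illustration; not a CKAN extension function.
    """
    try:
        return value.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or its bytes are
        # not valid UTF-8, so it was probably not double-encoded.
        return value
```

A title that legitimately contains a sequence such as "Ã³" would be altered by this, so it is a stopgap rather than a real fix.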
Any help is greatly appreciated! Thanks,
--Brian
On 6/9/17 4:49 PM, Brian Bonnlander wrote:
>
> Hi CKAN developers,
>
> I'm running into a problem with harvesting non-English characters from
> ISO 19139 documents.
>
> We are using version 2.5.5 of CKAN with recently pulled versions of
> ckanext-harvest and ckanext-spatial. We're using the WAF harvester
> from ckanext-spatial.
>
> What's interesting is that when I directly enter the non-English text
> into the CKAN dataset entry form for the Title and Notes, the
> resulting dataset view shows the characters correctly. Yet the
> dataset view for the harvested version displays the characters
> incorrectly.
>
> I've checked our Postgres database settings. The client_encoding is
> set to 'UTF8' and the server_encoding is set to 'UNICODE'. Table
> encodings are UTF-8, as the install instructions indicate.
>
>
> When I make API requests for the two different versions of the record,
> I get back slightly different encodings for the non-English characters.
>
>
> The title string for the directly entered record contains: "Normales
> climatol\u00f3gicas". The encoded character displays in CKAN as an
> "o" with an accent over it, which is what we are aiming for.
>
> The title string for the harvested record contains: "Normales
> climatol\u00c3\u00b3gicas". The encoded characters display in CKAN
> as an "A" with a tilde over it, followed by a superscript "3".
>
>
> I could be wrong, but it seems possible that for the harvested record,
> the two encoded characters are a UTF-16 representation of "o" with an
> accent over it. This leads me to think that we are not specifying
> some configuration setting correctly, but that is a total guess.
>
>
> I've looked all over for people running into similar issues, but I
> haven't found anything that discusses this kind of problem.
>
> I hope someone is able to harvest the attached record and give
> insights into possible configuration problems that we may have, or
> have other insights.
>
>
> Thank you!
>
> --Brian
>