[ckan-dev] Problem with harvested ISO19139 records containing non-English characters

Fri Jun 16 22:36:03 UTC 2017

UPDATE:

I think there is another possible explanation: the string is 
double-encoded. The logs show that the title string is stored in 
"extras" this way: 'title': u'Normales climatol\xc3\xb3gicas'. With the 
u prefix, \xc3 and \xb3 are treated as two separate characters. Without 
it (in Python 2), they are the two UTF-8 bytes of the accented "o", and 
the string prints out correctly:

 >>> x = u'Normales climatol\xc3\xb3gicas'
 >>> print(x)
Normales climatolÃ³gicas

 >>> x = 'Normales climatol\xc3\xb3gicas'
 >>> print(x)
Normales climatológicas
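
To check the double-encoding idea, here is a minimal sketch in Python 3 
syntax. It assumes the stored value is the UTF-8 encoding of the title 
that was then decoded as Latin-1. fix_mojibake is a hypothetical helper 
of my own, not something from CKAN or the harvester extensions:

```python
# Sketch only: fix_mojibake is a hypothetical helper, not a CKAN API.
stored = u'Normales climatol\xc3\xb3gicas'   # value seen in the harvester log
expected = u'Normales climatol\xf3gicas'     # accented "o" is U+00F3


def fix_mojibake(text):
    """Undo UTF-8 bytes that were mis-decoded as Latin-1.

    Returns the input unchanged when it does not fit that pattern.
    """
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Correct text -> UTF-8 bytes -> wrongly decoded as Latin-1
# reproduces exactly what the log shows.
assert expected.encode('utf-8').decode('latin-1') == stored

# Reversing those two steps recovers the intended title.
assert fix_mojibake(stored) == expected
```

This would only repair a value after the fact. It doesn't explain 
where in the harvesting pipeline the wrong decode happens.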

Is there a way I can change the spatial harvester behavior so that it 
stores the title string without a double-encoding?

Any help is greatly appreciated!   Thanks,

--Brian

On 6/9/17 4:49 PM, Brian Bonnlander wrote:
>
> Hi CKAN developers,
>
> I'm running into a problem with harvesting non-English characters from 
> ISO 19139 documents.
>
> We are using version 2.5.5 of CKAN with recently pulled versions of 
> ckanext-harvest and ckanext-spatial.   We're using the WAF harvester 
> from ckanext-harvest.
>
> What's interesting is that when I directly enter the non-English text 
> into the CKAN dataset entry form for the Title and Notes, the 
> resulting dataset view shows the characters correctly.   Yet the 
> dataset view for the harvested version displays the characters 
> incorrectly.
>
> I've checked our Postgres database settings.  The client_encoding is 
> set to 'UTF8' and the server_encoding is set to 'UNICODE'. Table 
> encodings are UTF-8, as the install instructions indicate.
>
>
> When I make API requests for the two different versions of the record, 
> I get back slightly different encodings for the non-English characters.
>
>
> The title string for the directly entered record contains: "Normales 
> climatol\u00f3gicas".   The encoded character displays in CKAN as an 
> "o" with an accent over it, which is what we are aiming for.
>
> The title string for the harvested record contains:  "Normales 
> climatol\u00c3\u00b3gicas".   The encoded characters display in CKAN 
> as an "A" with a tilde over it, followed by a superscript "3".
>
>
> I could be wrong, but it seems possible that for the harvested record, 
> the two encoded characters are a UTF-16 representation of "o" with an 
> accent over it.   This leads me to think that we are not specifying 
> some configuration setting correctly, but that is a total guess.
>
>
> I've looked all over for people running into similar issues, but I 
> haven't found anything that discusses this kind of problem.
>
> I hope someone is able to harvest the attached record and give 
> insights into possible configuration problems that we may have, or 
> have other insights.
>
>
> Thank you!
>
> --Brian
>