[ckan-dev] Problem with harvested ISO19139 records containing non-English characters

Mon Jun 19 23:09:02 UTC 2017

I may have found a solution to the problem: convert the files in the 
WAF to the "latin1" (ISO-8859-1) character set.   It seems that the 
harvesting code assumes this character set for the files it reads, and 
it encodes them to UTF-8 correctly when that assumption holds.
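
For anyone who wants to see the effect in isolation, here is a minimal 
Python 3 sketch of what I think is happening. It assumes (I have not 
confirmed this in the harvester source) that the UTF-8 bytes of the 
file are being decoded as Latin-1 somewhere along the way. The variable 
names are just for illustration.

```python
# The title as it should appear.
correct = "Normales climatológicas"

# The bytes as stored in a UTF-8 encoded XML file: "ó" becomes 0xC3 0xB3.
utf8_bytes = correct.encode("utf-8")

# Assumption: something reads those bytes as if they were Latin-1.
# Each byte then turns into its own character, "Ã" and "³".
garbled = utf8_bytes.decode("latin-1")

# Reversing the wrong step recovers the original text.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(utf8_bytes))
print(garbled)
print(repaired)
```

The garbled value is the same sequence of code points (\u00c3\u00b3) 
that the API returns for the harvested record, which is why I think 
this is the mechanism.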

So my solution was to use the UNIX command "iconv" to convert the file's 
character set from "utf-8" to "iso-8859-1".  Then the characters showed 
up correctly in CKAN.

$ cp ISO19139_File.xml TEMP
$ iconv -f utf-8 -t iso-8859-1 < TEMP > ISO19139_File.xml
$ rm TEMP
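
If iconv is not available, the same conversion can be sketched in 
Python 3. This is only an illustration, not part of CKAN; the function 
name reencode_file is made up for this example. Note that it fails with 
a UnicodeEncodeError if the file contains a character that has no 
Latin-1 equivalent, and it does not touch any encoding named in the 
XML declaration.

```python
def reencode_file(path, src="utf-8", dst="iso-8859-1"):
    """Rewrite the file at `path` in place, converting it from `src` to `dst`.

    Hypothetical helper for illustration only.
    """
    # Read raw bytes so that line endings are left exactly as they are.
    with open(path, "rb") as f:
        text = f.read().decode(src)

    # Encode before opening the file for writing, so a character that
    # cannot be represented in `dst` raises before the file is truncated.
    data = text.encode(dst)

    with open(path, "wb") as f:
        f.write(data)
    return data


if __name__ == "__main__":
    import sys
    for name in sys.argv[1:]:
        reencode_file(name)
```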

After updating the dateStamp in the file and reharvesting, the 
characters looked correct.

--Brian

On 6/16/17 4:36 PM, Brian Bonnlander wrote:
> UPDATE:
>
> I think there is another possible explanation: the string is 
> double-encoded. The logs are showing that the title string is stored 
> in "extras" this way: 'title': u'Normales climatol\xc3\xb3gicas'. If 
> the u (unicode) prefix at the start of the string literal weren't 
> there, then the string would print out correctly (Python 2):
>
> >>> x = u'Normales climatol\xc3\xb3gicas'
> >>> print(x)
> Normales climatolÃ³gicas
>
> >>> x = 'Normales climatol\xc3\xb3gicas'
> >>> print(x)
> Normales climatológicas
>
> Is there a way I can change the spatial harvester behavior so that it 
> stores the title string without a double-encoding?
>
>
> Any help is greatly appreciated!   Thanks,
>
> --Brian
>
>
> On 6/9/17 4:49 PM, Brian Bonnlander wrote:
>>
>> Hi CKAN developers,
>>
>> I'm running into a problem with harvesting non-English characters 
>> from ISO 19139 documents.
>>
>> We are using version 2.5.5 of CKAN with recently pulled versions of 
>> ckanext-harvest and ckanext-spatial.   We're using the WAF harvester 
>> from ckanext-harvest.
>>
>> What's interesting is that when I directly enter the non-English text 
>> into the CKAN dataset entry form for the Title and Notes, the 
>> resulting dataset view shows the characters correctly.   Yet the 
>> dataset view for the harvested version displays the characters 
>> incorrectly.
>>
>> I've checked our Postgres database settings.  The client_encoding is 
>> set to 'UTF8' and the server_encoding is set to 'UNICODE'. Table 
>> encodings are UTF-8, as the install instructions indicate.
>>
>>
>> When I make API requests for the two different versions of the 
>> record, I get back slightly different encodings for the non-English 
>> characters.
>>
>>
>> The title string for the directly entered record contains: "Normales 
>> climatol\u00f3gicas".   The encoded character displays in CKAN as an 
>> "o" with an accent over it, which is what we are aiming for.
>>
>> The title string for the harvested record contains:  "Normales 
>> climatol\u00c3\u00b3gicas".   The encoded characters display in CKAN 
>> as an "A" with a tilde over it, followed by a superscript "3".
>>
>>
>> I could be wrong, but it seems possible that for the harvested 
>> record, the two encoded characters are a UTF-16 representation of "o" 
>> with an accent over it.   This leads me to think that we are not 
>> specifying some configuration setting correctly, but that is a total 
>> guess.
>>
>>
>> I've looked all over for people running into similar issues, but I 
>> haven't found anything that discusses this kind of problem.
>>
>> I hope someone is able to harvest the attached record and give 
>> insights into possible configuration problems that we may have, or 
>> have other insights.
>>
>>
>> Thank you!
>>
>> --Brian
>>
>