[ckan-dev] Problem with harvested ISO19139 records containing non-English characters
Brian Bonnlander
bonnland at ucar.edu
Mon Jun 19 23:09:02 UTC 2017
I may have found a solution to the problem: convert the WAF contents to
the "latin1" character set. It seems that the harvesting code assumes
this character set for files, and it produces the proper UTF-8 encoding
only for files of this type.
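
A minimal Python 3 sketch of the behavior this would explain. It assumes, without confirmation from the harvester source, that the file's bytes get decoded as Latin-1 somewhere along the way. The function name decode_as_latin1 is hypothetical and only stands in for that assumed step.

```python
# Hypothetical stand-in for the assumed harvester behavior: treat the raw
# bytes of a file as Latin-1, whatever encoding they were written in.
def decode_as_latin1(raw_bytes):
    return raw_bytes.decode("latin-1")


title = "Normales climatol\u00f3gicas"

# A UTF-8 file stores the accented "o" as two bytes (0xC3 0xB3). Read as
# Latin-1, those bytes become two separate characters.
from_utf8_file = decode_as_latin1(title.encode("utf-8"))

# A Latin-1 file stores the same character as one byte (0xF3), so the
# assumed decoding step recovers the original text.
from_latin1_file = decode_as_latin1(title.encode("latin-1"))

print(ascii(from_utf8_file))    # 'Normales climatol\xc3\xb3gicas'
print(ascii(from_latin1_file))  # 'Normales climatol\xf3gicas'

# The garbled form can be undone by reversing the wrong decoding step.
repaired = from_utf8_file.encode("latin-1").decode("utf-8")
print(repaired == title)        # True
```

The first printed value matches the title string seen in the logs quoted further down, which is consistent with this guess but does not prove it.
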
So my solution was to use the UNIX command "iconv" to convert the file's
character set from "utf-8" to "iso-8859-1". Then the characters showed
up correctly in CKAN.
$ cp ISO19139_File.xml TEMP
$ iconv -f utf-8 -t iso-8859-1 < TEMP > ISO19139_File.xml
$ rm TEMP
After updating the dateStamp in the file and reharvesting, the
characters looked correct.
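
The same conversion can be sketched in Python 3 for anyone who prefers to script it. This is an illustration rather than part of the harvester; the function name convert_utf8_to_latin1 is made up for the example. It encodes strictly, so a file containing a character that ISO-8859-1 cannot represent raises an error instead of losing data.

```python
import os
import tempfile


def convert_utf8_to_latin1(path):
    """Rewrite the file at `path` from UTF-8 to ISO-8859-1 in place.

    Raises UnicodeDecodeError if the file is not valid UTF-8, and
    UnicodeEncodeError if it contains a character that ISO-8859-1 cannot
    represent. In both cases the original file is left untouched.
    """
    with open(path, "rb") as src:
        text = src.read().decode("utf-8")

    data = text.encode("iso-8859-1")  # strict: fail rather than drop characters

    # Write to a temporary file in the same directory, then swap it in,
    # so a failure midway never leaves a half-written record behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as dst:
        dst.write(data)
    os.replace(tmp_path, path)


# Example call (the file name is the one used in the commands above):
# convert_utf8_to_latin1("ISO19139_File.xml")
```

As with the iconv commands, the dateStamp still has to be updated separately before reharvesting.
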
--Brian
On 6/16/17 4:36 PM, Brian Bonnlander wrote:
> UPDATE:
>
> I think there is another possible explanation: the string is
> double-encoded. The logs show that the title string is stored
> in "extras" this way: 'title': u'Normales climatol\xc3\xb3gicas'. If
> the u (unicode) prefix at the start of the string weren't there, then
> the string would print out correctly:
>
> >>> x = u'Normales climatol\xc3\xb3gicas'
> >>> print(x)
> Normales climatolÃ³gicas
>
> >>> x = 'Normales climatol\xc3\xb3gicas'
> >>> print(x)
> Normales climatológicas
>
> Is there a way I can change the spatial harvester behavior so that it
> stores the title string without a double-encoding?
>
>
> Any help is greatly appreciated! Thanks,
>
> --Brian
>
>
> On 6/9/17 4:49 PM, Brian Bonnlander wrote:
>>
>> Hi CKAN developers,
>>
>> I'm running into a problem with harvesting non-English characters
>> from ISO 19139 documents.
>>
>> We are using version 2.5.5 of CKAN with recently pulled versions of
>> ckanext-harvest and ckanext-spatial. We're using the WAF harvester
>> from ckanext-harvest.
>>
>> What's interesting is that when I directly enter the non-English text
>> into the CKAN dataset entry form for the Title and Notes, the
>> resulting dataset view shows the characters correctly. Yet the
>> dataset view for the harvested version displays the characters
>> incorrectly.
>>
>> I've checked our Postgres database settings. The client_encoding is
>> set to 'UTF8' and the server_encoding is set to 'UNICODE'. Table
>> encodings are UTF-8, as the install instructions indicate.
>>
>>
>> When I make API requests for the two different versions of the
>> record, I get back slightly different encodings for the non-English
>> characters.
>>
>>
>> The title string for the directly entered record contains: "Normales
>> climatol\u00f3gicas". The encoded character displays in CKAN as an
>> "o" with an accent over it, which is what we are aiming for.
>>
>> The title string for the harvested record contains: "Normales
>> climatol\u00c3\u00b3gicas". The encoded characters display in CKAN
>> as an "A" with a tilde over it, followed by a superscript "3".
>>
>>
>> I could be wrong, but it seems possible that for the harvested
>> record, the two encoded characters are a UTF-16 representation of "o"
>> with an accent over it. This leads me to think that we are not
>> specifying some configuration setting correctly, but that is a total
>> guess.
>>
>>
>> I've looked all over for people running into similar issues, but I
>> haven't found anything that discusses this kind of problem.
>>
>> I hope someone is able to harvest the attached record and give
>> insights into possible configuration problems that we may have, or
>> have other insights.
>>
>>
>> Thank you!
>>
>> --Brian
>>
>