[ckan-dev] Harvest/Spatial Harvest Unicode Character Issues

Tue Oct 31 04:03:51 UTC 2017

Yes, that would be the best place to track this bug. I quickly checked the
code and it seems to be aware of this issue (see
https://github.com/ckan/ckanext-spatial/blob/7970f2b656e5e35d9a0e539afc61679cf63861df/ckanext/spatial/harvesters/base.py#L771).

Are you using the latest version of ckanext-spatial? What method do you use
to harvest? CSW?

- Stefan

On Oct 26, 2017 17:52, "Nathan Hook" <nhook at ucar.edu> wrote:

> A friendly 'bump' on this issue.  Should I create a bug report in the ckan
> harvesting/spatial harvester github repositories?
>
> Thank you for your time,
>
> Nathan
>
> On Tue, Oct 10, 2017 at 2:40 PM, Nathan Hook <nhook at ucar.edu> wrote:
>
>> Good Day,
>>
>> We are experiencing unicode encoding issues when harvesting xml ISO
>> documents via the harvest/spatial harvest plugins.
>>
>> If we harvest an xml ISO document that has unicode characters in it, the
>> unicode characters seem to get broken down into their individual bytes,
>> each shown as a separate character, instead of being treated as the
>> original unicode character.
>>
>> Here is an example.
>>
>> If we have the following abstract in our iso xml:
>> ° DEGREE SIGN \u00b0 ⁻ SUPERSCRIPT MINUS \u207b ³ SUPERSCRIPT THREE
>> \u00b3 µ MICRO SIGN \u00b5 ₂ SUBSCRIPT 2 \u2082
>>
>> We will see the following on our ckan page:
>> Â° DEGREE SIGN \u00b0 â » SUPERSCRIPT MINUS \u207b Â³ SUPERSCRIPT THREE
>> \u00b3 Âµ MICRO SIGN \u00b5 â‚‚ SUBSCRIPT 2 \u2082
>>
>> Which is not the desired outcome.
>>
>> It looks as though the ISO xml file is being interpreted as an ASCII or
>> Latin-1 encoded file instead of utf-8.
>>
>> We can kind of prove this by looking at the characters (bytes) involved
>> with making the example unicode characters.
>>
>> DEGREE SIGN gets broken down into the following bytes:  c2 b0
>>
>> Then looking at the Latin-1 character set table:
>> https://en.wikipedia.org/wiki/ISO/IEC_8859-1
>>
>> We can see the following:
>> c2 = Uppercase a-circumflex
>> b0 = Degree Symbol
>>
>>
>> Breaking down the rest of the characters:
>>
>> Superscript Minus: e2 81 bb (Lowercase a-circumflex, C1 control character,
>> Right Guillemet)
>> Superscript Three: c2 b3 (Uppercase a-circumflex, Superscript Three)
>> Micro Sign: c2 b5 (Uppercase a-circumflex, Micro Sign)
>> Subscript 2:  e2 82 82 (Lowercase a-circumflex, C1 control character,
>> C1 control character)
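
The byte breakdown above can be reproduced with a short Python 3 sketch. It
only illustrates the suspected mis-decoding and is not code from the
harvester; the helper names utf8_hex and misread_as_latin1 are made up for
this example.

```python
import unicodedata

# The five characters from the example abstract.
SAMPLES = ["\u00b0", "\u207b", "\u00b3", "\u00b5", "\u2082"]


def utf8_hex(ch):
    """Return the UTF-8 encoding of ch as space-separated hex bytes."""
    return " ".join("{:02x}".format(b) for b in ch.encode("utf-8"))


def misread_as_latin1(ch):
    """Decode the UTF-8 bytes of ch as Latin-1, one character per byte."""
    return ch.encode("utf-8").decode("latin-1")


for ch in SAMPLES:
    wrong = misread_as_latin1(ch)
    # Control characters have no Unicode name, so supply a placeholder.
    names = [unicodedata.name(c, "<control>") for c in wrong]
    print(unicodedata.name(ch), "|", utf8_hex(ch), "|", ", ".join(names))
```

Python's latin-1 codec maps every byte to a code point, so this never raises.
The 0x81 and 0x82 bytes come out as unnamed control characters, which may be
why the page shows "‚" (the Windows-1252 reading of 0x82) for those bytes.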
>>
>>
>> My guess is that the harvester code is assuming latin-1 (or similar)
>> encoding and reading the file in that way instead of assuming utf-8.
>>
>> Here are a couple of very simple python code examples of what could be
>> happening:
>> >>> print '°'.decode("latin-1")
>> Â°
>> >>> '°'.decode("latin-1")
>> u'\xc2\xb0'
>> >>> '°'
>> '\xc2\xb0'
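
The session above is Python 2, where '°' is a byte string. A rough Python 3
equivalent is sketched below, assuming the text was encoded as UTF-8 and then
decoded as Latin-1 somewhere along the way. The function names are
hypothetical, and nothing here is taken from ckanext-harvest or
ckanext-spatial.

```python
# The example abstract, written with escapes so the source file stays ASCII.
ABSTRACT = (
    "\u00b0 DEGREE SIGN \u207b SUPERSCRIPT MINUS \u00b3 SUPERSCRIPT THREE "
    "\u00b5 MICRO SIGN \u2082 SUBSCRIPT 2"
)


def simulate_mojibake(text, wrong_encoding="latin-1"):
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair_mojibake(garbled, wrong_encoding="latin-1"):
    """Reverse simulate_mojibake: recover the bytes, then decode as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")


garbled = simulate_mojibake(ABSTRACT)
print(ascii(garbled))                       # escaped view of the broken text
print(repair_mojibake(garbled) == ABSTRACT)  # round trip restores the text
```

The repair step is only a diagnostic: if the round trip restores the original
text, that supports the wrong-codec theory. It works only when no bytes were
lost along the way, so decoding the document as UTF-8 in the first place would
still be the proper fix.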
>>
>>
>> I have attached an example ISO xml file that shows the problem we are
>> running into.
>>
>> Is this a bug with the harvest/spatial harvest code?  Or are we doing
>> something wrong on our end?
>>
>> Any friendly knowledge or advice would be greatly appreciated.
>>
>> Thank you for your time,
>>
>> Nathan
>>
>>
>>
>
> _______________________________________________
> ckan-dev mailing list
> ckan-dev at lists.okfn.org
> https://lists.okfn.org/mailman/listinfo/ckan-dev
> Unsubscribe: https://lists.okfn.org/mailman/options/ckan-dev
>
>