[ckan-dev] Harvest/Spatial Harvest Unicode Character Issues

Thu Oct 26 15:52:24 UTC 2017

A friendly 'bump' on this issue.  Should I create a bug report in the ckan
harvesting/spatial harvester github repositories?

Thank you for your time,

Nathan

On Tue, Oct 10, 2017 at 2:40 PM, Nathan Hook <nhook at ucar.edu> wrote:

> Good Day,
>
> We are experiencing unicode encoding issues when harvesting xml ISO
> documents via the harvest/spatial harvest plugins.
>
> If we harvest an XML ISO document that contains Unicode characters, each
> character appears to be split into its individual UTF-8 bytes, with every
> byte shown as a separate character, instead of being kept as the original
> Unicode character.
>
> Here is an example.
>
> If we have the following abstract in our iso xml:
> ° DEGREE SIGN \u00b0 ⁻ SUPERSCRIPT MINUS \u207b ³ SUPERSCRIPT THREE \u00b3
> µ MICRO SIGN \u00b5 ₂ SUBSCRIPT 2 \u2082
>
> We will see the following on our ckan page:
> Â° DEGREE SIGN \u00b0 â » SUPERSCRIPT MINUS \u207b Â³ SUPERSCRIPT THREE
> \u00b3 Âµ MICRO SIGN \u00b5 â‚‚ SUBSCRIPT 2 \u2082
>
> Which is not the desired outcome.
>
> It looks as though the ISO xml file is being interpreted as an ASCII or
> Latin-1 encoded file instead of utf-8.
>
> We can support this by looking at the bytes that make up the example
> Unicode characters when they are encoded as UTF-8.
>
> DEGREE SIGN is encoded as the following bytes:  c2 b0
>
> Then looking at the Latin-1 character set table:
> https://en.wikipedia.org/wiki/ISO/IEC_8859-1
>
> We can see the following:
> c2 = Uppercase a-circumflex
> b0 = Degree Symbol
>
>
> Breaking down the rest of the characters:
>
> Superscript Minus: e2 81 bb (Lowercase a-circumflex, C1 control
> character, Right Guillemet)
> Superscript Three: c2 b3 (Uppercase a-circumflex, Superscript Three)
> Micro Sign: c2 b5 (Uppercase a-circumflex, Micro Sign)
> Subscript 2:  e2 82 82 (Lowercase a-circumflex, C1 control character,
> C1 control character)
>
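The byte breakdown above can be reproduced with a short sketch. Note that this is Python 3, not the Python 2 used elsewhere in this message, and that `SAMPLES` and `describe` are names made up for illustration; nothing here is taken from the harvester code.

```python
import unicodedata

# Hypothetical sample set: the characters from the example abstract.
SAMPLES = ["\u00b0", "\u207b", "\u00b3", "\u00b5", "\u2082"]


def describe(ch):
    """Return (Unicode name, UTF-8 bytes as hex, bytes misread as Latin-1)."""
    raw = ch.encode("utf-8")
    hex_bytes = " ".join("{:02x}".format(b) for b in raw)
    # Latin-1 maps every byte to exactly one character, so this never fails.
    misread = raw.decode("latin-1")
    return unicodedata.name(ch), hex_bytes, misread


for ch in SAMPLES:
    name, hex_bytes, misread = describe(ch)
    print("{}: {} -> {!r}".format(name, hex_bytes, misread))
```

Each character should come back as two or three bytes, and the Latin-1 reading of those bytes should match the leading a-circumflex pattern we see on the CKAN page.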
>
> My guess is that the harvester code is assuming latin-1 (or similar)
> encoding and reading the file in that way instead of assuming utf-8.
>
> Here are a couple of very simple python code examples of what could be
> happening:
> >>> print '°'.decode("latin-1")
> Â°
> >>> '°'.decode("latin-1")
> u'\xc2\xb0'
> >>> '°'
> '\xc2\xb0'
>
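For anyone on Python 3, the same suspected mis-decoding can be sketched as below, together with the reverse step that recovers the original text. This only illustrates the hypothesis (UTF-8 bytes read as Latin-1); the function names `misdecode` and `repair` are invented for the example and do not come from the harvest or spatial plugins.

```python
def misdecode(text):
    """Encode text as UTF-8, then wrongly decode those bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled):
    """Undo misdecode(): recover the raw bytes, then decode them as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


original = "\u00b0 DEGREE SIGN"
garbled = misdecode(original)
print(repr(garbled))                # the a-circumflex form seen on the page
print(repair(garbled) == original)  # the round trip restores the text
```

If the stored text can be repaired this way, that would suggest the bytes are intact and only the decoding step is wrong.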
>
> I have attached an example ISO xml file that shows the problem we are
> running into.
>
> Is this a bug with the harvest/spatial harvest code?  Or are we doing
> something wrong on our end?
>
> Any friendly knowledge or advice would be greatly appreciated.
>
> Thank you for your time,
>
> Nathan
>
>
>