[ckan-dev] Harvest/Spatial Harvest Unicode Character Issues

Tue Oct 10 20:40:38 UTC 2017

Good Day,

We are experiencing unicode encoding issues when harvesting xml ISO
documents via the harvest/spatial harvest plugins.

If we harvest an xml ISO document that has unicode characters in it, each
unicode character seems to get broken down into its individual utf-8
bytes, and each byte is then displayed as a separate character instead of
the original unicode character.

Here is an example.

If we have the following abstract in our iso xml:
° DEGREE SIGN \u00b0 ⁻ SUPERSCRIPT MINUS \u207b ³ SUPERSCRIPT THREE \u00b3
µ MICRO SIGN \u00b5 ₂ SUBSCRIPT 2 \u2082

We will see the following on our ckan page:
Â° DEGREE SIGN \u00b0 â » SUPERSCRIPT MINUS \u207b Â³ SUPERSCRIPT THREE
\u00b3 Âµ MICRO SIGN \u00b5 â‚‚ SUBSCRIPT 2 \u2082

Which is not the desired outcome.

It looks as though the ISO xml file is being interpreted as a Latin-1 (or
similar single-byte encoding) file instead of utf-8.

We can support this by looking at the utf-8 bytes that make up the
example unicode characters.

DEGREE SIGN is encoded in utf-8 as the following bytes:  c2 b0

Then looking at the Latin-1 character set table:
https://en.wikipedia.org/wiki/ISO/IEC_8859-1

We can see the following:
c2 = Uppercase a-circumflex
b0 = Degree Symbol

Breaking down the rest of the characters:

Superscript Minus: e2 81 bb (Lowercase a-circumflex, C1 control
character, Right Guillemet)
Superscript Three: c2 b3 (Uppercase a-circumflex, Superscript Three)
Micro Sign: c2 b5 (Uppercase a-circumflex, Micro Sign)
Subscript 2:  e2 82 82 (Lowercase a-circumflex, C1 control character,
C1 control character)
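
The table lookup above can also be done programmatically. This is only a
minimal sketch (Python 3, standard library only); the helper name
latin1_breakdown is made up for illustration and is not part of CKAN or
the harvesters. It relies on the fact that Latin-1 maps byte N directly
to code point N.

```python
import unicodedata

# The five characters from the example abstract.
SAMPLES = ["\u00b0", "\u207b", "\u00b3", "\u00b5", "\u2082"]


def latin1_breakdown(ch):
    """Return (hex byte, Latin-1 character name) pairs for ch's UTF-8 bytes."""
    pairs = []
    for byte in ch.encode("utf-8"):
        # C1 control characters (0x80-0x9f) have no Unicode name,
        # so fall back to a placeholder label for them.
        name = unicodedata.name(chr(byte), "<control>")
        pairs.append(("{:02x}".format(byte), name))
    return pairs


for ch in SAMPLES:
    print(unicodedata.name(ch), latin1_breakdown(ch))
```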

My guess is that the harvester code is assuming latin-1 (or similar)
encoding and reading the file in that way instead of assuming utf-8.
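
To illustrate that guess, here is a small sketch (Python 3, standard
library only) of how the same bytes can parse correctly or incorrectly
depending on whether they are decoded before parsing. This is not the
harvester code, which I have not traced; the sample document and both
function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a harvested record, not the attached file.
# With no XML declaration or BOM, XML parsers treat the bytes as UTF-8.
RAW = "<abstract>\u00b0 \u00b5</abstract>".encode("utf-8")


def parse_bytes(raw):
    """Pass the raw bytes to the parser and let it pick the encoding."""
    return ET.fromstring(raw).text


def parse_after_latin1_decode(raw):
    """Simulate the suspected bug: decode as Latin-1, then parse."""
    return ET.fromstring(raw.decode("latin-1")).text


print(ascii(parse_bytes(RAW)))
print(ascii(parse_after_latin1_decode(RAW)))
```

If something like the second path is happening anywhere between the
download and the parser, it would produce the output we are seeing.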

Here are a couple of very simple python 2 code examples of what could be
happening:
>>> print '°'.decode("latin-1")
Â°
>>> '°'.decode("latin-1")
u'\xc2\xb0'
>>> '°'
'\xc2\xb0'
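
For anyone on Python 3, where str and bytes are separate types, a rough
equivalent of the session above is sketched below. It also shows that
the damage is reversible as long as the garbled text was produced by a
plain Latin-1 decode. The function names mojibake and repair are made up
for this example.

```python
def mojibake(text):
    """UTF-8 bytes wrongly decoded as Latin-1 (the suspected failure)."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled):
    """Undo mojibake() by reversing the two steps."""
    return garbled.encode("latin-1").decode("utf-8")


broken = mojibake("\u00b0")
print(ascii(broken))               # '\xc2\xb0'
print(repair(broken) == "\u00b0")  # True
```

This would only be a workaround for records that are already stored; the
real fix would be to decode the document as utf-8 in the first place.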

I have attached an example ISO xml file that shows the problem we are
running into.

Is this a bug with the harvest/spatial harvest code?  Or are we doing
something wrong on our end?

Any friendly knowledge or advice would be greatly appreciated.

Thank you for your time,

Nathan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Unicode_Character_Test_with_No_CDATA.xml
Type: text/xml
Size: 11474 bytes
Desc: not available
URL: <http://lists.okfn.org/pipermail/ckan-dev/attachments/20171010/27b6a432/attachment-0002.xml>