[ckan-dev] Is CKAN suitable for textual search in a 10Gb dataset?

Andrés Martano andres at inventati.org
Tue Apr 22 00:08:43 UTC 2014


To continue with the quest:

After trying to open the CSV with Python's csv module, I found out that
there was a NULL char inside it, which seems to be what was crashing the
Datapusher. Many of the original TXTs used exotic encodings, so I had to
try to identify them and purge the "ugly" chars (there was a horde of
them, which could also have been crashing the Datapusher).
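In case it helps someone hitting the same problem, the cleanup was roughly
the sketch below. The encoding detection is not shown (I ended up trying a
few candidate encodings per file); this only shows the purge of NULL bytes
and other control characters, and the file names are placeholders:

import unicodedata

def purge_ugly_chars(text):
    # Keep tabs and newlines, drop NULLs and every other control character.
    return ''.join(c for c in text
                   if c in '\t\r\n' or not unicodedata.category(c).startswith('C'))

with open('original.csv', 'rb') as src, open('clean.csv', 'w', encoding='utf-8') as dst:
    for raw in src:
        # latin-1 never fails to decode, so it is a crude fallback here for
        # the files whose real encoding could not be identified.
        dst.write(purge_ugly_chars(raw.decode('latin-1')))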

After cleaning the CSV, I tried to upload and push it again and got an
Error 500 inside CKAN's Datapusher interface. Checking the logs, the
problem was Postgres' tsvector (used for the textual searches), which has
a maximum size of 1 MB. The line in the CSV causing the error had a 3 MB
text field. A 1 MB tsvector can generally index more than 10 MB of text,
but this field holds a list of names and numbers with few repetitions, so
it needs a bigger tsvector.
So I had to add extra code to break big text fields into multiple rows,
plus an extra column telling which rows compose an article.
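The splitting step looks roughly like this. It is a simplified sketch:
"texto" is just a placeholder for the big text column, the new columns
are named article_id/part only for illustration, and the split is by
character count rather than by the actual tsvector size, hence the margin:

import csv

MAX_FIELD = 900000  # characters, a safe margin under the 1 MB tsvector limit

def split_text(text, size=MAX_FIELD):
    return [text[i:i + size] for i in range(0, len(text), size)] or ['']

with open('clean.csv', newline='') as src, open('split.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ['article_id', 'part'])
    writer.writeheader()
    for article_id, row in enumerate(reader):
        # One output row per chunk; article_id/part say which chunks belong
        # to the same original article and in what order.
        for part, chunk in enumerate(split_text(row['texto'])):
            writer.writerow(dict(row, texto=chunk, article_id=article_id, part=part))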

I am pushing the CSV again, now with no field bigger than 1 MB (I hope),
and via the API this time.
Then, if it works, I'll try the textual search.
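The push via the API is roughly the sketch below. The URL, API key and
resource id are placeholders, and I am assuming datastore_create /
datastore_upsert behave as described in the DataStore API docs:

import csv
import json
import requests

CKAN = 'http://my.ckan.instance'
HEADERS = {'Content-Type': 'application/json', 'Authorization': 'my-api-key'}
RESOURCE_ID = 'the-resource-id'

def action(name, payload):
    r = requests.post('%s/api/3/action/%s' % (CKAN, name),
                      data=json.dumps(payload), headers=HEADERS)
    r.raise_for_status()
    return r.json()

with open('split.csv', newline='') as f:
    reader = csv.DictReader(f)
    # Create the DataStore table first, declaring every column as text.
    action('datastore_create', {
        'resource_id': RESOURCE_ID,
        'fields': [{'id': name, 'type': 'text'} for name in reader.fieldnames],
        'force': True})
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) == 500:  # push in chunks to keep each request small
            action('datastore_upsert', {'resource_id': RESOURCE_ID,
                                        'records': batch,
                                        'method': 'insert', 'force': True})
            batch = []
    if batch:
        action('datastore_upsert', {'resource_id': RESOURCE_ID,
                                    'records': batch,
                                    'method': 'insert', 'force': True})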

Still no examples of SQL textual search via the API?
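What I am imagining is something along these lines (not tested yet; it
assumes datastore_search_sql is enabled and that the DataStore's
_full_text tsvector column can be queried directly):

import json
import requests

CKAN = 'http://my.ckan.instance'
RESOURCE_ID = 'the-resource-id'

sql = 'SELECT * FROM "%s" WHERE _full_text @@ plainto_tsquery(\'some words\')' % RESOURCE_ID
r = requests.post('%s/api/3/action/datastore_search_sql' % CKAN,
                  data=json.dumps({'sql': sql}),
                  headers={'Content-Type': 'application/json'})
print(r.json()['result']['records'])

Is that the intended way to do it, or is there a better approach?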


