[ckan-dev] Is CKAN suitable for textual search in a 10Gb dataset?

Andrés Martano andres at inventati.org
Wed Apr 9 12:36:57 UTC 2014


Thanks for the answers.

On 08-04-2014 21:01, Vitor Baptista wrote:
> Hey Andrés,
>
> You can read more about it in Pylons' documentation, or
> Routes': https://routes.readthedocs.org/en/latest/.
>
> Cheers,
Thanks!


On 09-04-2014 03:58, Dominik Moritz wrote:
> I'm not sure what you mean by cross table. What I meant (and I
> probably didn't say that very well) is that the sql search is easier
> to write if it only spans one table. If you want to query multiple
> tables, you have to join or union the results.
I am not sure, but I think my case would use only one table.
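
If it really is a single table, I guess the DataStore's SQL search can be
exercised through the datastore_search_sql API action. A minimal sketch in
Python 2 of what I have in mind; the CKAN URL, the resource id and the
column name "texto" are placeholders of mine, not real values:

    import json
    import urllib
    import urllib2

    # Hypothetical CKAN instance and DataStore resource id -- replace with real ones.
    CKAN_URL = 'http://localhost:5000'
    RESOURCE_ID = 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee'

    # Everything lives in one table, so the SQL needs no joins or unions;
    # "texto" is an assumed name for the big text column.
    sql = 'SELECT * FROM "{0}" WHERE texto ILIKE \'%palavra%\' LIMIT 10'.format(RESOURCE_ID)

    url = CKAN_URL + '/api/3/action/datastore_search_sql?' + urllib.urlencode({'sql': sql})
    response = json.load(urllib2.urlopen(url))
    for record in response['result']['records']:
        print record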

> The datastore always uses Postgres, which is usually the database for
> storage in CKAN anyway. Solr is for search in the metadata. I am not
> sure how well it fits searching large amounts of text data. In any
> case you will need to write a significant amount of custom code.
As far as I know, Solr is much faster than Postgres for full-text searches,
but maybe Postgres can handle this...
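
For what it's worth, Postgres does have built-in full-text search (tsvector
and tsquery with a GIN index), which might already be enough at the 10Gb
scale. A rough sketch of the idea, with a made-up table "documentos" and
column "texto" rather than anything CKAN creates, and a connection string
that is just a guess:

    import psycopg2

    # Connection string, table and column names are all assumptions.
    conn = psycopg2.connect('dbname=datastore_default user=ckan_default')
    cur = conn.cursor()

    # A GIN index over the tsvector is what keeps full-text search fast
    # on gigabytes of text in plain Postgres.
    cur.execute("CREATE INDEX documentos_texto_fts "
                "ON documentos USING gin (to_tsvector('portuguese', texto))")
    conn.commit()

    # Find and rank the rows whose text matches the query terms.
    cur.execute("""
        SELECT id, ts_rank(to_tsvector('portuguese', texto), consulta) AS rank
        FROM documentos, plainto_tsquery('portuguese', %s) AS consulta
        WHERE to_tsvector('portuguese', texto) @@ consulta
        ORDER BY rank DESC
        LIMIT 10
    """, ('termos de busca',))
    for row in cur.fetchall():
        print row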


As a test I tried to push one year of data, a 600Mb CSV, to the
DataStore, but got this error:
> Error:
>   File "/usr/lib/ckan/datapusher/lib/python2.7/site-packages/apscheduler/scheduler.py", line 512, in _run_job
>     retval = job.func(*job.args, **job.kwargs)
>   File "/home/andres/ckan/lib/datapusher/src/datapusher/datapusher/jobs.py", line 317, in push_to_datastore
>     for i, records in enumerate(chunky(result, 250)):
>   File "/home/andres/ckan/lib/datapusher/src/datapusher/datapusher/jobs.py", line 105, in chunky
>     item = list(itertools.islice(it, n))
>   File "/home/andres/ckan/lib/datapusher/src/datapusher/datapusher/jobs.py", line 289, in row_iterator
>     for row in row_set:
>   File "/usr/lib/ckan/datapusher/lib/python2.7/site-packages/messytables/core.py", line 218, in __iter__
>     for row in self.raw(sample=sample):
>   File "/usr/lib/ckan/datapusher/lib/python2.7/site-packages/messytables/commas.py", line 171, in raw
>     raise messytables.ReadError('Error reading CSV: %r', err)
> ReadError('Error reading CSV: %r', Error('field larger than field limit (256000)',))
The CSV has ~160k lines with 6 columns. All columns have short cells,
except the last one, where I placed the contents of the files. That last
column has cells with up to 4Mb of text. It seems this is what causes the
error... Is this incompatible with Postgres? Is there a way to adjust this
limit? Am I unable to use the DataStore?
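
From the traceback, the limit being hit looks like the Python csv module's
field size limit surfacing through messytables, not anything in Postgres
itself (a TEXT column takes multi-megabyte values without complaint). The
csv limit can be raised from Python; a minimal sketch of the idea follows,
though exactly where to apply it inside DataPusher or messytables is my
assumption, and 'dados.csv' is just a placeholder file name:

    import csv
    import sys

    # The csv module refuses to parse a single field bigger than this limit,
    # which is what "field larger than field limit" is complaining about.
    print csv.field_size_limit()          # returns the current limit

    # Raise it before reading the file; a patched DataPusher/messytables
    # would have to do the same before it starts parsing.
    csv.field_size_limit(sys.maxint)      # Python 2; use sys.maxsize on Python 3

    with open('dados.csv', 'rb') as f:
        for row in csv.reader(f):
            pass                          # rows with 4Mb cells now parse fine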


