[ckan-discuss] Is CKAN suitable for textual search in a 10Gb dataset?

Tue Mar 25 19:28:24 UTC 2014

Thanks for the answers, Hanssens and Rufus.

I didn't know about DataTank. It seems a very interesting project, and
quite beautiful too. ;)
But after checking the demo site, I still think it's more focused on the
"computer expert", the one who wants a simple visualization just to take
a look at the data, and then download it or deal with it via an API.

> I think it would be worth spelling out a bit more what your exact
> needs are (user stories).
Sure. We're thinking about 3 types of users:
- The "computer expert" that will be happy to download the entire DB so
he can reuse it in another program. CKAN solves it out of the box.
- The "academic" that wants the data in RDF to do more complex stuff.
Generally he can download the whole DB too, so we could solve this by
generating an RDF version of the DB and posting it in CKAN too.
- The "common citizen", who isn't interested in the whole DB and
wouldn't be able to deal with a 10Gb DB. He doesn't even know what
"CSV", "JSON", "DB" or "10Gb" mean.

The third user is the one I'm still unsure about: can I help him with a
tool like CKAN, or will I have to code one myself?
This user wants an interface like this:
http://query.nytimes.com/search/sitesearch/#/cats/
There it's possible to enter some text, pick some categories and a date
period, do a textual search and browse the results. He'll be looking for
a specific article or group of articles. Most of the time he'll be happy
to read the articles online, but sometimes he'll need to export the
results as a ZIP with TXTs.

Since the third user needs to access the articles individually, I
thought about using "cool" URLs for each one. Not that the common
citizen cares about it, but the RDF guy does.

> Is the data in the txt files structured or unstructured (ie. do you
> want raw full text search or will you be able to extract specific fields)
Raw search would do it.

> If not, 10 GB is pretty small according to today's standard, depending on what your requirements are, 
> even a simple command line tool like grep could do the trick.
You have a point. Maybe Whoosh
<https://bitbucket.org/mchaput/whoosh/wiki/Home> can solve it too, since
it's pure Python, reducing integration costs.
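
To get a feeling for the grep-style baseline, here is a rough sketch of
the third user's search in plain Python (standard library only, so this
is not Whoosh and not anything CKAN provides). The Article record, its
field names, the sample data and the function names are all made up for
illustration; the real dataset may look different.

```python
import io
import zipfile
from dataclasses import dataclass
from datetime import date


@dataclass
class Article:
    # Hypothetical record; real field names depend on the dataset.
    slug: str        # could double as the "cool" URL path of the article
    category: str
    published: date
    text: str


# Made-up sample data, only to exercise the functions below.
ARTICLES = [
    Article("cat-show-2013", "cats", date(2013, 5, 2),
            "The annual cat show opened today."),
    Article("dog-park", "dogs", date(2013, 6, 1),
            "A new dog park opened downtown."),
    Article("cat-cafe", "cats", date(2014, 1, 15),
            "A cat cafe opened in the city."),
]


def search(articles, words, categories=None, start=None, end=None):
    """Return articles containing every word and matching the filters."""
    needles = [w.lower() for w in words.split()]
    hits = []
    for a in articles:
        if categories and a.category not in categories:
            continue
        if start and a.published < start:
            continue
        if end and a.published > end:
            continue
        body = a.text.lower()
        # Plain substring match, like grep -i: "cat" also matches "category".
        if all(n in body for n in needles):
            hits.append(a)
    return hits


def export_zip(hits):
    """Pack the hits into an in-memory ZIP, one TXT file per article."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for a in hits:
            zf.writestr(a.slug + ".txt", a.text)
    return buf.getvalue()
```

Something like search(ARTICLES, "cat opened", categories={"cats"},
start=date(2014, 1, 1)) would give the list to browse, and export_zip()
the download. This scans every article on each query, which is the part
an index such as Whoosh would replace.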