[ckan-discuss] Is CKAN suitable for textual search in a 10Gb dataset?

Rufus Pollock rufus.pollock at okfn.org
Tue Mar 25 16:34:21 UTC 2014

On Tuesday, 25 March 2014, Andrés Martano <andres at inventati.org> wrote:

>  Hello, everybody!
> I sent this message to the dev list a few days ago, but got no reply, so I
> will try here:
> As a part of my master's degree project I will be helping one of the
> biggest cities in Brazil to open a few datasets.
> CKAN seems well suited to downloading datasets as a whole. But
> in our case, we want to let citizens run a textual search inside the
> dataset using some advanced search filters (like date or other
> metadata).
> We would like to support some URIs and RDF too.
> To be clearer, I will give some details about one of the datasets:
> I don't know about other countries, but in Brazil we have a special kind of
> newspaper or gazette where the public administration publishes anything that
> must become legal (like laws or public contracts). This dataset consists of
> all these articles ordered by date and with some meta-data about what is
> written in the article and which public department is publishing it.
> *In other words, thousands of small TXT files totalling about 10 GB, and we
> need to allow textual search across all of them.*

Is the data in the TXT files structured or unstructured (i.e. do you want
raw full-text search, or will you be able to extract specific fields)?

> I saw in the docs that CKAN comes with Solr, but is there a way to use it
> to search inside a dataset? Or is it used only to search across datasets?

Solr is currently used to search metadata. If you're storing "data"
(i.e. the raw text from the text files), that goes into the DataStore, which
is PostgreSQL-based.
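
As a rough illustration of searching inside the DataStore rather than the
metadata index, here is a minimal sketch that builds a request URL for CKAN's
`datastore_search` API action, which accepts a full-text query `q` and
exact-match `filters`. The site URL, the resource id `gazette-articles` and
the `department` field are all hypothetical placeholders, and how well this
performs on 10 GB of text would need testing.

```python
import json
from urllib.parse import urlencode


def build_datastore_search_url(site_url, resource_id, query, filters=None, limit=20):
    """Build a GET URL for CKAN's datastore_search action.

    site_url, resource_id and the filter fields are assumptions made for
    illustration; they are not taken from any real CKAN instance.
    """
    params = {"resource_id": resource_id, "q": query, "limit": limit}
    if filters:
        # In a GET request, filters are passed as a JSON-encoded object.
        params["filters"] = json.dumps(filters)
    return site_url.rstrip("/") + "/api/3/action/datastore_search?" + urlencode(params)


# Hypothetical example: full-text search for "contrato" within one department.
example_url = build_datastore_search_url(
    "https://ckan.example.org/",
    "gazette-articles",
    "contrato",
    filters={"department": "Health"},
)
print(example_url)
```

Fetching that URL (for example with `urllib.request.urlopen`) should return a
JSON document whose `result.records` list holds the matching rows, assuming
the articles were loaded into the DataStore with one row per article.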

> *To use CKAN in this case I thought about adding each article as a
> dataset and grouping them by day using meta-datasets. Is this a
> reasonable solution? Will it search inside each dataset (article)?*
> If not, what if I add the text of the article as metadata of each
> article? Will it double the used space? *Will there be a form where the
> citizen can choose a date range, some categories and then do a textual
> search? After the search, will it be able to export (download) the filtered
> results (zipped, for example)?*

My first question would be: what would the *perfect* schema be if you
could have anything you wanted? Also, what exactly is the use case you are
envisaging: will specific types of people be searching this (and what are
they looking for), or do you want a general browsable interface to the
gazette?

> I think that adding each article as a dataset makes it easy to have a URI
> for each of them, right? What about a URI for the real article itself, not
> the document (I mean the link that returns 303 and forwards to the
> document), is that possible in CKAN too? And what about listings based on
> the URI (e.g. www.mysite.org/articles/2010/02/ returning all articles
> published in February of 2010)?

If you really want your own custom URLs etc., it may be better to build your
own web app using a framework like Flask (if you like Python).
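
To make the date-based listing idea concrete, here is a minimal,
framework-agnostic sketch of mapping a path such as `/articles/2010/02/` to
the articles published in that month. In a Flask or Pyramid app the same
logic would sit behind a route; the in-memory `SAMPLE_ARTICLES` list and its
field names are invented for illustration, and a real app would query a
database instead.

```python
import re
from datetime import date

# Matches paths of the form /articles/YYYY/MM/
ARTICLE_LISTING = re.compile(r"^/articles/(?P<year>\d{4})/(?P<month>\d{2})/$")

# Hypothetical in-memory data standing in for the gazette articles.
SAMPLE_ARTICLES = [
    {"id": "a1", "published": date(2010, 2, 3)},
    {"id": "a2", "published": date(2010, 2, 17)},
    {"id": "a3", "published": date(2010, 3, 1)},
]


def articles_for_path(path, articles):
    """Return the articles published in the month named by the path.

    Returns None when the path is not a valid monthly listing URL, so the
    caller can answer with a 404.
    """
    match = ARTICLE_LISTING.match(path)
    if match is None:
        return None
    year = int(match.group("year"))
    month = int(match.group("month"))
    if not 1 <= month <= 12:
        return None
    return [
        a for a in articles
        if a["published"].year == year and a["published"].month == month
    ]


february = articles_for_path("/articles/2010/02/", SAMPLE_ARTICLES)
print([a["id"] for a in february])
```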

> I code in Python and develop sites using Pyramid, so I would like to know
> what would be best in this case: develop something from scratch, or customize
> CKAN. I would rather use such a nice open project as CKAN, but, if the
> purpose of the project is too different from my needs, maybe I should do it
> from scratch...

I think it would be worth spelling out a bit more what your exact needs are
(user stories). CKAN could prove a good way to build a quick proof of
concept even if you ultimately need to move to a "roll-your-own" model for
the next iteration.


> *What do you say? Is CKAN suitable for my needs? Which extensions would
> you recommend? What would I have to implement by myself in CKAN?*
> Best regards and thanks for the attention.


Rufus Pollock
Founder and CEO | skype: rufuspollock | @rufuspollock <https://twitter.com/rufuspollock>
The Open Knowledge Foundation <http://okfn.org/>
Empowering through Open Knowledge
http://okfn.org/ | @okfn <http://twitter.com/OKFN> | OKF on Facebook <https://www.facebook.com/OKFNetwork> | Blog <http://blog.okfn.org/> | Newsletter <http://okfn.org/about/newsletter>