[ckan-discuss] Is CKAN suitable for textual search in a 10Gb dataset?

Rufus Pollock rufus.pollock at okfn.org
Tue Mar 25 16:34:21 UTC 2014


On Tuesday, 25 March 2014, Andrés Martano <andres at inventati.org> wrote:

>  Hello, everybody!
>
> I sent this message to the dev list a few days ago, but got no reply, so I
> will try here:
>
>
> As a part of my master's degree project I will be helping one of the
> biggest cities in Brazil to open a few datasets.
> CKAN seems pretty good for downloading datasets as a whole. But
> in our case, we want to allow citizens to do a textual search inside the
> dataset using some advanced search features (like date or other
> meta-data).
> We would like to support some URIs and RDF too.
>
> To be more clear I will give some details about one of the datasets:
>
> I don't know about other countries, but in Brazil we have a special kind of
> newspaper or gazette where the public administration publishes anything that
> must become legal (like laws or public contracts). This dataset consists of
> all these articles ordered by date and with some meta-data about what is
> written in the article and which public department is publishing it.
> *In other words, thousands of small TXT files totalling about 10 GB, and we
> need to allow textual search in all that.*
>

Is the data in the txt files structured or unstructured (i.e. do you want
raw full-text search, or will you be able to extract specific fields)?


> I saw in the docs that CKAN comes with Solr, but is there a way to use it
> to search inside a dataset? Or is it used only to search across datasets?
>

Solr is used at the moment to search metadata only. If you're storing "data"
(i.e. the raw text from the text files), that goes into the DataStore, which
is PostgreSQL-based.
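
To make the DataStore route a bit more concrete, here is a minimal sketch of
a full-text query against the DataStore's datastore_search API action. The
site URL, resource ID and the "department" field are hypothetical
placeholders, and it assumes the article text has already been loaded into a
DataStore resource. Note that "filters" does exact matching only, so a date
range would need something else (datastore_search_sql is one option).

```python
import json
import urllib.request


def build_search_payload(resource_id, text, filters=None, limit=20):
    """Build the JSON body for CKAN's datastore_search action.

    `q` is the full-text query; `filters` is an optional dict of
    exact-match conditions on fields of the resource.
    """
    payload = {"resource_id": resource_id, "q": text, "limit": limit}
    if filters:
        payload["filters"] = filters
    return payload


def datastore_search(base_url, payload):
    """POST the payload to a CKAN site and return the matching records."""
    url = base_url.rstrip("/") + "/api/3/action/datastore_search"
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["result"]["records"]


# Hypothetical usage (resource ID and field name are made up):
# payload = build_search_payload("my-resource-id", "contract",
#                                filters={"department": "Finance"})
# records = datastore_search("http://ckan.example.org", payload)
```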


> *To use CKAN in this case I thought about adding each article as a
> dataset and grouping them by day using meta-datasets. Is this a
> reasonable solution? Will it search inside each dataset (article)?*
> If not, what if I add the text of the article as meta-data of each
> article? Will it double the used space? *Will there be a form where the
> citizen can choose a date range, some categories and then do a textual
> search? After the search, will it be able to export (download) the filtered
> results (zipped, for example)?*
>

I think my initial question would be what would be the *perfect* schema you
would have if you could have anything you wanted? Also what exactly is the
use case you are envisaging - will it be specific types of people searching
this (and what are they looking for), or do you want a general browsable
interface to the gazette?


> I think that adding each article as a dataset makes it easy to have a URI
> for each of them, right? What about a URI for the real article itself, not
> the document (I mean that link that returns 303 and forwards to the
> document), is it possible in CKAN too? And doing listing based on the URI
> (like www.mysite.org/articles/2010/02/ will return all articles published
> in February of 2010)?
>

If you really want your own custom URLs etc., it may be better to build your
own web app using a framework like Flask (if you like Python).
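
As a rough illustration of the roll-your-own option, here is a minimal
sketch, assuming Flask is installed. It shows a month listing at
/articles/<year>/<month>/ and a 303 redirect from an identifier URL to the
document URL. The /id/ and /doc/ paths, the sample articles and the JSON
responses are all made up for the example; a real app would read from a
database or search index.

```python
from flask import Flask, abort, jsonify, redirect, url_for

app = Flask(__name__)

# Hypothetical in-memory sample standing in for the real article store.
ARTICLES = [
    {"id": "a1", "date": "2010-02-03", "department": "Finance", "text": "Contract notice"},
    {"id": "a2", "date": "2010-02-17", "department": "Health", "text": "New regulation"},
    {"id": "a3", "date": "2010-03-01", "department": "Finance", "text": "Budget law"},
]


def find_article(article_id):
    """Return the article with the given id, or None if there is none."""
    return next((a for a in ARTICLES if a["id"] == article_id), None)


@app.route("/articles/<int:year>/<int:month>/")
def list_month(year, month):
    """List the ids of all articles published in the given month."""
    prefix = f"{year:04d}-{month:02d}-"
    ids = [a["id"] for a in ARTICLES if a["date"].startswith(prefix)]
    return jsonify({"articles": ids})


@app.route("/id/article/<article_id>")
def article_identifier(article_id):
    """URI for the article itself: answer 303 and point at the document."""
    if find_article(article_id) is None:
        abort(404)
    return redirect(url_for("article_document", article_id=article_id), code=303)


@app.route("/doc/article/<article_id>")
def article_document(article_id):
    """URI for the document describing the article."""
    article = find_article(article_id)
    if article is None:
        abort(404)
    return jsonify(article)
```

Running it with Flask's development server and requesting
/articles/2010/02/ should list the two February sample articles.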


> I code in Python and develop sites using Pyramid, so I would like to know
> what would be best in this case: develop something from zero, or customize
> CKAN. I would rather use a nice open project like CKAN, but, if the
> purpose of the project is too different from my needs, maybe I should do it
> from zero...
>

I think it would be worth spelling out a bit more what your exact needs are
(user stories). CKAN could prove a good way to build a quick proof of
concept even if you ultimately need to move to a "roll-your-own" model for
the next iteration.

Rufus


> *What do you say? Is CKAN suitable for my needs? Which extensions would
> you recommend? What would I have to implement by myself in CKAN?*
>
>
> Best regards and thanks for the attention.
>


-- 


Rufus Pollock
Founder and CEO | skype: rufuspollock | @rufuspollock <https://twitter.com/rufuspollock>
The Open Knowledge Foundation <http://okfn.org/>
Empowering through Open Knowledge
http://okfn.org/ | @okfn <http://twitter.com/OKFN> | OKF on Facebook
<https://www.facebook.com/OKFNetwork> | Blog <http://blog.okfn.org/> |
Newsletter <http://okfn.org/about/newsletter>