[ckan-dev] Is CKAN suitable for textual search in a 10Gb dataset?

Andrés Martano andres at inventati.org
Fri Mar 21 15:12:09 UTC 2014


Hello, everybody!

As part of my master's degree project, I will be helping one of the
biggest cities in Brazil open a few datasets.
CKAN seems well suited to letting people download datasets as a whole.
But in our case, we want citizens to be able to do a textual search
inside the dataset, with some advanced search features (such as
filtering by date or other metadata).
We would also like to support URIs and RDF.

To be clearer, I will give some details about one of the datasets:

I don't know about other countries, but in Brazil we have a special kind
of newspaper, or gazette, where the public administration publishes
anything that must become legally valid (like laws or public contracts).
This dataset consists of all these articles, ordered by date, with some
metadata about what is written in each article and which public
department is publishing it.
*In other words, thousands of small TXT files totaling about 10 GB, and
we need to allow textual search across all of them.*
I saw in the docs that CKAN comes with Solr, but is there a way to use
it to search inside a dataset? Or is it only used to search among
datasets?
To use CKAN in this case, I thought about adding each article as a
dataset and grouping them by day using meta-datasets. Is this a
reasonable solution? Will it search inside each dataset (article)? If
not, what if I add the text of the article as metadata of each article?
Will that double the disk space used? Will there be a form where the
citizen can choose a date range and some categories, and then do a
textual search? After the search, will it be possible to export
(download) the filtered results (zipped, for example)?
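
To make the "one dataset per article" idea concrete, here is a rough
sketch of the record I have in mind for each article. It only builds a
dictionary shaped like the input of CKAN's package_create action, as I
understand it from the docs (name, title, notes, and extras as a list of
key/value pairs). The function name, the "published" and "department"
keys, and the choice to put the full text in "notes" are my own
assumptions, not something CKAN prescribes.

```python
import re
from datetime import date


def article_to_dataset(article_id, title, text, published, department):
    """Build a dict shaped like the input of CKAN's package_create action.

    Hypothetical helper: the extras keys and the use of "notes" for the
    full article text are assumptions made for illustration.
    """
    # Reduce the identifier to lowercase alphanumerics, '-' and '_',
    # which is the character set CKAN accepts for dataset names.
    slug = re.sub(r"[^a-z0-9_-]+", "-", article_id.lower()).strip("-")
    return {
        "name": "article-" + slug,
        "title": title,
        "notes": text,  # assumption: full text kept in the description
        "extras": [
            {"key": "published", "value": published.isoformat()},
            {"key": "department", "value": department},
        ],
    }


if __name__ == "__main__":
    record = article_to_dataset(
        "2010/02/15 #42", "Contract X", "Full text of the article...",
        date(2010, 2, 15), "Health",
    )
    print(record["name"])
```

If the description field is what gets indexed, storing the text there
would answer the search question, but it is also exactly where the
doubled-storage concern comes from.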

I think that adding each article as a dataset makes it easy to have a
URI for each of them, right? What about a URI for the real article
itself, not the document (I mean a link that returns 303 and redirects
to the document)? Is that possible in CKAN too? And what about listings
based on the URI (for example, www.mysite.org/articles/2010/02/ would
return all articles published in February 2010)?
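
For the URI-based listing, this is roughly the mapping I am thinking of:
a path such as /articles/2010/02/ is turned into a date range, which
could then be used to filter articles by publication date. This is plain
Python and does not touch CKAN at all; the function name and the exact
path layout are hypothetical, taken only from the example URL above.

```python
import re
from datetime import date

# Matches paths like /articles/2010/02/ (trailing slash optional).
_MONTH_PATH = re.compile(r"^/articles/(\d{4})/(\d{2})/?$")


def month_range(path):
    """Return (start, end) dates for a monthly listing path.

    `start` is the first day of the month and `end` is the first day of
    the following month (exclusive). Returns None if the path does not
    match or the month is invalid.
    """
    match = _MONTH_PATH.match(path)
    if match is None:
        return None
    year, month = int(match.group(1)), int(match.group(2))
    if not 1 <= month <= 12:
        return None
    start = date(year, month, 1)
    # Roll over to January of the next year after December.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end


if __name__ == "__main__":
    print(month_range("/articles/2010/02/"))
```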


I code in Python and develop sites using Pyramid, so I would like to
know what would be best in this case: developing something from scratch,
or customizing CKAN. I would rather use such a nice open project as
CKAN, but if the purpose of the project is too different from my needs,
maybe I should build it from scratch...
*What do you say? Is CKAN suitable for my needs? Which extensions would
you recommend? What would I have to implement myself in CKAN?*


Best regards, and thank you for your attention.