[ckan-dev] Is CKAN suitable for textual search in a 10Gb dataset?

Nigel Babu nigel.babu at okfn.org
Mon Apr 7 02:38:33 UTC 2014


Hi Andres,

Storing the data in CKAN for cataloging might be a good idea, but I'm
fairly sure that you will need something other than CKAN to do the
textual search of the content of the resources. It will take a bit of
customization, but it might be easier to use CKAN as the backend and
build additional services for the textual search.
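
To make the "additional service" idea a bit more concrete, here is a
minimal sketch of what such a service could do, assuming the article texts
are indexed outside CKAN and CKAN only holds the catalog entries. The names
ArticleIndex, add and search are hypothetical and not part of CKAN. A real
deployment would more likely sit on a dedicated search engine than on an
in-memory index like this one.

```python
import re
from collections import defaultdict
from datetime import date

# Split text into word tokens; \w is Unicode-aware in Python 3,
# so accented Portuguese words stay intact.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Return the lowercase word tokens found in text."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


class ArticleIndex:
    """Hypothetical in-memory inverted index over gazette articles."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> ids of articles containing it
        self.dates = {}                   # article id -> publication date

    def add(self, article_id, published, text):
        """Index one article under its id and publication date."""
        self.dates[article_id] = published
        for token in tokenize(text):
            self.postings[token].add(article_id)

    def search(self, query, start=None, end=None):
        """Return sorted ids of articles containing every query word,
        optionally restricted to an inclusive date range."""
        tokens = tokenize(query)
        if not tokens:
            return []
        # AND semantics: intersect the posting sets of all query words.
        hits = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        results = []
        for article_id in hits:
            published = self.dates[article_id]
            if start is not None and published < start:
                continue
            if end is not None and published > end:
                continue
            results.append(article_id)
        return sorted(results)
```

The ids returned by search could then be used to look up the matching
catalog entries in CKAN, so the catalog and the text index stay separate.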

Nigel Babu

Developer  |  @nigelbabu <https://twitter.com/nigelbabu>

The Open Knowledge Foundation <http://okfn.org/>


On 21 March 2014 20:42, Andrés Martano <andres at inventati.org> wrote:

>  Hello, everybody!
>
> As a part of my master's degree project I will be helping one of the
> biggest cities in Brazil to open a few datasets.
> CKAN seems well suited to offering datasets for download as a whole. But
> in our case, we want to let citizens do a textual search inside the
> dataset using some advanced search features (like filtering by date or
> other metadata).
> We would like to support some URIs and RDF too.
>
> To be clearer, I will give some details about one of the datasets:
>
> I don't know about other countries, but in Brazil we have a special kind of
> newspaper or gazette where the public administration publishes anything that
> must become legal (like laws or public contracts). This dataset consists of
> all these articles, ordered by date, with some metadata about what is
> written in each article and which public department published it.
> *In other words, thousands of small TXT files totalling about 10 GB, and we
> need to allow textual search across all of them.*
> I saw in the docs that CKAN comes with Solr, but is there a way to use it
> to search inside a dataset? Or is it used only to search across datasets?
> *To use CKAN in this case, I thought about adding each article as a
> dataset and grouping them by day using meta-datasets.* *Is this a
> reasonable solution?* *Will it search inside each dataset (article)?*
> If not, what if I add the text of the article as metadata of each
> article? Will it double the space used? *Will there be a form where the
> citizen can choose a date range and some categories and then do a textual
> search? After the search, will it be possible to export (download) the
> filtered results (zipped, for example)?*
>
> I think that adding each article as a dataset makes it easy to have a URI
> for each of them, right? What about a URI for the real article itself, not
> the document (I mean the link that returns 303 and redirects to the
> document)? Is that possible in CKAN too? And what about listings based on
> the URI (for example, www.mysite.org/articles/2010/02/ would return all
> articles published in February 2010)?
>
>
> I code in Python and develop sites using Pyramid, so I would like to know
> what would be best in this case: develop something from scratch, or
> customize CKAN. I would rather use such a nice open project as CKAN, but if
> the purpose of the project is too different from my needs, maybe I should
> build it from scratch...
> *What do you say? Is CKAN suitable for my needs? Which extensions would
> you recommend? What would I have to implement myself in CKAN?*
>
>
> Best regards and thanks for the attention.