[ckan-discuss] Is CKAN suitable for textual search in a 10Gb dataset?

Rufus Pollock rufus.pollock at okfn.org
Tue Mar 25 16:29:08 UTC 2014


On Tuesday, 25 March 2014, Pieter Colpaert <pieter.colpaert at okfn.org> wrote:

>  Hi Andrés,
>
> CKAN only takes care of the meta-data of datasets, not the data itself.
> You can however build extensions to link CKAN with other systems.
>

Just jumping in to correct this misapprehension. CKAN takes care of
*data* as well as metadata and has done for years :-)

In particular, both the FileStore ("blob storage") and the DataStore
(structured, searchable data storage) have been around for many CKAN
versions and are now fairly mature. The DataStore is backed by
PostgreSQL and has a full data API and search support.
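
As a rough sketch of what that search support looks like from Python: the
DataStore exposes a `datastore_search` action that accepts a full-text
query (`q`) and exact-match `filters`. The site URL, resource id and the
`department` field below are made-up placeholders, and the sketch assumes
the articles have already been loaded into a DataStore resource.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url, resource_id, text, filters=None, limit=20):
    """Build a datastore_search URL for a full-text query on one resource."""
    params = {"resource_id": resource_id, "q": text, "limit": limit}
    if filters:
        # Exact-match filters are sent as a JSON-encoded dictionary.
        params["filters"] = json.dumps(filters, sort_keys=True)
    return base_url.rstrip("/") + "/api/3/action/datastore_search?" + urlencode(params)


def search_articles(base_url, resource_id, text, filters=None, limit=20):
    """Run the query and return the matching rows as a list of dicts."""
    url = build_search_url(base_url, resource_id, text, filters, limit)
    with urlopen(url) as response:
        payload = json.load(response)
    if not payload.get("success"):
        raise RuntimeError("datastore_search failed: %r" % payload.get("error"))
    return payload["result"]["records"]


if __name__ == "__main__":
    # Hypothetical site and resource id; replace with real values.
    print(build_search_url(
        "https://ckan.example.org",
        "00000000-0000-0000-0000-000000000000",
        "contrato",
        filters={"department": "health"},
    ))
```

Whether this is fast enough for 10Gb of text depends on the deployment,
so it is worth trying on a sample of the articles first.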

Rufus


> Your problem seems like something that could be solved with a full-text
> indexer like Solr or Elasticsearch. Since you also want to do SPARQL
> querying, I would look into two options, which you will have to google:
>  * Lucene with Jena
>  * Virtuoso and bif:contains
>
> Hope this helps!
>
> Pieter
>
> On 2014-03-25 13:55, Andrés Martano wrote:
>
> Hello, everybody!
>
> I sent this message to the dev list a few days ago, but got no reply, so I
> will try here:
>
>
> As a part of my master's degree project I will be helping one of the
> biggest cities in Brazil to open a few datasets.
> CKAN seems pretty good for downloading datasets as a whole. But
> in our case, we want to allow citizens to do a textual search inside the
> dataset using some advanced search features (like date or other
> metadata).
> We would like to support some URIs and RDF too.
>
> To be more clear I will give some details about one of the datasets:
>
> I don't know about other countries, but in Brazil we have a special kind of
> newspaper or gazette where the public administration publishes anything that
> must become legal (like laws or public contracts). This dataset consists of
> all these articles ordered by date and with some meta-data about what is
> written in the article and which public department is publishing it.
> *In other words, thousands of small TXT files totalling about 10 GB, and we
> need to allow textual search across all of them.*
> I saw in the docs that CKAN comes with Solr, but is there a way to use it
> to search inside a dataset? Or is it used only to search across datasets?
> *To use CKAN in this case I thought about adding each article as a
> dataset and grouping them by day using meta-datasets. Is this a
> reasonable solution? Will it search inside each dataset (article)?*
> If not, what if I add the text of the article as metadata of each
> article? Will it double the used space? *Will there be a form where the
> citizen can choose a date range, some categories and then do a textual
> search? After the search, will it be able to export (download) the filtered
> results (zipped, for example)?*
>
> I think that adding each article as a dataset makes it easy to have a URI
> for each of them, right? What about a URI for the real article itself, not
> the document (I mean the link that returns 303 and forwards to the
> document), is that possible in CKAN too? And listings based on the URI
> (like www.mysite.org/articles/2010/02/ returning all articles published
> in February 2010)?
>
>
> I code in Python and develop sites using Pyramid, so I would like to know
> what would be best in this case: develop something from scratch, or customize
> CKAN. I would rather use such a nice open project as CKAN, but if the
> purpose of the project is too different from my needs, maybe I should do it
> from scratch...
> *What do you say? Is CKAN suitable for my needs? Which extensions would
> you recommend? What would I have to implement by myself in CKAN?*
>
>
> Best regards and thanks for the attention.
>
>
> _______________________________________________
> ckan-discuss mailing list
> ckan-discuss at lists.okfn.org
> https://lists.okfn.org/mailman/listinfo/ckan-discuss
> Unsubscribe: https://lists.okfn.org/mailman/options/ckan-discuss
>
>
>
> --
>
> +32 486 74 71 22
>
> Open Knowledge Foundation Belgium http://okfn.be
>
> Open Transport Working Group OKFN http://transport.okfn.org
>
>

-- 


*Rufus Pollock
Founder and CEO | skype: rufuspollock | @rufuspollock <https://twitter.com/rufuspollock>
The Open Knowledge Foundation <http://okfn.org/>
Empowering through Open Knowledge
http://okfn.org/ | @okfn <http://twitter.com/OKFN> | OKF on Facebook <https://www.facebook.com/OKFNetwork> | Blog <http://blog.okfn.org/> | Newsletter <http://okfn.org/about/newsletter>*