[ckan-discuss] Is CKAN suitable for textual search in a 10Gb dataset?

Pieter Colpaert pieter.colpaert at okfn.org
Tue Mar 25 12:58:35 UTC 2014


Hi Andrés,

CKAN only takes care of the meta-data of datasets, not the data itself. 
You can however build extensions to link CKAN with other systems.
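
To illustrate what "meta-data only" means in practice: CKAN's action API has a package_search call that searches dataset metadata (title, description, tags), not the contents of uploaded files. A minimal sketch of building such a request URL follows; the host ckan.example.org and the helper name package_search_url are made up for illustration.

```python
from urllib.parse import urlencode


def package_search_url(base_url, text, rows=10):
    """Build a URL for CKAN's package_search action.

    The search runs over dataset metadata, not over file contents.
    base_url is a hypothetical CKAN site used only for illustration.
    """
    params = {"q": text, "rows": rows}
    return base_url.rstrip("/") + "/api/3/action/package_search?" + urlencode(params)


url = package_search_url("http://ckan.example.org", "contrato")
print(url)
```

Fetching that URL returns JSON describing the matching datasets, which is why searching inside 10Gb of article text needs a separate indexer or an extension.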

Your problem seems like something that could be solved with a full-text 
indexer like Solr or Elasticsearch. Since you also want to do SPARQL 
querying, I would look into two options, which you will have to 
google:
  * Lucene with Jena
  * Virtuoso and its bif:contains
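
For the Virtuoso option, a rough sketch of what a full-text SPARQL query could look like, built from Python. Treat it as an illustration under assumptions, not a tested setup: the helper name build_fulltext_query is made up, and modelling each article with dcterms:date and dcterms:description is my guess at a data model, not something your dataset already has. The bif:contains predicate itself is Virtuoso-specific and needs the full-text index enabled.

```python
def build_fulltext_query(word, start_date, end_date, limit=20):
    """Return a SPARQL query string that uses Virtuoso's bif:contains.

    Assumptions (hypothetical data model): every article has a
    dcterms:date typed as xsd:date and its text in dcterms:description.
    """
    # Keep the sketch safe and simple: accept one alphanumeric word only.
    if not word.isalnum():
        raise ValueError("search word must be a single alphanumeric token")
    return f"""
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?article ?date WHERE {{
  ?article dcterms:date ?date ;
           dcterms:description ?text .
  ?text bif:contains "'{word}'" .
  FILTER (?date >= "{start_date}"^^xsd:date && ?date <= "{end_date}"^^xsd:date)
}}
LIMIT {int(limit)}
"""


query = build_fulltext_query("contrato", "2010-02-01", "2010-02-28")
print(query)
```

The resulting string would be sent to the Virtuoso SPARQL endpoint; the date FILTER is where the "advanced search" by date would go.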

Hope this helps!

Pieter

On 2014-03-25 13:55, Andrés Martano wrote:
> Hello, everybody!
>
> I sent this message to the dev list a few days ago, but got no reply, 
> so I will try here:
>
>
> As a part of my master's degree project I will be helping one of the 
> biggest cities in Brazil to open a few datasets.
> CKAN seems pretty good to allow the download of datasets as a whole. 
> But in our case, we want to allow citizens to do a textual search 
> inside the dataset using some advanced search features (like date or 
> other metadata).
> We would like to support some URIs and RDF too.
>
> To be more clear I will give some details about one of the datasets:
>
> I don't know about other countries, but in Brazil we have a special 
> kind of newspaper or gazette where the public administration publishes 
> anything that must become legal (like laws or public contracts). This 
> dataset consists of all these articles ordered by date and with some 
> metadata about what is written in the article and which public 
> department is publishing it.
> *In other words, thousands of small TXT files summing about 10Gb, and 
> we need to allow textual search in all that.*
> I saw in the docs that CKAN comes with Solr, but is there a way to use 
> it to search inside a dataset? Or is it used only to search across 
> datasets?
> *To use CKAN in this case, I thought about adding each article as a 
> dataset and grouping them by day using meta-datasets. Is this a 
> reasonable solution? Will it search inside each dataset (article)? 
> If not, what if I add the text of the article as metadata of each 
> article? Will it double the used space? Will there be a form where 
> the citizen can choose a date range and some categories and then do a 
> textual search? After the search, will it be able to export 
> (download) the filtered results (zipped, for example)?*
>
> I think that adding each article as a dataset makes it easy to have a 
> URI for each of them, right? What about a URI for the real article 
> itself, not the document (I mean the link that returns 303 and 
> forwards to the document), is that possible in CKAN too? And what 
> about listings based on the URI (so that www.mysite.org/articles/2010/02/ 
> returns all articles published in February of 2010)?
>
>
> I code in Python and develop sites using Pyramid, so I would like to 
> know what would be best in this case: develop something from scratch, or 
> customize CKAN. I would rather use such a nice open project as CKAN, 
> but if the purpose of the project is too different from my needs, 
> maybe I should do it from scratch...
> *What do you say? Is CKAN suitable for my needs? Which extensions 
> would you recommend? What would I have to implement by myself in CKAN?*
>
>
> Best regards and thanks for the attention.
>
>


-- 

+32 486 74 71 22

Open Knowledge Foundation Belgium
http://okfn.be

Open Transport Working Group OKFN
http://transport.okfn.org
