[ckan-discuss] Is CKAN suitable for textual search in a 10Gb dataset?
Pieter Colpaert
pieter.colpaert at okfn.org
Tue Mar 25 12:58:35 UTC 2014
Hi Andrés,
CKAN only takes care of the meta-data of datasets, not the data itself.
You can however build extensions to link CKAN with other systems.
Your problem looks like something that could be solved with a full-text
indexer like Solr or Elasticsearch. Since you also want to do SPARQL
querying, I would look into two options, which you will have to
google:
* Lucene with Jena
* Virtuoso and its bif:contains full-text predicate
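To give an idea of what the full-text indexer route could look like from Python, here is a minimal sketch that only builds the URL for a Solr select request combining a phrase search with a date filter. The core name "gazette" and the field names "body" and "published" are assumptions made up for illustration; they depend entirely on how you define your own Solr schema, and nothing here is specific to CKAN.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, text, date_from, date_to, rows=10):
    """Build a Solr select URL: phrase search on the text, filtered by date.

    The field names "body" and "published" are hypothetical; replace them
    with whatever your own Solr schema defines.
    """
    # Escape backslashes and double quotes so the text stays a single phrase.
    phrase = text.replace("\\", "\\\\").replace('"', '\\"')
    params = [
        ("q", f'body:"{phrase}"'),
        # fq is a filter query; the range syntax is [start TO end].
        ("fq", f"published:[{date_from}T00:00:00Z TO {date_to}T23:59:59Z]"),
        ("rows", str(rows)),
        ("wt", "json"),
    ]
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


# Hypothetical local Solr core named "gazette".
url = build_solr_query_url(
    "http://localhost:8983/solr/gazette/", "contrato", "2010-02-01", "2010-02-28"
)
print(url)
```

Fetching that URL with any HTTP client should return the matching articles as JSON, provided the core has been indexed with those fields. The same date filter would also cover listings such as all articles published in February 2010.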
Hope this helps!
Pieter
On 2014-03-25 13:55, Andrés Martano wrote:
> Hello, everybody!
>
> I sent this message to the dev list a few days ago, but got no reply,
> so I will try here:
>
>
> As a part of my master's degree project I will be helping one of the
> biggest cities in Brazil to open a few datasets.
> CKAN seems pretty good to allow the download of datasets as a whole.
> But in our case, we want to allow the citizen to do a textual search
> inside the dataset using some advanced search features (like date or
> another meta-data).
> We would like to support some URIs and RDF too.
>
> To be more clear I will give some details about one of the datasets:
>
> I don't know about other countries, but in Brazil we have a special kind
> of newspaper or gazette where the public administration publishes
> anything that must become legal (like laws or public contracts). This
> dataset consists of all these articles ordered by date and with some
> meta-data about what is written in the article and which public
> department is publishing it.
> *In other words, thousands of small TXT files totaling about 10 GB, and
> we need to allow textual search in all that.*
> I saw in the docs that CKAN comes with Solr, but is there a way to use
> it to search inside a dataset? Or is it used only to search across
> datasets?
> *To use CKAN in this case I thought about adding each article as a
> dataset and grouping them by day using meta-datasets. Is this a
> reasonable solution? Will it search inside each dataset (article)?
> If not, what if I add the text of the article as metadata of each
> article? Will it double the used space? Will there be a form where
> the citizen can choose a date range, some categories and then do a
> textual search? After the search, will it be able to export
> (download) the filtered results (zipped, for example)?*
>
> I think that adding each article as a dataset makes it easy to have a
> URI for each of them, right? What about a URI for the real article
> itself, not the document (I mean that link that returns 303 and
> forwards to the document), is it possible in CKAN too? And doing
> listing based on the URI (like www.mysite.org/articles/2010/02/ will
> return all articles published in February of 2010)?
>
>
> I code in Python and develop sites using Pyramid, so I would like to
> know what would be best in this case: develop something from zero, or
> customize CKAN. I would rather use such a nice open project as CKAN,
> but, if the purpose of the project is too different from my needs,
> maybe I should do it from zero...
> *What do you say? Is CKAN suitable for my needs? Which extensions
> would you recommend? What would I have to implement by myself in CKAN?*
>
>
> Best regards and thanks for the attention.
>
>
--
+32 486 74 71 22
Open Knowledge Foundation Belgium
http://okfn.be
Open Transport Working Group OKFN
http://transport.okfn.org