[ckan-discuss] Is CKAN suitable for textual search in a 10Gb dataset?

Andrés Martano andres at inventati.org
Wed Apr 2 12:35:41 UTC 2014


So... is there a simple way in CKAN to let the common citizen search
inside a 10Gb dataset like that?
If not, would it be reasonable to implement it as an extension?

Would it be reasonable to map each TXT article (~10Kb) as a dataset?
That would mean millions of datasets...
Or would it be better to keep them inside a ZIP, maybe grouped by month
or year, and let CKAN search inside them?

Does it have any feature for exporting filtered results?

On 25-03-2014 16:28, Andrés Martano wrote:
> Thanks for the answers, Hanssens and Rufus.
>
> I didn't know about DataTank. It seems a very interesting project, and
> quite beautiful too. ;)
> But checking the demo site, I still think it's more focused on the
> "computer expert", the one who wants a simple visualization just to
> take a look at the data, and then download it or deal with it via an API.
>
>> I think it would be worth spelling out a bit more what your exact
>> needs are (user stories).
> Sure. We're thinking about 3 types of users:
> - The "computer expert" who will be happy to download the entire DB
> so he can reuse it in another program. CKAN solves it out of the box.
> - The "academic" who wants the data in RDF to do more complex stuff.
> Generally he can download the whole DB too, so we could solve this by
> generating an RDF version of the DB and posting it in CKAN too.
> - The "common citizen", who isn't interested in the whole DB, and
> wouldn't be able to deal with a 10Gb DB. He doesn't even know what
> "CSV", "JSON", "DB" or "10Gb" mean.
>
> The third user is the one I am still unsure whether I can help
> with a tool like CKAN, or whether I'll have to code one myself.
> This user wants an interface like this:
> http://query.nytimes.com/search/sitesearch/#/cats/
> Where it's possible to enter some text, pick some categories and a
> date period, do a textual search and browse the results. He'll be
> looking for a specific article or group of articles. Most of the time
> he'll be happy to read the articles online, but sometimes he'll need
> to export the results as a ZIP with TXTs.
>
> Since the third user needs to access the articles individually, I
> thought about using "cool" URLs for each one. Not that the common
> citizen cares about it, but the RDF guy does.
>
>> Is the data in the txt files structured or unstructured (i.e. do you
>> want raw full text search or will you be able to extract specific fields)?
> Raw search would do it.
>
>> If not, 10 GB is pretty small by today's standards. Depending on what
>> your requirements are, even a simple command line tool like grep
>> could do the trick.
> You have a point. Maybe Whoosh
> <https://bitbucket.org/mchaput/whoosh/wiki/Home> can solve it too,
> since it's pure Python, reducing integration costs.
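
To make the grep idea concrete, here is a minimal sketch using only the
Python standard library. It assumes a hypothetical directory layout of
root/YYYY/MM/DD/<name>.txt (not something defined in this thread), does
a case-insensitive scan with an optional date period, and exports the
matches as a ZIP with TXTs. The function names are made up for
illustration. It reads every file on each query, so it is probably too
slow for an interactive site over the whole 10Gb; an index (Whoosh,
Solr) would likely be needed there.

```python
import re
import zipfile
from datetime import date
from pathlib import Path


def search_articles(root, text, start=None, end=None):
    """Yield paths of .txt files under root that contain text (case-insensitive).

    Assumes a hypothetical layout root/YYYY/MM/DD/<name>.txt, so the
    publication date can be read from the directory names.
    """
    pattern = re.compile(re.escape(text), re.IGNORECASE)
    for path in sorted(Path(root).rglob("*.txt")):
        try:
            year, month, day = (int(p) for p in path.relative_to(root).parts[:3])
            published = date(year, month, day)
        except ValueError:
            continue  # skip files outside the assumed date layout
        if start and published < start:
            continue
        if end and published > end:
            continue
        if pattern.search(path.read_text(encoding="utf-8", errors="replace")):
            yield path


def export_zip(paths, root, zip_path):
    """Write the matching files into a ZIP, keeping their relative paths."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in paths:
            archive.write(path, arcname=path.relative_to(root).as_posix())
```

Usage would be something like
export_zip(search_articles("gazette", "contrato"), "gazette", "results.zip"),
where "gazette" is a placeholder directory name.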

On 25-03-2014 13:34, Rufus Pollock wrote:
> On Tuesday, 25 March 2014, Andrés Martano <andres at inventati.org
> <mailto:andres at inventati.org>> wrote:
>
>     Hello, everybody!
>
>     I sent this message to the dev list a few days ago, but got no
>     reply, so I will try here:
>
>
>     As a part of my master's degree project I will be helping one of
>     the biggest cities in Brazil to open a few datasets.
>     CKAN seems pretty good to allow the download of datasets as a
>     whole. But in our case, we want to allow the citizen to do a
>     textual search inside the dataset using some advanced search
>     features (like date or another meta-data).
>     We would like to support some URIs and RDF too.
>
>     To be more clear I will give some details about one of the datasets:
>
>     I don't know about other countries, but in Brazil we have a special
>     kind of newspaper or gazette where the public administration
>     publishes anything that must become legal (like laws or public
>     contracts). This dataset consists of all these articles ordered by
>     date and with some meta-data about what is written in the article
>     and which public department is publishing it.
>     *In other words, thousands of small TXT files summing about 10Gb,
>     and we need to allow textual search in all that.*
>
>
> Is the data in the txt files structured or unstructured (i.e. do you
> want raw full text search or will you be able to extract specific fields)?
>  
>
>     I saw in the docs that CKAN comes with Solr, but is there a way to
>     use it to search inside a dataset? Or is it used only to search
>     across datasets?
>
>
> Solr is used at the moment to search metadata. If you're storing
> "data" (i.e. the raw text from the text files), that goes into the
> DataStore, which is PostgreSQL-based.
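
If the article text were loaded into the DataStore, a full-text query
could go through CKAN's datastore_search action, whose "q" parameter
searches across the fields of each row. The sketch below only builds the
request and does not send it; the site URL and resource id are
placeholders, and the helper name is made up for illustration.

```python
import json
import urllib.request


def build_datastore_search(ckan_url, resource_id, text, limit=20):
    """Build (but do not send) a POST request for CKAN's datastore_search action."""
    payload = {"resource_id": resource_id, "q": text, "limit": limit}
    return urllib.request.Request(
        ckan_url.rstrip("/") + "/api/3/action/datastore_search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


# Sending it would look like this (needs a reachable CKAN instance):
# with urllib.request.urlopen(request) as response:
#     records = json.load(response)["result"]["records"]
```

This covers searching inside one resource; it does not by itself give
the citizen a search form or a ZIP export.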
>  
>
>     *To use CKAN in this case I thought about adding each article as
>     a dataset and grouping them by day using meta-datasets. Is this
>     a reasonable solution? Will it search inside each
>     dataset (article)? If not, what if I add the text of the article
>     as metadata of each article? Will it double the used space?
>     Will there be a form where the citizen can choose a date range,
>     some categories and then do a textual search? After the search,
>     will it be able to export (download) the filtered results (zipped,
>     for example)?*
>
>
> I think my initial question would be what would be the *perfect*
> schema you would have if you could have anything you wanted? Also what
> exactly is the use case you are envisaging - will it be specific types
> of people searching this (and what are they looking for), or do you
> want a general browsable interface to the gazette?
>  
>
>     I think that adding each article as a dataset makes it easy to
>     have a URI for each of them, right? What about a URI for the
>     real article itself, not the document (I mean that link that
>     returns 303 and forwards to the document), is it possible in CKAN
>     too? And what about listings based on the URI (like
>     www.mysite.org/articles/2010/02/
>     <http://www.mysite.org/articles/2010/02/> returning all articles
>     published in February of 2010)?
>
>
> If you really want your own custom URLs etc. it may be better to build
> your own web app using a framework like Flask (if you like Python).
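
As a rough illustration of the URL scheme asked about above, here is a
framework-agnostic sketch in plain Python. The /id/ and /doc/ prefixes,
the in-memory ARTICLES index and the route() function are all
assumptions made for the example; in Flask or Pyramid each branch would
become a route handler instead.

```python
import re

# Hypothetical in-memory index: article id -> (year, month, title).
ARTICLES = {
    "a1": (2010, 2, "Public contract 12"),
    "a2": (2010, 2, "Municipal law 345"),
    "a3": (2011, 5, "Public contract 99"),
}


def route(path):
    """Return (status, headers, body) for a request path."""
    # /id/<article>: identifies the article itself, so answer 303 See Other
    # and point at the document that describes it.
    match = re.fullmatch(r"/id/(\w+)", path)
    if match and match.group(1) in ARTICLES:
        return 303, {"Location": "/doc/" + match.group(1)}, ""
    # /doc/<article>: the retrievable document.
    match = re.fullmatch(r"/doc/(\w+)", path)
    if match and match.group(1) in ARTICLES:
        return 200, {}, ARTICLES[match.group(1)][2]
    # /articles/YYYY/MM/: list the ids of articles published in that month.
    match = re.fullmatch(r"/articles/(\d{4})/(\d{2})/", path)
    if match:
        year, month = int(match.group(1)), int(match.group(2))
        ids = sorted(k for k, (y, m, _) in ARTICLES.items() if (y, m) == (year, month))
        return 200, {}, "\n".join(ids)
    return 404, {}, ""
```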
>  
>
>     I code in Python and develop sites using Pyramid, so I would like
>     to know what would be best in this case: develop something from
>     scratch, or customize CKAN. I would rather use such a nice open
>     project as CKAN, but, if the purpose of the project is too
>     different from my needs, maybe I should do it from scratch...
>
>
> I think it would be worth spelling out a bit more what your exact
> needs are (user stories). CKAN could prove a good way to build a quick
> proof of concept even if you ultimately need to move to a
> "roll-your-own" model for the next iteration.
>
> Rufus
>  
>
>     *What do you say? Is CKAN suitable for my needs? Which extensions
>     would you recommend? What would I have to implement by myself in
>     CKAN?*
>
>
>     Best regards and thanks for the attention.
>
>
>
> -- 
> Rufus Pollock
> Founder and CEO | skype: rufuspollock | @rufuspollock
> <https://twitter.com/rufuspollock>
> The Open Knowledge Foundation <http://okfn.org/>
> Empowering through Open Knowledge
> http://okfn.org/ | @okfn <http://twitter.com/OKFN> | OKF on Facebook
> <https://www.facebook.com/OKFNetwork> | Blog <http://blog.okfn.org/> |
> Newsletter <http://okfn.org/about/newsletter>
>
