[ckan-dev] datastore - large queries

Alice Heaton a.heaton at nhm.ac.uk
Fri Mar 6 10:41:11 UTC 2015


Hello,

There is no setting in CKAN (to my knowledge) to help with this. Things 
we have done here:

- Use nginx to limit the request rate (total and from a single IP) to 
the datastore API (see the sketch after this list);
- Ensure our servers can deal with as many requests as we allow (so in 
the worst case the site is blocked, but the servers won't go down);
- Clear the response after each request. At the moment this doesn't 
happen, so a worker keeps the last response in memory until it is used 
again. With 20 workers returning very large responses, this can exhaust 
your memory very quickly. You'll need to merge this PR to get this 
behaviour: https://github.com/ckan/ckan/pull/2262
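
For reference, here is a minimal sketch of the nginx rate limiting 
(the zone names, rates, bursts and endpoint regex are illustrative, 
not the values we run in production):

    http {
        # Per-IP limit: at most 5 requests/second from a single address.
        limit_req_zone $binary_remote_addr zone=datastore_ip:10m rate=5r/s;
        # Overall limit across all clients, keyed on a per-vhost constant.
        limit_req_zone $server_name zone=datastore_all:1m rate=50r/s;

        server {
            # Match the datastore API actions, with or without the
            # API version in the path.
            location ~ ^/api(/\d)?/action/datastore_ {
                limit_req zone=datastore_ip burst=10;
                limit_req zone=datastore_all burst=50;
                proxy_pass http://127.0.0.1:8080;
            }
        }
    }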

I have been thinking of imposing a hard limit on the number of rows 
returned per request. This could be implemented as a middleware that 
simply returns a 400 error when the requested limit is higher than a 
configured maximum. This is not a priority for us, but we might come 
round to doing it at some point. If you implement something like this, 
we will definitely help with testing!
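
To sketch the idea (untested; MAX_ROWS and the matched path are 
illustrative, and a real version would also have to look at POST 
bodies and at the "sql" parameter of datastore_search_sql):

    # Untested sketch of a hard row limit as WSGI middleware.
    from urlparse import parse_qs  # Python 2, as used by CKAN

    MAX_ROWS = 1000  # illustrative; read this from the CKAN config

    class DatastoreLimitMiddleware(object):
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            path = environ.get('PATH_INFO', '')
            if '/action/datastore_search' in path:
                query = parse_qs(environ.get('QUERY_STRING', ''))
                try:
                    limit = int(query.get('limit', ['0'])[0])
                except ValueError:
                    limit = 0
                if limit > MAX_ROWS:
                    body = '{"success": false, "error": "limit too high"}'
                    start_response('400 Bad Request',
                                   [('Content-Type', 'application/json'),
                                    ('Content-Length', str(len(body)))])
                    return [body]
            return self.app(environ, start_response)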

Another approach currently under discussion is a streaming/chunked 
API, so that the response is only built as the client consumes it. 
This is only at the discussion stage, but worth keeping an eye on: 
https://github.com/ckan/ideas-and-roadmap/issues/128
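
As a rough illustration of that idea (not the actual proposal; the 
batch size is made up and the cursor is assumed to be a server-side 
psycopg2 cursor returning dicts):

    import json

    BATCH_SIZE = 1000  # illustrative

    def stream_datastore_records(cursor):
        """Yield the JSON response piece by piece instead of building
        it in memory, so a worker never holds the full result set."""
        yield '{"success": true, "result": {"records": ['
        first = True
        while True:
            rows = cursor.fetchmany(BATCH_SIZE)
            if not rows:
                break
            for row in rows:
                if not first:
                    yield ','
                first = False
                yield json.dumps(row)
        yield ']}}'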

Alice

On 06/03/15 00:32, Alex Gartner wrote:
> Hi everyone,
>
> I have a question related to the datastore API being used by a user 
> with *bad intentions* to achieve a denial of service of some kind. 
> Since the project that I'm working on plans to have datastore tables 
> with around 1 million rows, I'm thinking this might be used against 
> the system.
> To give a few examples:
>
>   * The following request to the "datastore_search_sql" endpoint takes
>     around 2 minutes on my laptop for a datastore_table with 2,500
>     rows and a limit of 250,000. Without the limit I imagine it would
>     take around 50 minutes (there would be 2,500 x 2,500 rows in the
>     response):
>
>       curl -G localhost:5000/api/action/datastore_search_sql \
>         --data-urlencode "sql=SELECT a.* FROM datastore_table a,
>         datastore_table b LIMIT 250000"
>
>   * Accessing the "datastore_search" endpoint with a limit of 250,000
>     also takes about 2 minutes (for a table of around 500,000 rows):
>
>       curl "http://localhost/api/action/datastore_search?resource_id=resource_id&limit=250000"
>
> I imagine that somebody hitting the datastore API with X simultaneous 
> requests for data with a limit of 1 million could block the server 
> (by using up all the db connections).
>
> Is there a way to set a hard limit, one that cannot be overridden by 
> the user, on the number of results returned by a query to the 
> datastore (to force pagination, in a way)?
>
> And in more general terms, what would be the best practice for 
> avoiding such issues? Are there any CKAN settings that help with 
> this? Should we set up the web server (nginx, apache) to impose a 
> stricter limit on the number of simultaneous HTTP requests to the 
> datastore API endpoints?
>
> Thanks for the help,
> Alex
