[ckan-dev] datastore - large queries

Alice Heaton a.heaton at nhm.ac.uk
Fri Mar 20 12:52:42 UTC 2015


Hi,

On 09/03/15 11:30, Alex Gartner wrote:
> Hi,
>
> thank you for the response.
> I think that in the short term we'll follow your advice about
> limiting requests from nginx and merging the PR. If we decide to go
> beyond that for a longer-term solution, we'll let you know.
>

Something else I just noticed going through our logs: you might want to
block external access to /datastore/dump/<resource id> for large
resources, as this attempts to generate a CSV of the whole resource. On
our 2.7M-row resource this would time out, but use a lot of memory in the
process. Repeated access, even below our request rate threshold, would
have been painful.
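
If blocking it in nginx is not convenient, the same check could be done at
the application level instead. This is only an untested sketch of WSGI
middleware wrapping the CKAN app; the resource id below is a placeholder,
and you would want to read the list from config rather than hard-code it:

    # Untested sketch: refuse /datastore/dump/<id> for resources that are
    # too large to dump. The id below is a placeholder.
    LARGE_RESOURCES = {'xxxxxxxx-placeholder-resource-id'}

    class BlockLargeDumps(object):
        """WSGI middleware that wraps the CKAN application."""

        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            path = environ.get('PATH_INFO', '')
            if path.startswith('/datastore/dump/'):
                resource_id = path[len('/datastore/dump/'):].strip('/')
                if resource_id in LARGE_RESOURCES:
                    start_response('403 Forbidden',
                                   [('Content-Type', 'text/plain')])
                    return ['Dumps are disabled for this resource\n']
            return self.app(environ, start_response)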

Best Wishes,
Alice

> Thanks again,
> Alex
>
>
> On Fri, Mar 6, 2015 at 12:41 PM, Alice Heaton <a.heaton at nhm.ac.uk> wrote:
>
>     Hello,
>
>     There is no setting in CKAN (to my knowledge) to help with this.
>     Things we have done here:
>
>     - Use nginx to limit the request rate (total and from a single IP)
>     to the datastore api;
>     - Ensure our servers can deal with as many requests as we allow
>     (so worst case the site is blocked, but the servers won't go down);
>     - Clear the response after each request. At the moment this
>     doesn't happen, so a worker keeps the last response in memory
>     until it is used again. With 20 workers returning very large
>     responses, this can kill your memory very quickly. You'll need to
>     merge this PR to do this: https://github.com/ckan/ckan/pull/2262
>
>     I have been thinking of imposing a hard limit on the number of rows
>     returned per request. This could be implemented as middleware
>     which simply returns a 400 error if the requested limit is higher
>     than a configured number. This is not a priority for us, but we
>     might come round to doing it at some point. If you implement
>     something like this we will definitely help with testing!
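>
>     To give an idea of what I mean, here is a rough, untested sketch.
>     MAX_LIMIT would come from the CKAN config rather than being
>     hard-coded, and this only covers a limit passed in the query string
>     of datastore_search, not POSTed JSON bodies or datastore_search_sql:
>
>         # Untested sketch (Python 2, as CKAN currently runs on).
>         from urlparse import parse_qs
>
>         MAX_LIMIT = 1000  # would come from the CKAN config
>
>         class EnforceDatastoreLimit(object):
>             """Reject datastore_search requests asking for too many rows."""
>
>             def __init__(self, app):
>                 self.app = app
>
>             def __call__(self, environ, start_response):
>                 if environ.get('PATH_INFO', '').endswith('/datastore_search'):
>                     params = parse_qs(environ.get('QUERY_STRING', ''))
>                     limit = params.get('limit', ['0'])[0]
>                     if limit.isdigit() and int(limit) > MAX_LIMIT:
>                         start_response('400 Bad Request',
>                                        [('Content-Type', 'text/plain')])
>                         return ['limit may not exceed %d rows\n' % MAX_LIMIT]
>                 return self.app(environ, start_response)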
>
>     Another approach currently under discussion is to have a
>     streaming/chunked API, so the response is only built as the client
>     consumes it. This is only at the discussion stage, but it is worth
>     keeping an eye on: https://github.com/ckan/ideas-and-roadmap/issues/128
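>
>     Just to illustrate the general idea (nothing like this exists in
>     CKAN yet, and fetch_page below is a made-up paging function):
>
>         # Illustration only: stream the result page by page instead of
>         # building the whole response in memory.
>         def fetch_page(offset, limit):
>             # would run a paged SELECT against the datastore
>             return []
>
>         def stream_rows():
>             offset = 0
>             while True:
>                 rows = fetch_page(offset, 1000)
>                 if not rows:
>                     break
>                 for row in rows:
>                     yield ','.join(str(v) for v in row) + '\n'
>                 offset += len(rows)
>
>         def application(environ, start_response):
>             start_response('200 OK', [('Content-Type', 'text/csv')])
>             return stream_rows()  # chunks are sent as they are produced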
>
>     Alice
>
>
>     On 06/03/15 00:32, Alex Gartner wrote:
>>     Hi everyone,
>>
>>     I have a question related to the datastore API being used by a
>>     user with *bad intentions* to achieve a denial of service of some
>>     kind. Since the project that I'm working on plans to have
>>     datastore tables with around 1 million rows, I'm thinking this
>>     might be used against the system.
>>     To give a few examples:
>>
>>       * The following request to the "datastore_search_sql" endpoint
>>         takes around 2 minutes on my laptop to complete for a
>>         datastore_table with 2,500 rows and a limit of 250,000. Without
>>         the limit I imagine it would take around 50 minutes (there would
>>         be 2,500 x 2,500 = 6,250,000 rows in the response).
>>           o curl -G localhost:5000/api/action/datastore_search_sql
>>             --data-urlencode "sql=SELECT a.* from datastore_table a,
>>             datastore_table b limit 250000"
>>       * accessing the "datastore_search" endpoint with a limit of
>>         250,000 also takes about 2 minutes (for a table of around
>>         500,000 rows)
>>           o curl
>>             "http://localhost/api/action/datastore_search?resource_id=resource_id&limit=250000"
>>
>>     I imagine that somebody hitting the datastore API with X
>>     simultaneous requests for data with a limit of 1 million could
>>     block the server (while using all the db connections).
>>
>>     Is there a way to set a hard limit, which cannot be overridden by
>>     the user, on the number of results returned by a query to the
>>     datastore (to force pagination, in a way)?
>>
>>     And in more general terms, what would be the best practice for
>>     avoiding such issues? Are there some CKAN settings that help with
>>     this? Should we set up the web server (nginx, apache) to enforce
>>     a stricter limit on the number of simultaneous HTTP requests to
>>     the datastore API endpoints?
>>
>>     Thanks for the help,
>>     Alex
>>
>>
