[ckan-dev] datastore - large queries

Alex Gartner alexandru.gartner+ckan at gmail.com
Tue Mar 24 09:41:59 UTC 2015


Hi,

I didn't know about the URL for exporting a datastore table to CSV. We
definitely need to restrict access to it.

This is really useful for us. Thank you,
Alex

On Fri, Mar 20, 2015 at 2:52 PM, Alice Heaton <a.heaton at nhm.ac.uk> wrote:

>  Hi,
>
> On 09/03/15 11:30, Alex Gartner wrote:
>
> Hi,
>
>  thank you for the response.
> I think that in the short term we'll follow your advice about limiting
> requests in nginx and merging the PR. If we decide to go beyond that
> with a longer-term solution, we'll let you know.
>
>
> Something else I just noticed going through our logs: you might want to
> block external access to /datastore/dump/<resource id> for large
> resources, as this attempts to generate a CSV of the whole resource. On
> our 2.7M-row resource this would time out, while using a lot of memory
> in the process. Repeated access, even below our request rate threshold,
> would have been painful.
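>
> For illustration, a minimal nginx sketch that blocks external access to
> the dump endpoint (the allowed network range and the upstream name are
> assumptions, adjust them to your deployment):
>
>     location /datastore/dump/ {
>         allow 10.0.0.0/8;   # internal network only (example range)
>         deny all;           # everyone else gets a 403
>         proxy_pass http://ckan_backend;  # assumed upstream name
>     }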
>
> Best Wishes,
> Alice
>
>
>  Thanks again,
> Alex
>
>
> On Fri, Mar 6, 2015 at 12:41 PM, Alice Heaton <a.heaton at nhm.ac.uk> wrote:
>
>>  Hello,
>>
>> There is no setting in CKAN (to my knowledge) to help with this. Things
>> we have done here:
>>
>> - Use nginx to limit the request rate (total and per IP) to the
>> datastore API (see the sketch after this list);
>> - Ensure our servers can handle as many requests as we allow (so in the
>> worst case the site is blocked, but the servers won't go down);
>> - Clear the response after each request. At the moment this doesn't
>> happen, so a worker keeps the last response in memory until it is used
>> again. With 20 workers returning very large responses, this can exhaust
>> your memory very quickly. You'll need to merge this PR to do this:
>> https://github.com/ckan/ckan/pull/2262
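>>
>> As a rough illustration of the first point, a minimal nginx sketch (the
>> zone names, rates and upstream name are assumptions, not our actual
>> configuration):
>>
>>     # In the http{} block.
>>     # Per-IP zone: at most 5 requests/second from a single client.
>>     limit_req_zone $binary_remote_addr zone=datastore_ip:10m rate=5r/s;
>>     # Constant-key zone: caps the total rate across all clients.
>>     limit_req_zone $server_name zone=datastore_total:1m rate=50r/s;
>>
>>     server {
>>         # Prefix match covers datastore_search and datastore_search_sql.
>>         location /api/action/datastore_ {
>>             limit_req zone=datastore_ip burst=10 nodelay;
>>             limit_req zone=datastore_total burst=20;
>>             proxy_pass http://ckan_backend;  # assumed upstream name
>>         }
>>     }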
>>
>> I have been thinking of imposing a hard limit on the number of rows
>> returned per request. This could be implemented as a middleware which
>> simply returns a 400 error if the requested limit is higher than a
>> configured maximum. This is not a priority for us, but we might get
>> round to doing it at some point. If you implement something like this we
>> will definitely help with testing!
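>>
>> A minimal sketch of what such a middleware could look like, as a plain
>> WSGI wrapper (the MAX_LIMIT value and the parameter handling are
>> assumptions; this has not been tested against CKAN):
>>
>>     import json
>>     from urllib.parse import parse_qs
>>
>>     MAX_LIMIT = 32000  # hypothetical hard cap on rows per request
>>
>>     class LimitGuardMiddleware(object):
>>         """Return 400 when a datastore request asks for too many rows."""
>>
>>         def __init__(self, app):
>>             self.app = app
>>
>>         def __call__(self, environ, start_response):
>>             path = environ.get('PATH_INFO', '')
>>             if path.startswith('/api/action/datastore_search'):
>>                 params = parse_qs(environ.get('QUERY_STRING', ''))
>>                 try:
>>                     limit = int(params.get('limit', ['0'])[0])
>>                 except ValueError:
>>                     limit = 0
>>                 if limit > MAX_LIMIT:
>>                     body = json.dumps({
>>                         'success': False,
>>                         'error': 'limit may not exceed %d' % MAX_LIMIT,
>>                     }).encode('utf-8')
>>                     start_response('400 Bad Request',
>>                                    [('Content-Type', 'application/json')])
>>                     return [body]
>>             return self.app(environ, start_response)
>>
>> Note this only inspects the query string; datastore_search_sql embeds
>> its limit in the SQL itself, so that endpoint would need the SQL
>> inspected (or a LIMIT wrapped around it) instead.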
>>
>> Another approach currently under discussion is a streaming/chunked API,
>> so the response is only built as the client consumes it. This is only at
>> the discussion stage, but worth keeping an eye on:
>> https://github.com/ckan/ideas-and-roadmap/issues/128
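>>
>> To illustrate the idea (this is not a proposed CKAN implementation, and
>> the batch size is an arbitrary choice), a generator can serialise rows a
>> batch at a time so the whole result set is never held in memory:
>>
>>     import json
>>
>>     def stream_rows(cursor, batch_size=1000):
>>         """Yield a JSON array of rows, fetching in small batches."""
>>         yield '['
>>         first = True
>>         while True:
>>             rows = cursor.fetchmany(batch_size)
>>             if not rows:
>>                 break
>>             for row in rows:
>>                 if not first:
>>                     yield ','
>>                 first = False
>>                 yield json.dumps(row)
>>         yield ']'
>>
>> A WSGI application can return such a generator directly as its response
>> iterable, so each chunk is sent as the client consumes it.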
>>
>> Alice
>>
>>
>> On 06/03/15 00:32, Alex Gartner wrote:
>>
>>   Hi everyone,
>>
>>  I have a question about the datastore API being used by someone with
>> *bad intentions* to achieve a denial of service of some kind. Since the
>> project I'm working on plans to have datastore tables with around
>> 1 million rows, I'm concerned this might be used against the system.
>> To give a few examples:
>>
>> - The following request to the "datastore_search_sql" endpoint takes
>> around 2 minutes on my laptop for a datastore table with 2,500 rows and
>> a limit of 250,000. Without the limit I imagine it would take around 50
>> minutes (the cross join would put 2,500 x 2,500 rows in the response):
>>
>>     curl -G localhost:5000/api/action/datastore_search_sql \
>>       --data-urlencode "sql=SELECT a.* FROM datastore_table a, datastore_table b LIMIT 250000"
>>
>> - Accessing the "datastore_search" endpoint with a limit of 250,000
>> also takes about 2 minutes (for a table of around 500,000 rows):
>>
>>     curl "http://localhost/api/action/datastore_search?resource_id=resource_id&limit=250000"
>>
>>  I imagine that somebody hitting the datastore API with X simultaneous
>> requests for data with a limit of 1 million could block the server
>> (while using up all the db connections).
>>
>>  Is there a way to set a hard limit, which cannot be overridden by the
>> user, on the number of results returned by a datastore query (to force
>> pagination, in a way)?
>>
>>  And in more general terms, what is the best practice for avoiding such
>> issues? Are there CKAN settings that help with this? Should we set up
>> the web server (nginx, apache) to enforce a stricter limit on the number
>> of simultaneous HTTP requests to the datastore API endpoints?
>>
>>  Thanks for the help,
>> Alex