[ckan-dev] Performance CKAN / Configuration Parameters

hotz hotz at informatik.uni-hamburg.de
Wed Aug 20 07:33:55 UTC 2014


Hi Ian,

we have roughly 10,000 datasets. BUT each one has many extra fields, and
one of them contains the full text of the referenced resources. That field
can be huge; however, only 20 results are shown on the resulting web pages.

Thus, it would help not to retrieve the complete data_dict from Solr
(see get.py: data_dict['fl'] = 'id data_dict')

but only specific parts of it. A problem, however, could be that not
every piece of information stored in the data_dict field has a
corresponding (extra) field in the Solr schema that could be referenced
with 'fl'...

Perhaps it is not a good idea to put the full text into an extra field.
However, it was the most direct way to do it and still get indexing...
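For illustration, limiting 'fl' as discussed above could look roughly like the following sketch (the helper function and the field names are invented; whether a field can appear in 'fl' depends on it being stored in your Solr schema):

```python
# Hypothetical sketch: build Solr search parameters that restrict the
# returned stored fields via 'fl' instead of fetching the full data_dict.
def build_search_params(query, return_fields=None, rows=20):
    """Return a dict of Solr query params; 'fl' limits returned fields."""
    params = {'q': query, 'rows': rows}
    if return_fields:
        # Only these stored fields come back from Solr, keeping the
        # response small even if data_dict itself is huge.
        params['fl'] = ' '.join(return_fields)
    else:
        # CKAN's default behaviour: fetch id plus the whole data_dict.
        params['fl'] = 'id data_dict'
    return params

params = build_search_params('forest', ['id', 'title', 'notes'])
print(params['fl'])  # id title notes
```
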

Is it possible to add a field to the Solr schema without many changes?
This field would then live outside the data_dict; it would only be
needed for indexing and for highlighting text passages.
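Adding such a field might look like the following sketch in Solr's schema.xml (the field name is invented, and 'textgen' assumes the field type defined in a typical CKAN schema; stored="true" is only needed if highlighting should return the text):

```xml
<!-- Hypothetical addition: a full-text field kept outside data_dict,
     used only for indexing and highlighting. -->
<field name="res_fulltext" type="textgen" indexed="true" stored="true"/>
```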

Reproducing the environment: CKAN 2.1.2, the harvest extension, the
spatial extension, and our own harvesters (however, this whole discussion
is about retrieving, not harvesting; the harvesters run on a different
machine!). We have our own web pages, but the performance results were
the same for the default CKAN web pages. Full-text analysis with Apache
Tika adds an extra field containing the full text as a string.

Best wishes for now and thank you for the reply,

Lothar

On 19.08.2014 13:42, Ian Ward wrote:
> Hello Lothar,
>
> How many datasets does your test instance have? Are you using many
> extra fields? tag vocabularies? Which pages are you hitting?
>
> There are lots of places where CKAN can get slow but that are easy to
> fix or work around. Most instances don't push CKAN very hard, so we
> tend to fix the problems only as they come up. If you're hitting some
> of those problem areas, tweaking server settings isn't going to
> help much.
>
> Would you help us to reproduce your test environment?
>
> On Fri, Aug 15, 2014 at 12:47 PM, hotz <hotz at informatik.uni-hamburg.de> wrote:
>> Hi Alice,
>>
>> thank you very much for your reply!
>> I'm still checking out things and will later answer in more detail.
>>
>> For now:
>> - we increased the maximum number of Postgres connections from the
>> default 100 to 120 on a 2 GB development machine, and to 1000
>> connections (with 24 GB shared memory) on a 32 GB production machine.
>> This (of course) allowed more concurrent requests.
>> (When there are too many connections, there is a log entry in CKAN and
>> in the DB log, which helped with tuning.)
>>
>> - we moved from apache/prefork to apache/worker according to
>> [http://blog.dscpl.com.au/2009/03/load-spikes-and-excessive-memory-usage.html].
>> However, I'm still not sure if this is needed because
>> mod_wsgi runs in daemon mode in our case.
>>
>> - with this configuration, on an 8-core machine with 24 GB (Postgres
>> is on a different machine; Apache with mod_wsgi, no datastore, no
>> Varnish), the following scenario yields the data shown below [1]:
>>
>> Scenario: start with 128 concurrent users; each clicks through three
>> pages (start, query, detail of one dataset), waits 5 seconds, and
>> starts again, for 5 minutes. "Users" are at two locations (Hamburg
>> and Bremen).
>> --> Average page load is between 35s and 75s. :-(
>> --> But no failures. :-)
>>
>> With 16 concurrent users, page loads were well below 10s.
>> With more than 128 users we got failures, which still have to be
>> analyzed.
>>
>> Next steps: add Varnish; tune the DB, Apache, and WSGI according to
>> your suggestions.
>>
>> Perhaps use uwsgi instead of mod_wsgi (if CKAN allows it...).
>>
>> That's for now, best wishes and thanx again!
>> Lothar
>>
>> [1]
>>
>> ===================================
>> 5-minute run
>> 128 users starting (two locations, HH and HB).
>> Three pages: start, query, show detail.
>> Two pages (start and query) are reported below.
>> "Test" means one run through the scenario.
>>
>>                 HH      HB
>> Users           64      64
>> Req/sec         37.9    31.3
>> #Tests         119      83
>> Page Start      38.9s   47.6s   (average)
>> Page Query      57.1s   75.8s   (average)
>> Failures         0       0
>>
>> ===================================
>>
>>
>> Apache configuration:
>>
>> KeepAlive OFF
>>
>> <IfModule mpm_worker_module>
>>      StartServers          50
>>      MinSpareThreads      25
>>      MaxSpareThreads      2500
>>      ThreadLimit          2500
>>      ThreadsPerChild      50
>>      ServerLimit         1300
>>      MaxClients          1300
>>      MaxRequestsPerChild  0
>> </IfModule>
>>
>> WSGI:
>> processes=2 threads=120
>>
>> ===> This might be the reason why we get failures when 256 users
>> have to be served.
>>
>> 1300 MaxClients / 50 ThreadsPerChild = 26 processes
>>
>> ps -dealf | grep apache | grep www_data | wc
>> --> 27 Apache processes (1 wait)
>>
>> ps -dealf | grep ckan_default | grep www_data | wc
>> --> 2 ckan_default processes
>>
>> ===> This has to be aligned. It looks like 26 Apache processes send
>> requests to only 2 ckan_default processes. Main question: what is the
>> relation between Apache processes and the WSGI daemon processes?
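A back-of-the-envelope check of the numbers above (a sketch of the arithmetic only; the variable names are illustrative):

```python
# Capacity arithmetic for the configuration described above.
apache_max_clients = 1300       # MaxClients
apache_threads_per_child = 50   # ThreadsPerChild
wsgi_processes = 2              # WSGIDaemonProcess processes=2
wsgi_threads = 120              # WSGIDaemonProcess threads=120

# Apache worker processes needed to reach MaxClients.
apache_processes = apache_max_clients // apache_threads_per_child
# Maximum requests the WSGI daemon can handle concurrently.
ckan_concurrency = wsgi_processes * wsgi_threads

print(apache_processes)   # 26 Apache worker processes
print(ckan_concurrency)   # 240 concurrent requests in the CKAN daemon
# Apache can accept 1300 connections, but the daemon serves only 240 at
# a time, which would explain failures appearing around 256 users.
```
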
>>
>>
>>
On 11.08.2014 12:08, Alice Heaton wrote:
>>
>>> Hello,
>>>
>>> You may already be aware of these things, but just to throw in some ideas:
>>>
>>> - A single process will only run on a single CPU, so setting
>>> 'processes=2' means your CKAN application will only run on 2 CPUs.
>>>    This might be what you want (to reserve the other CPUs for
>>> Postgres/Jetty etc.), but it's good to keep in mind;
>>>
>>> - I don't think ServerLimit affects wsgi daemon mode - though I may be
>>> wrong about this;
>>>
>>> - With processes=2 and threads=30, you will serve at most 2*30 concurrent
>>> requests;
>>>
>>> - It's important to remember that all requests, including those for static
>>> files, go through mod_wsgi.
>>>    If browsers are firing, say, 8 concurrent requests then that leaves you
>>> with 2*30/8 concurrent clients
>>>    (this is very approximate, it will depend on the time taken for each
>>> request, client caching, etc. however
>>>    it's a good way to get an idea of what is happening). The best way to
>>> deal with this is to add a caching
>>>    server in front (say nginx or varnish) to ensure static files are
>>> cached;
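Putting a caching server in front, as suggested above, could look roughly like this nginx sketch (all paths and ports are assumptions for a typical CKAN install; Varnish would work equally well):

```
# Illustrative nginx front-end: serve CKAN's static assets directly so
# those requests never reach Apache/mod_wsgi.
server {
    listen 80;

    # CKAN 2.x static files (actual path depends on your install).
    location /base/ {
        alias /usr/lib/ckan/default/src/ckan/ckan/public/base/;
        expires 1h;
    }

    # Everything else goes to Apache/mod_wsgi behind nginx.
    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
    }
}
```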
>>>
>>> - You don't mention PostgreSQL settings, and whether you use the datastore
>>> (and if so with how many rows).
>>>     On our setup (with the datastore and tables with over 3,000,000 rows),
>>> PostgreSQL is the slow point.
>>>
>>>     The default PostgreSQL settings are very conservative. The first thing
>>> to do there is to increase shared_buffers -
>>>     the recommended value is about 25% of available memory. The next one to
>>> set is effective_cache_size; this
>>>     should be roughly shared_buffers plus the amount of system caches.
>>>
>>>     What will make a real difference for a large database is to set
>>> work_mem. This has to be tuned carefully, as you are
>>>     setting the memory available for each operation in a query - so a query
>>> with 12 joins will use up to 12*work_mem.
>>>     If you set this too low, then your sorts/joins will happen on disc -
>>>     which can be very slow. If you set this too high, you might
>>>     run out of memory!
>>>
>>>     The best way to work this out is to enable slow query logging, and look
>>> for the slow queries. explain analyze will tell you
>>>     how much memory they need, and whether the operations happen on disc or
>>> in memory. Increase work_mem to make them
>>>     happen in memory (if possible).
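Translated into a postgresql.conf sketch, the tuning advice above might look like this (the values are illustrative only, assuming a machine with 24 GB RAM; work_mem in particular must be derived from your own slow-query analysis):

```
# postgresql.conf - illustrative values for a 24 GB server
shared_buffers = 6GB                 # ~25% of available memory
effective_cache_size = 18GB          # shared_buffers + estimated OS cache
work_mem = 32MB                      # per sort/join; tune via EXPLAIN ANALYZE
log_min_duration_statement = 500     # log queries slower than 500 ms
```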
>>>
>>> - In postgres, you should also check the number of allowed connections.
>>> Depending on your settings/plugins, CKAN may make more
>>>    than one connection per request. With 2*30 workers, if each worker makes
>>> more than one connection then you will run out.
>>>
>>> I'm interested in hearing about anything else you find that affects
>>> performance, so please let us know !
>>>
>>> Best Wishes,
>>> Alice Heaton
>>>
>>> On 06/08/14 18:54, hotz wrote:
>>>> Hi all,
>>>>
>>>> we do performance tests with the following setup:
>>>> - CKAN 2.1.2
>>>> - Search queries via the default ckan-portal and a web-portal
>>>> - Ramp tests of 50, 100, 200, ..., 600, ..., 1000, ..., 5000 users per second (!)
>>>>    (we expect such numbers in the early on-line phase of our portal)
>>>> - 24 GB RAM, 8 CPUs
>>>>
>>>> The following parameter configurations:
>>>> 1) apache2.conf
>>>>   ServerLimit 300
>>>>   MaxClients 300
>>>>   for all occurrences
>>>>
>>>> 2) Jetty Java_Options:
>>>> -Xms512M -Xmx4g
>>>>
>>>> 3) virtual host ckan_default:
>>>> WSGIDaemonProcess ckan_default display-name=ckan_default processes=2
>>>> threads=30
>>>>
>>>> We get a mean response time of about 30 seconds for 600 concurrent
>>>> users per second, and several errors. Altogether, we are not happy
>>>> with this.
>>>>
>>>> The CPUs are 50% busy, and RAM usage is only about 5 GB (of the 24 GB).
>>>> The ckan-portal and the web-portal show the same results.
>>>>
>>>>
>>>> Can somebody explain the above parameters, their optimal settings,
>>>> and their influence?
>>>> E.g., do Apache workers correspond to threads? Are there multiple
>>>> Jetty processes, or only one, when CKAN is running?
>>>> Does anybody have experience in this direction or pointers to
>>>> further information?
>>>>
>>>> Best wishes,
>>>> Lothar
>>>>
>>>>
>>>>
>>> _______________________________________________
>>> ckan-dev mailing list
>>> ckan-dev at lists.okfn.org
>>> https://lists.okfn.org/mailman/listinfo/ckan-dev
>>> Unsubscribe: https://lists.okfn.org/mailman/options/ckan-dev


-- 
Dr. Lothar Hotz
HITeC e.V., University of Hamburg
Vogt-Kölln-Str. 30, 22527 Hamburg
Tel: 040/42883-2605; Fax: 040/42883-2572
E-Mail: hotz at informatik.uni-hamburg.de
WWW: www.hitec-hh.de
Private page: kogs-www.informatik.uni-hamburg.de/~hotz



