[ckan-discuss] Varnishing Squids and Beakers

William Waites william.waites at okfn.org
Wed Sep 1 01:04:53 BST 2010

I've been looking a bit at caching to try to improve
CKAN performance. Rufus and John had suggested
looking at Varnish; John had already taken a bit of a
look and found the configuration language complicated.

To begin with, earlier today I added caching in the
ckan application itself, plus a bit of JavaScript
so that the page state looks right whether or not
you are logged in. The /package/list and
/tag/xyz pages should be appreciably faster, but this
is a bit of a stopgap.
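
As a rough illustration of what caching inside the application means, here is a minimal sketch using only the Python standard library. It is not Beaker's API and not the code that was actually added to CKAN; `cached`, `render_package_list` and the 60-second lifetime are made-up names and values for the example.

```python
import time
from functools import wraps


def cached(expire_seconds):
    """Cache a function's return value per argument tuple for a limited time."""
    def decorator(func):
        store = {}  # maps args -> (expiry timestamp, cached value)

        @wraps(func)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]  # still fresh: skip the expensive call
            value = func(*args)
            store[args] = (now + expire_seconds, value)
            return value
        return wrapper
    return decorator


calls = []  # records each real (uncached) render, for demonstration only


@cached(expire_seconds=60)
def render_package_list(page):
    # Hypothetical stand-in for a slow page render that hits the database.
    calls.append(page)
    return "<html>packages starting with %s</html>" % page


render_package_list("B")
render_package_list("B")  # served from the cache, no second render
```

The second call returns the stored page without running the slow function again, which is the whole effect. A page cached this way is the same for every visitor, so anything that differs per user has to be handled outside the cached output.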

I had a look at Varnish and I agree that the configuration
language is complicated. In fact by default Varnish
disregards cache control headers and in general behaves
in a way that is not very standards compliant. I have no doubt
that it is very fast -- if you are willing to spend the
effort to customise its configuration for the exact
layout of pages and headers of each web site it is
going to be used with. In other words,
there is a large administrative burden.

So I decided to change tack and see where the Squid
proxy has gotten to in the decade or so since I last met
it. Squid is a general purpose caching proxy that can
be configured as an HTTP accelerator. The configuration
is simple. You tell it which web servers serve
which sites. The web servers make sure to set the
cache control headers appropriately.

Here are some results from my testing against
http://de.ckan.net/package/list?page=B, which is an
example of a slow page. Except for the first, which
only did 100 requests, the tests were set to 8
simultaneous connections and a total of 1000
requests.

No caching of any kind:
    Requests per second:    0.44 [#/sec] (mean)

Beaker Cache (filesystem):
    Requests per second:    43.16 [#/sec] (mean)

Squid, with cache control headers set correctly:
    Requests per second:    421.33 [#/sec] (mean)

The results are clear. Using the application cache is
about 100 times faster than doing nothing. Using
Squid is about 1000 times faster. (Doing both wouldn't
necessarily help very much.)
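
The round figures can be checked from the three measurements quoted above. The short calculation below uses only those numbers; the variable names are just for the example. It gives roughly 98x for the application cache and 958x for Squid, with Squid about ten times faster than the application cache.

```python
# Requests per second reported in the three test runs above.
baseline = 0.44   # no caching of any kind
beaker = 43.16    # Beaker cache (filesystem)
squid = 421.33    # Squid with cache control headers set

beaker_speedup = beaker / baseline
squid_speedup = squid / baseline
squid_vs_beaker = squid / beaker

print("Beaker vs no cache: %.0fx" % beaker_speedup)
print("Squid vs no cache:  %.0fx" % squid_speedup)
print("Squid vs Beaker:    %.1fx" % squid_vs_beaker)
```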

I'm sure we could squeeze a bit more performance out
of it if we used Varnish, but probably not an order of
magnitude and I don't think it is worth the
administrative burden.

If we set up a production Squid instance (or farm),
with a bare minimum of work it can cache for any
number of sites, not just CKAN.

For the Python coders, here's what you have to do
to set the headers properly so that Squid will cache
the page:

       from time import gmtime, strftime

       # Drop the headers that tell caches not to store the page.
       del response.headers["Pragma"]
       del response.headers["Cache-Control"]
       # Give caches a modification time to work with.
       response.headers["Last-Modified"] = strftime(
           "%a, %d %b %Y %H:%M:%S GMT", gmtime())

A further advantage is that the *browsers* will also
understand these cache-control headers and do their
own caching - just setting them properly without
even using Squid should result in some subjective
performance improvements.
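
When combing through many controllers, the header changes above could be wrapped in a small helper. This is only a sketch: `make_cacheable` and its `max_age` option are hypothetical and not part of CKAN, and it assumes the response headers behave like an ordinary dict. It uses `pop` so that a missing header does not raise a `KeyError`, and the standard library's `email.utils.formatdate` to produce the HTTP date.

```python
from email.utils import formatdate


def make_cacheable(headers, max_age=None):
    """Remove cache-defeating headers and stamp a Last-Modified time.

    `headers` is any dict-like mapping of response headers.
    """
    # pop() with a default avoids a KeyError when a header is absent.
    headers.pop("Pragma", None)
    headers.pop("Cache-Control", None)
    # formatdate(usegmt=True) yields an HTTP date ending in "GMT".
    headers["Last-Modified"] = formatdate(usegmt=True)
    if max_age is not None:
        # Optional explicit lifetime; an assumption, not in the snippet above.
        headers["Cache-Control"] = "public, max-age=%d" % max_age
    return headers


# Example with a plain dict standing in for response.headers.
headers = {
    "Pragma": "no-cache",
    "Cache-Control": "no-cache",
    "Content-Type": "text/html",
}
make_cacheable(headers)
```

In a controller this would be called on `response.headers` just before returning the page, and only for pages that look the same to every visitor.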

That's all for now. I suggest we dedicate a machine
to just running Squid (the more RAM the better, and
big discs are good) and put it between the world and
the ckans. Oh, and comb through the controllers
setting the headers correctly where appropriate...


William Waites           <william.waites at okfn.org>
Mob: +44 789 798 9965    Open Knowledge Foundation
Fax: +44 131 464 4948                Edinburgh, UK

RDF Indexing, Clustering and Inferencing in Python
