[ckan-discuss] CKAN Performance

Mon Aug 30 00:09:55 BST 2010

 Take this page, for example,

    http://de.ckan.net/tag/destatis

It does some really big, complicated joins. The
culprit seems to be the SQLAlchemy eagerload_all
function which loosely translates into a LEFT OUTER
JOIN.

For those that might not be so familiar with SQL,
the result of this type of query is one row for each
combination of package and tag. Each row has
complete information about the package (so this
is repeated many times) and complete information
about the tag (so this is repeated many times).

(has that page loaded for you yet?)

Not only is this pretty hard on the database from
an I/O perspective, the ckan application has to
do quite a lot of work as well. For each row it
has to look and see if it already has seen that
package, create it if necessary, then check to see
if it has seen the tag before, create it if necessary.
This eats a lot of CPU cycles.

Now this page is only producing a table of package
names and the tags that go with it. But the data
that is transferred between the database and ckan
is many copies of the entire package metadata.

(has that page loaded for you yet?)

To put some rough numbers on it, a similar query
is done by

    http://de.ckan.net/package/list?page=B

(that page will have loaded by now because it is
being cached).

It does two of these big queries. Rendered
as CSV, directly from the database, each of those
queries returns about 2.8MB of data. Cut down by
hand to only retrieve the package and tag names
returns about 42kB of data. That's about 67x
more information being transferred than there
needs to be. Clearly there is some room for
optimisation.

This excess, however brings convenience. It
means that, once the server has finished thinking,
you have complete python objects that you can
manipulate and do fun things with like render
in templates.

I can see three paths to make performance better.

The first is to just cache the output like was
done with the list page. This has the disadvantage
that things like the login link or the currently
logged in user will be wrong.

The second is to try to make those queries in
a more efficient way. This would have the
disadvantage of perhaps making the code not
as clear to read as it would mean using less
ORM magic and getting closer to the SQL.

The third is to leave the big queries alone but
do a simpler query first to check if anything
has changed. If nothing has changed, set the
cache headers in the HTTP response properly.
This relies on the users' browsers and any
proxies involved behaving correctly.

A variant on the third is to change templating
engines to one that can cache only parts of
pages and use that as well.

I personally favour doing both the second and
third, but am quite open to suggestions and
opinions on the matter.

(the first link has probably loaded by now...
I hope)

Cheers,
-w

-- 
William Waites           <william.waites at okfn.org>
Mob: +44 789 798 9965    Open Knowledge Foundation
Fax: +44 131 464 4948                Edinburgh, UK

RDF Indexing, Clustering and Inferencing in Python
		http://ordf.org/