[wdmmg-discuss] Interfacing the presentation layer to the server
Alistair Turnbull
apt1002 at goose.minworks.co.uk
Tue Mar 30 16:51:46 UTC 2010
I've just spoken with Dave Boyce who will be writing the snazzy Flash
presentation layer for our site, to try to agree what kind of API it will
use to request data from the server. This post is a write-up of the
conversation, before I forget it.
Scope
-----
In the short term, we are going to concentrate on the CRA data set only.
However, we do think it is worth parameterising the things which are
likely to be different for other data sets in the future. The main
examples of future data sets are the local spending reviews and COINS
(hopefully!).
The main thing that needs to be parameterised (and which was not
parameterised in the prototype) is the list of axes along which the data
can be broken down. For the CRA, the axes are: Dept, Region, POG, COFOG.
For other data sets, the axes are likely to be different.
More generally, it will no longer do to compile the data into the Flash
program. The Flash program will instead need to request data from the
server.
The time axis
-------------
We agreed that the time axis is a special case. It differs from the other
axes in several ways:
- The time axis is dense. By this I mean that if you find a (Dept,
Region, POG, COFOG, time) combination for which the amount of spending is
non-zero, and then change the time only, the new combination will probably
also have a non-zero amount of spending. This is often not true of the
other axes.
- The time axis is ordered.
- The time axis is roughly the same for all data sets.
- Users will generally not want to sum along the time axis.
Dave also requested that we treat the values on the time axis as opaque
strings, not as dates. In the data available to us, the column headings
are of the form "2008-09", for example. The data in that column is a total
for the period from 1st April 2008 to 31st March 2009, I think. Until now
I have been loading all the spending in that column into the database with
a timestamp of 1st April 2008, but Dave is quite right to point out that I
am just inventing precision that doesn't really exist.
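To make the difference concrete, here is a rough sketch (Python, purely for
illustration; the field names are invented) of the two ways of recording a
figure from the "2008-09" column:

    from datetime import date

    # What I have been doing: inventing a precise start date for the period.
    row = {"dept": "Dept032", "amount": 56.1, "time": date(2008, 4, 1)}

    # What Dave suggests: keep the column heading as an opaque label.
    row = {"dept": "Dept032", "amount": 56.1, "time": "2008-09"}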
Division of labour: aggregation
-------------------------------
We discussed whether aggregation should be done on the server or the
client. (By "client" I mean the Flash program that Dave is writing,
which will run in the user's web browser).
The advantage of doing aggregation on the server is that it reduces the
amount of data that needs to be sent to the client. To take Dave's
example, the prototype only ever shows about 100 numbers at the same time.
It is probably never worth downloading all 150,000 numbers in the
database.
The advantage of doing aggregation on the client is that client CPU is plentiful.
It doesn't much matter if the client has to pause to do a large
calculation; it will not affect anybody else. However, if the server has
to pause then that limits the number of users, and it becomes a
significant expense.
We might be able to get the best of both worlds, by doing the calculations
on the server and caching the results. It might have to be quite a large
cache, because there are many subtly different requests that the client
can make. The hope is that a small subset of the possible requests will
account for 95% of the traffic, so that the server will only have to stop
and think rarely.
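To give an idea of what I mean by a cache, here is a minimal sketch in
Python. Everything here is hypothetical: aggregate() stands in for the real
(expensive) database query, and the dictionary stands in for whatever cache
store we end up using.

    import hashlib
    import json

    _cache = {}  # in practice perhaps memcached or files on disk

    def aggregate(slice_name, axes, filters):
        # Placeholder for the real database aggregation query.
        raise NotImplementedError

    def aggregate_cached(slice_name, axes, filters):
        # Normalise the request into a stable key, so that equivalent
        # requests hit the same cache entry regardless of parameter order.
        key = hashlib.sha1(json.dumps({
            "slice": slice_name,
            "axes": sorted(axes),
            "filters": sorted(filters.items()),
        }, sort_keys=True).encode("utf-8")).hexdigest()
        if key not in _cache:
            _cache[key] = aggregate(slice_name, axes, filters)
        return _cache[key]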
We decided to try using a cache at first. If that works, it will have been
the right decision. If it doesn't work, we will give up and instead do the
calculations on the client. We calculated that the client will only have
to download about 2Mb of data to get a copy of the entire data set, so
this fall-back option is certainly feasible.
Division of labour: search
--------------------------
Dave wants to build a search facility into the client. This is fine by me.
He will probably not need to download any extra data to make it work,
beyond what he would have needed anyway for the presentation.
There will probably also be a search facility for those browsing the store
in HTML form: the minority of users who click on the "Edit" button.
However, this is my problem, not Dave's.
New requirements: aggregator
----------------------------
My first stab at the API for the aggregator seems to be roughly right in
some respects, and completely wrong in others. I need to make the
following changes:
- It must return time series, not just totals for the period. As
mentioned earlier, we want to treat the time axis as dense, while keeping
the other axes sparse.
- It must support filtering by key/value pairs. For example, if the user
drills down to a particular COFOG code and region, then the server must be
able to return data for just that COFOG and region.
With these changes, the JSON returned (server to client) by the aggregator
will probably turn out something like this:
{
    "metadata": {
        "slice": "cra",
        "filters": {"cofog": "2.4.1", "region": "London"},
        "axes": ["dept", "cofog", "pog", "region"],
        "times": ["2003-04", "2004-05", "2005-06", etc... ]
    },
    "results": [
        [
            ["Dept032", "2.4.1", "S71000502", "London"],
            [56.1, 50.1, 51.5, etc... ]
        ], [
            ["Dept032", "2.4.1", "S71000503", "London"],
            [19.8, 21.5, 20.0, etc...]
        ], etc...
    ]
}
(The "slice", "filters" and "axes" fields in the metadata are only there
to confirm that the server has understood the client's request).
This JSON format is reasonably compact (ignoring white space). If you
downloaded the entire data set in this form, broken down as much as
possible, it would only come to about 2Mb.
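As an aside, unpacking this format is only a few lines of code. The real
client is Flash, but here is a sketch in Python just to show the shape of
the data (the function name is made up):

    import json

    def parse_aggregate_response(raw_json):
        # Turn the aggregator's JSON into a mapping from axis values to
        # a time series of amounts.
        response = json.loads(raw_json)
        times = response["metadata"]["times"]
        series = {}
        for key, values in response["results"]:
            # 'key' is e.g. ["Dept032", "2.4.1", "S71000502", "London"],
            # in the same order as metadata["axes"]; 'values' lines up
            # with metadata["times"].
            series[tuple(key)] = dict(zip(times, values))
        return series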
I need to work out what the request URL should look like for the above
JSON response. Probably not far off my first stab, but the filtering needs
some thought. I will try to integrate the filtering with the ugly
"spender_key", "spender_value" mechanism, in the hope of coming up with
something that is tidier overall.
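Purely as a strawman (none of the parameter names below are settled), I
imagine the request ending up as something like:

    GET /api/aggregate?slice=cra&breakdown=dept,pog&cofog=2.4.1&region=London

i.e. one parameter naming the slice, one listing the axes to break down by,
and one key/value pair per filter.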
New requirements: other requests
--------------------------------
We also identified two other things that the client will need to ask of
the server:
- Slice metadata. This includes what axes are available, for example.
- Axis metadata. This includes an index explaining what all the codes
mean, for example.
We haven't gone into any detail yet.
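Purely to have something to react to (all of this is a guess), the slice
metadata might look vaguely like:

    {"slice": "cra",
     "axes": ["dept", "cofog", "pog", "region"],
     "times": ["2003-04", "2004-05", etc... ]}

and the axis metadata might map each code to a human-readable description:

    {"axis": "cofog",
     "labels": {"2.4.1": "...", etc... }}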
Best wishes,
Alistair