[wdmmg-discuss] Hierarchical keys [was: Re: First stab at a web API]

Thu Apr 15 11:34:16 UTC 2010

Last night Dave, Rufus and I had a long discussion about how to handle 
hierarchical keys, such as COFOG (function), POG (programme) and NUTS 
(region). Dave managed to persuade us that the store should understand 
these hierarchies.

COFOG (Classification Of Function Of Government) is our best example. It 
is a three-level hierarchy. A typical level-1 code is 04 ("Economic 
affairs"). A typical level-2 code is 04.3 ("Fuel and energy"). A typical 
level-3 code is 04.3.3 ("Nuclear energy"). The complete list is here:

 	http://ckan.net/package/cofog

If you have a look at the prototype, you will see that it often breaks the 
data down by level-1 function, but not usually by level-2 or level-3 
function. It is this sort of partial breakdown that Dave persuaded us to 
support.

Rufus and I considered several representations. We propose that the 
Simplest Thing That Could Possibly Work would be to have three separate 
keys called "cofog1", "cofog2" and "cofog3", one for each level.

This is a redundant representation, in the sense that if you know the 
"cofog3" of an account you can infer its "cofog2" and "cofog1". We are 
prepared to put up with this redundancy in exchange for having all three 
levels pre-computed, named, and searchable.

The store will publish meta-data describing the relationship between these 
columns. The key "cofog3" will have a "parent" attribute, with value 
"cofog2". Similarly, the parent of "cofog2" will be "cofog1". In addition, 
each enumeration_value will have a parent attribute too. For example, the 
code "04.3.3" will have a parent of "04.3", which will in turn have a 
parent of "04". This will require a small change to our domain model.

With this design, I don't think any extra functionality needs to be added 
to the aggregator to support hierarchical filters and breakdowns. The 
request can specify the level of filtering or breakdown simply by choosing 
one of the three COFOG columns. On the contrary, this ability removes the 
requirement for prefix matching in the filtering commands. I will 
downgrade it to exact matching, as this solves some potential problems.

To illustrate the proposed design, Dave asked me to annotate his use cases 
below. For each one, he asked me to give the URL that performs the 
calculation, and the number of results returned. See below for my answers. 
Except where noted, these answers use the current version of the API, but 
assume a future (the proposed) version of the database structure.

On Wed, 7 Apr 2010, Dave Boyce wrote:

> 1. Spending by all top-level functions, aggregate all locations, for a 
> single year. Actual spending values.

 	/api/aggregate
 	?slice=cra
 	&exclude-spender=yes
 	&breakdown-cofog1=
 	&start_date=2003-01-01
 	&end_date=2004-01-01

10 results. No need to mention "region" because aggregation is the default 
behaviour.

> 2. Spending by single top-level function and next two sub-levels, 
> aggregate all locations, single year. Actual spending values.

 	/api/aggregate?
 	?slice=cra
 	&exclude-spender=yes
 	&filter-cofog1=04
 	&breakdown-cofog2=
 	&breakdown-cofog3=
 	&start_date=2003-01-01
 	&end_date=2004-01-01

20 results (one for each sub-sub-function of "04 Economic affairs").

The inclusion of "breakdown-cofog2=" is optional. If you omit this 
parameter, it will not change the number of results, nor the calculated 
spending amounts, as the data is already broken down by cofog3. However, 
including the redundant cofog2 breakdown will create an extra axis. This 
makes the server return the cofog2 code for each spending amount returned, 
which might be useful.

> 3. Spending by top-level funding, aggregate all locations, all available 
> years. Actual spending values, spending in real terms, spending as %gdp.

I assume "funding" is a typo for "function".

 	/api/aggregate
 	?slice=cra
 	&exclude-spender=yes
 	&breakdown-cofog1=

80 results (10 cofog1 functions by 8 years).

The current API can't do spending in real terms or as % of GDP yet. This 
will probably involve adding one of the following request parameters:

 	&per-gbp2003=

 	&per-gdp=

These extensions will rely on time series data for inflation (the then 
value of a 2003 pound) and for GDP. Other ideas for time series are 
welcome.

> 4. Total aggregate spending, aggregate all locations, all available 
> years. Actual spending values, spending in real terms, spending as %gdp.

 	/api/aggregate
 	?slice=cra
 	&exclude-spender=yes

 	&per-gbp2003=

 	&per-gdp=

8 results in each case (one for each year).

> 5. Spending by function and sub-function. Aggregate all regions, or a 
> single region. Actual spending values.

Assuming a single year.

 	/api/aggregate
 	?slice=cra
 	&exclude-spender=yes
 	&breakdown-cofog1=
 	&breakdown-cofog2=
 	&start_date=2003-01-01
 	&end_date=2004-01-01

For single region, add:

 	&include-region=SCOTLAND

68 results in each case (one for each level-2 COFOG code). Redundant 
breakdown by cofog1 is optional but probably useful.

> 6. Spending by function and all sub-functions. All regions. Actual 
> spending, spending by capita.

Assuming a single year.

 	/api/aggregate
 	?slice=cra
 	&exclude-spender=yes
 	&breakdown-cofog1=
 	&breakdown-cofog2=
 	&start_date=2003-01-01
 	&end_date=2004-01-01

68 results. The current API cannot calculate per-capita spending. This 
will probably be achieved by adding the following request parameter:

 	&per-population2003=region

Still 68 results. The calculation will rely on 2003 population statistics 
attached to the region objects (enumeration_values) in the store. (To get 
a time series we will have to make the approximation that population does 
not change over time). Ideas for other attributes of regions, or indeed of 
any other axis, are welcome.

> 7. Spending by all top-level functions, all regions. Actual spending, 
> spending by capita, spending per capita indexed (% deviation from 
> national average).

Assuming a single year.

 	/api/aggregate
 	?slice=cra
 	&exclude-spender=yes
 	&breakdown-cofog1=
 	&breakdown-region=
 	&start_date=2003-01-01
 	&end_date=2004-01-01

 	&per-population=region

120 results (10 regions by 12 level-1 COFOG codes). The current API can't 
do % deviation from average. This will probably involve adding the 
following request parameter:

 	&mean=100

Still 120 results.

Depending on how we end up coding the regions, you might want to add:

 	&exclude-region=NON-IDENTIFIABLE
 	&exclude-region=OUTSIDE%20UK

This question will become clearer later. For now, the important thing is 
that the API can handle it.

> 8. Spending by all top-level functions, single region. Actual spending, 
> spending by capita, spending per capita indexed (% deviation from 
> national average).

 	/api/aggregate
 	?slice=cra
 	&exclude-spender=yes
 	&breakdown-cofog1=
 	&include-region=SCOTLAND

 	&per-population=region

10 results (one for each level-1 COFOG code).

 	&mean=100

Beware: this will normalise to Scotland, and since there are no 
sub-regions all results will be 100. To normalise to the average spending 
for the whole UK instead, you need to request data for the whole UK, by 
removing the "include-region=SCOTLAND" part, as follows:

 	/api/aggregate
 	?slice=cra
 	&exclude-spender=yes
 	&breakdown-cofog1=
 	&per-population=region
 	&mean=100

120 results. Then extract the 10 results you want.

---

I apologise for any mistakes in the above. Obviously I have not tested 
these URLs, as they won't work on the current database. I have tried to be 
careful but there's a lot of detail.

The above examples generally avoided the following limitations of the 
API, which should nonetheless be noted:

  - Cannot aggregate over years.
  - Cannot divide by something that depends on two axes.
  - Cannot divide by something that depends on an axis and also on time.

 	Alistair