[wdmmg-discuss] Hierarchical keys [was: Re: First stab at a web API]
Alistair Turnbull
apt1002 at goose.minworks.co.uk
Thu Apr 15 11:34:16 UTC 2010
Last night Dave, Rufus and I had a long discussion about how to handle
hierarchical keys, such as COFOG (function), POG (programme) and NUTS
(region). Dave managed to persuade us that the store should understand
these hierarchies.
COFOG (Classification Of Function Of Government) is our best example. It
is a three-level hierarchy. A typical level-1 code is 04 ("Economic
affairs"). A typical level-2 code is 04.3 ("Fuel and energy"). A typical
level-3 code is 04.3.3 ("Nuclear energy"). The complete list is here:
http://ckan.net/package/cofog
If you have a look at the prototype, you will see that it often breaks the
data down by level-1 function, but not usually by level-2 or level-3
function. It is this sort of partial breakdown that Dave persuaded us to
support.
Rufus and I considered several representations. We propose that the
Simplest Thing That Could Possibly Work would be to have three separate
keys called "cofog1", "cofog2" and "cofog3", one for each level.
This is a redundant representation, in the sense that if you know the
"cofog3" of an account you can infer its "cofog2" and "cofog1". We are
prepared to put up with this redundancy in exchange for having all three
levels pre-computed, named, and searchable.
The store will publish meta-data describing the relationship between these
columns. The key "cofog3" will have a "parent" attribute, with value
"cofog2". Similarly, the parent of "cofog2" will be "cofog1". In addition,
each enumeration_value will have a parent attribute too. For example, the
code "04.3.3" will have a parent of "04.3", which will in turn have a
parent of "04". This will require a small change to our domain model.
With this design, I don't think any extra functionality needs to be added
to the aggregator to support hierarchical filters and breakdowns. The
request can specify the level of filtering or breakdown simply by choosing
one of the three COFOG columns. On the contrary, this ability removes the
requirement for prefix matching in the filtering commands. I will
downgrade it to exact matching, as this solves some potential problems.
To illustrate the proposed design, Dave asked me to annotate his use cases
below. For each one, he asked me to give the URL that performs the
calculation, and the number of results returned. See below for my answers.
Except where noted, these answers use the current version of the API, but
assume a future (the proposed) version of the database structure.
On Wed, 7 Apr 2010, Dave Boyce wrote:
> 1. Spending by all top-level functions, aggregate all locations, for a
> single year. Actual spending values.
/api/aggregate
?slice=cra
&exclude-spender=yes
&breakdown-cofog1=
&start_date=2003-01-01
&end_date=2004-01-01
10 results. No need to mention "region" because aggregation is the default
behaviour.
> 2. Spending by single top-level function and next two sub-levels,
> aggregate all locations, single year. Actual spending values.
/api/aggregate?
?slice=cra
&exclude-spender=yes
&filter-cofog1=04
&breakdown-cofog2=
&breakdown-cofog3=
&start_date=2003-01-01
&end_date=2004-01-01
20 results (one for each sub-sub-function of "04 Economic affairs").
The inclusion of "breakdown-cofog2=" is optional. If you omit this
parameter, it will not change the number of results, nor the calculated
spending amounts, as the data is already broken down by cofog3. However,
including the redundant cofog2 breakdown will create an extra axis. This
makes the server return the cofog2 code for each spending amount returned,
which might be useful.
> 3. Spending by top-level funding, aggregate all locations, all available
> years. Actual spending values, spending in real terms, spending as %gdp.
I assume "funding" is a typo for "function".
/api/aggregate
?slice=cra
&exclude-spender=yes
&breakdown-cofog1=
80 results (10 cofog1 functions by 8 years).
The current API can't do spending in real terms or as % of GDP yet. This
will probably involve adding one of the following request parameters:
&per-gbp2003=
&per-gdp=
These extensions will rely on time series data for inflation (the then
value of a 2003 pound) and for GDP. Other ideas for time series are
welcome.
> 4. Total aggregate spending, aggregate all locations, all available
> years. Actual spending values, spending in real terms, spending as %gdp.
/api/aggregate
?slice=cra
&exclude-spender=yes
&per-gbp2003=
&per-gdp=
8 results in each case (one for each year).
> 5. Spending by function and sub-function. Aggregate all regions, or a
> single region. Actual spending values.
Assuming a single year.
/api/aggregate
?slice=cra
&exclude-spender=yes
&breakdown-cofog1=
&breakdown-cofog2=
&start_date=2003-01-01
&end_date=2004-01-01
For single region, add:
&include-region=SCOTLAND
68 results in each case (one for each level-2 COFOG code). Redundant
breakdown by cofog1 is optional but probably useful.
> 6. Spending by function and all sub-functions. All regions. Actual
> spending, spending by capita.
Assuming a single year.
/api/aggregate
?slice=cra
&exclude-spender=yes
&breakdown-cofog1=
&breakdown-cofog2=
&start_date=2003-01-01
&end_date=2004-01-01
68 results. The current API cannot calculate per-capita spending. This
will probably be achieved by adding the following request parameter:
&per-population2003=region
Still 68 results. The calculation will rely on 2003 population statistics
attached to the region objects (enumeration_values) in the store. (To get
a time series we will have to make the approximation that population does
not change over time). Ideas for other attributes of regions, or indeed of
any other axis, are welcome.
> 7. Spending by all top-level functions, all regions. Actual spending,
> spending by capita, spending per capita indexed (% deviation from
> national average).
Assuming a single year.
/api/aggregate
?slice=cra
&exclude-spender=yes
&breakdown-cofog1=
&breakdown-region=
&start_date=2003-01-01
&end_date=2004-01-01
&per-population=region
120 results (10 regions by 12 level-1 COFOG codes). The current API can't
do % deviation from average. This will probably involve adding the
following request parameter:
&mean=100
Still 120 results.
Depending on how we end up coding the regions, you might want to add:
&exclude-region=NON-IDENTIFIABLE
&exclude-region=OUTSIDE%20UK
This question will become clearer later. For now, the important thing is
that the API can handle it.
> 8. Spending by all top-level functions, single region. Actual spending,
> spending by capita, spending per capita indexed (% deviation from
> national average).
/api/aggregate
?slice=cra
&exclude-spender=yes
&breakdown-cofog1=
&include-region=SCOTLAND
&per-population=region
10 results (one for each level-1 COFOG code).
&mean=100
Beware: this will normalise to Scotland, and since there are no
sub-regions all results will be 100. To normalise to the average spending
for the whole UK instead, you need to request data for the whole UK, by
removing the "include-region=SCOTLAND" part, as follows:
/api/aggregate
?slice=cra
&exclude-spender=yes
&breakdown-cofog1=
&per-population=region
&mean=100
120 results. Then extract the 10 results you want.
---
I apologise for any mistakes in the above. Obviously I have not tested
these URLs, as they won't work on the current database. I have tried to be
careful but there's a lot of detail.
The above examples generally avoided the following limitations of the
API, which should nonetheless be noted:
- Cannot aggregate over years.
- Cannot divide by something that depends on two axes.
- Cannot divide by something that depends on an axis and also on time.
Alistair
More information about the openspending
mailing list