[wdmmg-discuss] A data model
Alistair Turnbull
apt1002 at goose.minworks.co.uk
Tue Feb 16 04:19:10 UTC 2010
De-lurk
=======
Hello! I am a contractor, and I hope I will be working on the Where Does
My Money Go project for the next few months. My skills are in software
engineering.
Meeting today
=============
Earlier today I met Rufus Pollock and John Bywater. Our goal was to work
out what kind of store we're going to build for the data that we have, and
for the data that we expect to get in the future. In the rest of this post
I'm going to try to summarise what we talked about.
There are a lot of open questions, but the big one is what the data model
should be. This has to balance a lot of competing considerations:
- What data have we actually got?
- What do we want to use it for?
- If we crowd-source stuff, what kind of imperfections do we need to cope
with?
- What is technically feasible?
In a way, the data model summarises the essential features of the whole
project, so that's what we focused on today.
Data sets
=========
I'm new to this project, so I'm sure the following list is incomplete and
probably not quite accurate! Please correct and append.
- Country and Regional Analysis (CRA) - All UK government spending, broken
down into about 26,000 rows by about ten years. The rows are distinguished
by a number of classifiers, including (where available) department,
region, and function. Apart from year, the classifiers do not seem to be
orthogonal.
- UK Local Spending Review - All spending by UK local authorities. The
spending is broken down first by the local authority responsible, then by
function. I'm not sure how many years' data we have.
- Public Expenditure Statistical Analysis (PESA) - The highest-level
breakdown of Government spending. Similar to the CRA but across countries
and at a coarser granularity. This data set was used for the prototype at
http://www.wheredoesmymoneygo.org/prototype/ .
- Supplementary Budget Information - Can somebody fill in a description
here?
- More?
Exhaustive and non-overlapping
------------------------------
The CRA, the PESA, and the UK Local Spending Review all have the following
common features:
- Each data set, *taken individually*, is exhaustive for some area of
spending. No money was spent that does not appear somewhere in the list.
- Each data set, *taken individually*, consists of non-overlapping items
of expenditure. No item of spending appears twice in the list.
Therefore, for each data set, if we add up all the items we obtain the
correct figure for the total spending. Similarly, if we add up all the
spending for a particular region, department, or function, then we obtain
the correct total for that region, department, or function.
These properties are important. A data set with these properties is
useful; you can aggregate the data in whatever way makes the point you
want to make. A data set without these properties is considerably less
useful.
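To make these properties concrete, here is a quick Python sketch of the
kind of check they permit. The field names are invented for illustration:

    # With an exhaustive, non-overlapping data set, any grouping of the
    # rows must reproduce the published totals.
    from collections import defaultdict

    def totals_by(items, classifier):
        """Sum the spending, grouped by one classifier, e.g. 'region'."""
        totals = defaultdict(float)
        for item in items:
            totals[item[classifier]] += item["amount"]
        return dict(totals)

    items = [
        {"region": "London",   "function": "Education", "amount": 100.0},
        {"region": "Cornwall", "function": "Education", "amount": 40.0},
        {"region": "London",   "function": "Welfare",   "amount": 60.0},
    ]
    assert sum(i["amount"] for i in items) == 200.0      # grand total
    assert totals_by(items, "region") == {"London": 160.0, "Cornwall": 40.0}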
Many data sets
--------------
Our data sets are exhaustive and non-overlapping taken individually, but
taken together they are not. There is no reliable way to combine them into
a single data set that is exhaustive and non-overlapping.
I reckon this might be our *primary obstacle*.
It would be lovely to be able to present a single, unified view of all
government spending, and gradually extend it and refine it with new data
until it meets all conceivable needs. We should certainly strive in that
direction, but it is already clear that the previous sentence is not an
accurate description of the project at *any* point in the future. We will
always have multiple data sets, probably many, and we must design for
that.
Non-orthogonal classifiers
--------------------------
Rufus has made an excellent observation about the nature of the data sets
we have: apart from year, the classifiers applied to the government
spending items (e.g. department, region, function) are *not orthogonal*.
It would be nice to be able to make a pivot table with one axis for
department, one axis for region, one axis for function, and one axis for
year. We would then distribute the items of spending into the cells of
this big four-dimensional table, and compute totals along every axis.
However, this is not a model that fits the data well; we cannot
realistically do this.
There are essentially four problems with the pivot-table model:
- Most of the cells and totals would be zero.
- Different data sets require different axes.
- Not all classifiers are defined for all spending items.
- We will want to add axes to existing data sets. Indeed, I think we
should consider crowd-sourcing new classifiers for existing data.
So, if a pivot table is the wrong model, what is right? I think the WDMMG
prototype has the right model: one axis for year, crossed with a tree
structure.
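To illustrate what I mean (my own sketch in Python, not the prototype's
actual code):

    # One axis for year, crossed with a tree structure: leaves hold an
    # amount per year, and internal nodes sum their children.
    def node(name, children=(), amounts=None):
        return {"name": name, "children": list(children),
                "amounts": amounts or {}}   # amounts maps year -> spending

    def total(n, year):
        if n["children"]:
            return sum(total(child, year) for child in n["children"])
        return n["amounts"].get(year, 0.0)

    tree = node("Government", [
        node("Education", amounts={2008: 80.0, 2009: 90.0}),
        node("Welfare",   amounts={2008: 120.0, 2009: 130.0}),
    ])
    assert total(tree, 2009) == 220.0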
Aggregation is subjective
-------------------------
The WDMMG prototype presents the spending for a given year as a tree, and
provides a lovely interface for browsing it. The question arises: what
tree structure to use?
I posit that the answer depends on what you want to use the data for. For
example, if you work for a lobby group that wants to argue that London
gets preferential treatment, then you might want a tree like this:
+- Spending not attached to any region
|
+- Welfare
| |
| +- London
| +- Cornwall
| +- North East
|
+- Civil engineering projects
| |
| +- Motorways
| | |
| | +- London
| | +- Cornwall
| | +- North East
| |
| +- Sports facilities
| | |
| | +- Olympics
| | |
| | +- Other London
| | +- Cornwall
| | +- North East
| |
etc...
However, if you're arguing for more spending on education, you might want
a completely different tree:
+- Education
| |
| +- Primary
| +- Secondary
| +- Higher
|
+- Government debt
|
+- Military
| |
| +- War in Iraq
| +- War in Afghanistan
| +- Trident
|
+- NHS
| |
etc...
The OKF itself might want to provide a view of the data in which the
top-level breakdown is by region. This would serve an audience of local
journalists, campaigners and lobbyists, who care only about their own
region. That would require a different tree structure again.
At the most detailed level, these trees have the same spending items at
their leaves, and as such they are different presentations of the same
data. However, the structures of the trees are quite different.
I think we should consider letting users define their own tree structures.
I think it would be valuable to provide a service where anybody can go to
our site, take our data, and present it the way they want it. They can
then get a URL which they can publish on their own site.
Some will want to link to the OKF site, in order to get the credibility
associated with the OKF branding (i.e. yes, this data is real and we
didn't just make it up). Others might want merely to embed browsable
diagrams generated by us into their own content. However, this is a
presentational distinction; the impact on the data model is that we should
not impose any particular tree, but should instead allow users to define
their own.
Disaggregation is impossible
----------------------------
Unfortunately, many of the data sets we receive will already have been
aggregated. We may also expect to receive the same data aggregated in two
or more different ways. I think Rufus said that the PESA data set is an
example.
It is basically impossible to disaggregate a data set. If you know the
total spending for each region, for example, and also the total spending
for each department, it is not possible to recover the spending of a
particular department in a particular region.
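A toy example shows why. Here are two quite different breakdowns that
nevertheless have identical totals by department and by region:

    # Keys are (department, region) pairs; values are amounts.
    t1 = {("X", "A"): 10, ("X", "B"): 0, ("Y", "A"): 0, ("Y", "B"): 10}
    t2 = {("X", "A"): 5,  ("X", "B"): 5, ("Y", "A"): 5, ("Y", "B"): 5}

    def marginal(table, axis):   # axis 0 = department, axis 1 = region
        out = {}
        for key, amount in table.items():
            out[key[axis]] = out.get(key[axis], 0) + amount
        return out

    assert marginal(t1, 0) == marginal(t2, 0)   # same department totals
    assert marginal(t1, 1) == marginal(t2, 1)   # same region totals
    assert t1 != t2                             # but different spending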
This is another reason why we are going to have to manage multiple data
sets. However, this phenomenon is more subtle. If we have two data sets,
and we know that they are aggregations of the same (unknown) underlying
data set, then the two data sets are partially redundant.
For example, suppose we have one data set that breaks down spending by
department and function, and suppose we have another that breaks the same
spending down by region and function. Then we have two ways of
calculating the spending broken down by function alone. We can either
start from the first data set and sum over department, or we can start
from the second data set and sum over region. The answers should be
identical.
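That check is easy to state in code. A sketch, with invented numbers:

    # Two aggregations of the same underlying spending must agree on any
    # classifier they share -- 'function', in this case.
    by_dept_and_function = {
        ("Health",  "Staff"): 70.0, ("Health", "Buildings"): 30.0,
        ("Defence", "Staff"): 50.0,
    }
    by_region_and_function = {
        ("London",   "Staff"): 80.0, ("London", "Buildings"): 30.0,
        ("Cornwall", "Staff"): 40.0,
    }

    def by_function(table):
        out = {}
        for (_, function), amount in table.items():
            out[function] = out.get(function, 0.0) + amount
        return out

    assert by_function(by_dept_and_function) == by_function(by_region_and_function)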
It feels like the Right Thing To Do to attempt to represent this sort of
redundancy in the data model, when we know it exists. Alternatively, we
could just give up, and model the data sets as completely independent.
Accounting
==========
Rufus has an excellent long-term goal of munging all the data we can get
into a proper set of accounts. To be specific, we want to use double-entry
bookkeeping.
The data model for this kind of accounting is well known, and has been in
use by accountants all over the world for centuries, which is obviously a
plus. However, Rufus, John and I discovered that we all used different
terminology for the same concepts. Therefore, I think it is worth
writing out the data model explicitly, as if it were unfamiliar.
There are three database tables:
- Accounts. An account is a place where money can be. An example of an
account is an ordinary bank account. Another example is your wallet.
Accounts have some sort of ownership. Ownership is broadly interpreted,
and may encompass things like earmarking for a particular purpose. It is
also possible to have an account that "contains" money that is owed to
you, but which you have not yet received. It is a flexible concept.
- Transactions (which John called "events"). A transaction is an event
that causes money to move around between accounts. An example of a
transaction is buying something. Another example is signing a contract
(since that can create a debt). Transactions have a date.
- Postings (which John called "entries"). A posting records the effect of
one transaction on one account. Postings are the items that show up on a
bank statement. A single transaction always has postings on at least two
accounts (e.g. source and destination) and those postings always add to
zero. Postings have an amount.
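In code, the three tables might look like the following sketch (the names
and types are my choices, not anything we agreed on). The later sketches
in this post reuse these classes:

    # The three double-entry tables as Python classes.
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Account:
        name: str            # e.g. "HM Treasury", or your wallet

    @dataclass
    class Posting:
        account: Account
        amount: float        # positive or negative

    @dataclass
    class Transaction:
        date: str            # ISO date, e.g. "2009-04-01"
        postings: list = field(default_factory=list)

        def is_balanced(self):
            # The defining invariant: the postings of one transaction,
            # across at least two accounts, always sum to zero.
            return abs(sum(p.amount for p in self.postings)) < 1e-9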
Statements and balances
-----------------------
A "statement" is a list of all the postings against a particular account.
If the account is closed (e.g. because we are studying historical accounts
of some organisation that no longer exists) then the postings should all
add up to zero. If the account is still open then the total of all the
postings so far is the "current balance" of the account. As an
optimisation, the current balance is often stored as a field in the
account table, but it is formally redundant with the postings.
It is also possible to work out the balance of an account at any earlier
time, by summing the postings up to that time. By working out the balance
at the beginning and end of some period of time (e.g. a year), it is
possible to draw up a statement for just that period. The statement
consists of the initial balance, the postings during the period, and the
final balance. The total of the postings should be equal to the difference
between the final and initial balances.
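A sketch, assuming the classes above (and ISO date strings, so that
string comparison is date comparison):

    # Balances and period statements, derived purely from postings.
    def balance(account, transactions, up_to):
        """The balance of an account just before a given date."""
        return sum(p.amount
                   for t in transactions if t.date < up_to
                   for p in t.postings if p.account == account)

    def statement(account, transactions, start, end):
        opening = balance(account, transactions, start)
        postings = [p for t in transactions if start <= t.date < end
                    for p in t.postings if p.account == account]
        closing = opening + sum(p.amount for p in postings)
        # closing - opening always equals the total of the postings.
        return opening, postings, closing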
We are likely to receive data sets in the form of statements for
particular periods. Also, it would be useful to publish statements in that
form.
Aggregation
-----------
Accounts can be aggregated. To merge two accounts, interleave their
postings in date order. The effect is to add the two balances.
Transactions can also be aggregated. To merge two transactions, form the
union of their sets of postings. Postings to the same account can then be
merged by adding their amounts. Note that it is possible in principle to
aggregate transactions even if they are not between the same accounts;
this is not common.
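Sketches of both operations, again assuming the classes above (the merged
transaction's date is an arbitrary choice of mine):

    # The two kinds of aggregation.
    def merge_accounts(keep, drop, transactions):
        """Redirect drop's postings to keep; the balances simply add."""
        for t in transactions:
            for p in t.postings:
                if p.account == drop:
                    p.account = keep

    def merge_transactions(t1, t2):
        """Union the postings; postings to the same account then merge
        by adding their amounts."""
        combined = {}
        for p in t1.postings + t2.postings:
            combined[p.account] = combined.get(p.account, 0.0) + p.amount
        return Transaction(date=min(t1.date, t2.date),
                           postings=[Posting(a, amt)
                                     for a, amt in combined.items()])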
Thus, for double-entry bookkeeping there are two distinct kinds of
aggregation. We are likely to receive data with aggregated accounts and
transactions, and we will probably want to publish data in that form too.
Mapping our data sets onto double-entry bookkeeping
---------------------------------------------------
The CRA, the PESA and the Spending Review data sets are not in
double-entry bookkeeping form. They can be mapped onto it in at least the
following ways:
1a. We could make a single account for "the government", representing all
the money available for spending. We could then make one account for every
combination of classifiers (other than year), representing all the money
that has already been spent on a particular thing. Then, each item of
spending could become a transaction from the government to somewhere else.
1b. We could choose one of the tree structures in the section "Aggregation
is subjective" above. We could then make an account for each node in the
tree. Then, each item of spending could become a set of transactions each
representing one step along the path down the tree from "the government"
(the root of the tree) to one of the leaves.
2. We could make an account for every combination of classifiers,
representing the money earmarked for spending on a particular thing. We
could then make a single account for "the rest of the world", representing
all the money that has been spent. Then, each item of spending could
become a transaction from one of the earmarking accounts to the rest of
the world. This is the dual of 1a.
3. We could choose one of the classifiers, e.g. department, and make an
account for each department. We could then choose another classifier, e.g.
function, and make an account for each function. Then each item of
spending could become a transaction from a particular department to a
particular function.
However, these mappings are obviously all wrong. 1a would merely construct
an obfuscated version of the original data set, while teaching nobody
anything. 1b would fail due to the subjectivity of the tree structure; we
could not do it in a way that would work for everybody. 2 is the dual of
1a, and has the same problem, in addition to being rather unintuitive. 3 is
doubly subjective, since we have to choose two classifiers from many.
Furthermore, information in the classifiers that we do not choose is
largely destroyed by mapping 3.
The problem here is that the mapping is ill-posed: we are looking at it
the wrong way around.
Mapping double-entry bookkeeping onto our data sets
---------------------------------------------------
If we did manage somehow to construct a full set of accounts for the whole
of government, we should in principle be able to reconstruct the data sets
we have, i.e. the CRA, the PESA and the Spending Review.
To do this for e.g. the CRA, we would have to identify the boundary
between the set of accounts which the CRA treats as "inside" the
government, and the set of accounts which it treats as "outside" the
government. Then, we would aggregate all the "inside" accounts together
into one account called "the government", and aggregate all the
"outside" accounts together into one account called "the rest of the
world". The transactions which cross the boundary would end up with equal
and opposite postings in each of the two accounts. If we then make a
statement for "the government", and erase the intial and final balances,
we would recover the CRA data set.
To spell it out:
- The spending items in data sets are *postings*.
- The data sets have already been severely aggregated over *accounts*,
though they retain good information about transactions.
- The classifiers in the data sets are properties of *transactions*.
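Here is a sketch of that reconstruction, assuming the classes from the
Accounting section and a hypothetical predicate saying which accounts the
data set treats as inside the government:

    # Recover a CRA-like list of spending items from full accounts.
    def reconstruct(transactions, is_inside):
        rows = []
        for t in transactions:
            # Net effect of this transaction on the merged "government"
            # account; purely internal transfers cancel out and vanish.
            net = sum(p.amount for p in t.postings
                      if is_inside(p.account))
            if net != 0:
                # Money leaving government appears as positive spending.
                rows.append({"date": t.date, "amount": -net})
        return rows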
We could then do the same thing for the UK Local Spending Review data.
Again we would have to separate the "inside" accounts from the "outside"
accounts. We might find that the boundary is in a different place. Then,
when we try to reconstruct the PESA data set we might discover yet another
accounting boundary.
Implications for the data model
-------------------------------
I fully expect to discover that the three accounting boundaries (one for
each of our data sets) are all different (though this is something we need
to investigate). In other words, we have available to us three different
aggregations of the same accounts.
This is a pity, because, as explained above, disaggregation is impossible.
Each of these data sets is a *constraint* on the full double-entry
bookkeeping accounts, but no matter how many constraints of this form we
acquire, we will not be able to reconstruct the original accounts.
If we do want to aim for a double-entry bookkeeping model, we need to try
to get data of a fundamentally different form. I'm not sure what it needs
to look like - this stuff is really hard to think about!
I think this reasoning also applies to crowd-sourced accounting
information. I have doubts that we can get any useful information about
transactions and accounts from the crowd at all. I would be delighted to
be persuaded otherwise.
So, while I agree with Rufus that double-entry bookkeeping is a good
target to aim for, I can't currently see any way of attaining it.
Therefore, I think we should aim lower, at least at first. It is only fair
to say that Rufus and I have not yet agreed on this!
A possible data model
=====================
(This is now partly beyond the scope of what we discussed in the meeting,
but only by my filling in some details).
I'm not going to pretend that I have everything worked out, but let me
construct a straw man. Let's represent our (many) data sets pretty much in
the form that we receive them. Let's abandon for the moment the ambition
to relate different data sets to each other. Let us instead aim to provide
the following functionality:
- A user should be able to upload a complete new data set, which must
satisfy the "exhaustive" and "non-overlapping" properties. They should be
able to name it and explain what it is. They should get a (public) URL for
it, and it should be added to an index.
- A user should be able to define new classifiers for an existing data
set. They should be able to go through a (possibly partial) list of
spending items, and for each one say "that goes in such-and-such a
category". They should then be able to save their list, name it, and
explain what it is. They should get a (public) URL for it, and it should
be added to an index.
- A user should be able to define a tree structure for an existing data
set. For each node of the tree, they need to specify which spending items
are aggregated together to make the node. They do this by defining simple
rules based on the classifier data. Having constructed a valid tree, they
should be able to name it and explain what it is. They should get a
(public) URL for it, and it should be added to an index.
- Given a data set and a tree structure, we should be able to make a
customised site that looks a bit like the WDMMG prototype. We should offer
this both as a stand-alone web page with OKF branding, and also as an
embeddable spending browser for inclusion in other web pages.
- Given a data set and a set of classifiers, we should be able to dump
out a CSV file with one column for each classifier, one column for date
and one column for amount. We should probably have a primary key column
too. (There is a sketch of this below, after the list.)
- We should provide whatever other trendy interfaces are easy, e.g. a
SPARQL end point. I'm going to go to this conference next week to get some
ideas: http://www.meetup.com/Web-Of-Data/
I don't know if this feature set is realistic, or if it is consistent with
the goals of the project. Please comment liberally.
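For the CSV dump, I am imagining something like this sketch (the column
names are invented):

    # Dump a data set as CSV: primary key, one column per classifier,
    # then date and amount.
    import csv, sys

    def dump(items, classifiers, out=sys.stdout):
        writer = csv.writer(out)
        writer.writerow(["id"] + classifiers + ["date", "amount"])
        for item in items:
            writer.writerow([item["id"]]
                            + [item.get(c, "") for c in classifiers]
                            + [item["date"], item["amount"]])

    dump([{"id": 1, "region": "London", "function": "Welfare",
           "date": "2009-04-01", "amount": 60.0}],
         ["region", "function"])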
Table structure
---------------
My straw man data model, which is based in part on something Rufus put
together, has the following tables:
- DataSet. Each row represents one whole data set. There are columns for
meta-data about the data set: name, description, origin.
- SpendingItem. Each row represents one item of spending. There are
columns for: DataSet key, date, amount. Alternative name: Posting.
- Classifier. Each row represents a way of categorising the SpendingItems
in a particular data set. For example, for the CRA, there would be three
rows: Region, Department, and Function. There are columns for: DataSet
key, name, description, origin.
- SpendingAttribute. Each row represents the statement that a particular
SpendingItem is placed in a particular class by a particular Classifier.
There are columns for: SpendingItem key, Classifier key, class (i.e.
classifier value). This is conceptually a triple space. We have a
constraint that the Classifier and the SpendingItem belong to the same
DataSet.
- Tree. Each row represents a tree structure for some data set. There are
columns for meta-data about the tree: DataSet key, name, description,
origin.
- Aggregate. Each row represents a node in a tree. There are columns for
meta-data about the node: Tree key, name, description, colour(?). There is
also a column for the parent Aggregate key. Alternative name: TreeNode.
- AggregateAttribute. Each row represents a rule that a SpendingItem must
obey in order to be included in the Aggregate. There are columns for:
Aggregate key, Classifier key, class (i.e. classifier value). We have a
constraint that the Classifier and the Aggregate must belong to the same
DataSet.
The rule for deciding whether a given SpendingItem is included in a given
Aggregate is that (1) for all AggregateAttributes of the Aggregate, the
SpendingItem must have a matching SpendingAttribute, and (2) it must be
included in the Aggregate's parent, if any. (I'm not sure this is the best
table structure, but it's one that works, I think).
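To state that rule as code, a sketch over in-memory rows rather than real
tables:

    # Is a given SpendingItem included in a given Aggregate?
    def included(item_attrs, aggregate):
        """item_attrs: {classifier: class} -- the SpendingAttributes of
        one SpendingItem.  aggregate: {'rules': {classifier: class},
        'parent': aggregate or None} -- its AggregateAttributes."""
        for classifier, cls in aggregate["rules"].items():
            if item_attrs.get(classifier) != cls:
                return False          # (1) every rule must match, and
        if aggregate["parent"] is not None:
            return included(item_attrs, aggregate["parent"])  # (2) the parent's too
        return True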
It is not clear just from the table structure what the user interface
might look like. The users are clearly going to need something rather
higher-level than merely a way of editing the records in the tables. I
will leave this problem open.
Summary
=======
I have tried to explain what kind of data we've got. The three data sets I
have encountered so far all seem to have roughly the same structure, which
is pleasant. However, this structure is not wholly compatible with our
goals. We would like to unify the data sets into a proper set of accounts
but I (personally) don't think it's feasible at the moment.
What we can do relatively easily is to construct a customisable framework
for presenting the data sets. I am imagining a straightforward
generalisation of the WDMMG prototype to support multiple (unrelated) data
sets, crowd-sourced classifiers, and crowd-sourced tree structures. I have
described a data model that could serve for this reduced ambition.
I'm sure Rufus will correct me if I have misrepresented him or if
I have forgotten something.
Alistair