[wdmmg-discuss] A data model
Alistair Turnbull
apt1002 at goose.minworks.co.uk
Tue Feb 16 04:19:10 UTC 2010
De-lurk
=======
Hello! I am a contractor, and I hope I will be working on the Where Does
My Money Go project for the next few months. My skills are in software
engineering.
Meeting today
=============
Earlier today I met Rufus Pollock and John Bywater. Our goal was to work
out what kind of store we're going to build for the data that we have, and
for the data that we expect to get in the future. In the rest of this post
I'm going to try to summarise what we talked about.
There are a lot of open questions, but the big one is what the data model
should be. This has to balance a lot of competing considerations:
- What data have we actually got?
- What do we want to use it for?
- If we crowd-source stuff, what kind of imperfections do we need to cope
with?
- What is technically feasible?
In a way, the data model summarises the essential features of the whole
project, so that's what we focused on today.
Data sets
=========
I'm new to this project, so I'm sure the following list is incomplete and
probably not quite accurate! Please correct and append.
- Country and Regional Analysis (CRA) - All UK government spending, broken
down into about 26,000 rows by about ten years. The rows are distinguished
by a number of classifiers, including (where available) department,
region, and function. Apart from year, the classifiers do not seem to be
orthogonal.
- UK Local Spending Review - All spending by UK local authorities. The
spending is broken down first by the local authority responsible, then by
function. I'm not sure how many years' data we have.
- Public Expenditure Statistical Analysis (PESA) - The highest-level
breakdown of Government spending. Similar to the CRA but across countries
and at a coarser granularity. This data set was used for the prototype at
http://www.wheredoesmymoneygo.org/prototype/ .
- Supplementary Budget Information - Can somebody fill in a description
here?
- More?
Exhaustive and non-overlapping
------------------------------
The CRA, the PESA, and the UK Local Spending Review all have the following
common features:
- Each data set, *taken individually*, is exhaustive for some area of
spending. No money was spent that does not appear somewhere in the list.
- Each data set, *taken individually*, consists of non-overlapping items
of expenditure. No item of spending appears twice in the list.
Therefore, for each data set, if we add up all the items we obtain the
correct figure for the total spending. Similarly, if we add up all the
spending for a particular region, department, or function, then we obtain
the correct total for that region, department, or function.
These properties are important. A data set with these properties is
useful; you can aggregate the data in whatever way makes the point you
want to make. A data set without these properties is considerably less
useful.
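To make these properties concrete, here is a quick Python sketch of the
kind of check they permit. The field names are invented for illustration:

    # With an exhaustive, non-overlapping data set, any grouping of the
    # rows must reproduce the published totals.
    from collections import defaultdict

    def totals_by(items, classifier):
        """Sum the spending, grouped by one classifier, e.g. 'region'."""
        totals = defaultdict(float)
        for item in items:
            totals[item[classifier]] += item["amount"]
        return dict(totals)

    items = [
        {"region": "London",   "function": "Education", "amount": 100.0},
        {"region": "Cornwall", "function": "Education", "amount": 40.0},
        {"region": "London",   "function": "Welfare",   "amount": 60.0},
    ]
    assert sum(i["amount"] for i in items) == 200.0      # grand total
    assert totals_by(items, "region") == {"London": 160.0, "Cornwall": 40.0}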
Many data sets
--------------
Our data sets are exhaustive and non-overlapping taken individually, but
taken together they are not. There is no reliable way to combine them into
a single data set that is exhaustive and non-overlapping.
I reckon this might be our *primary obstacle*.
It would be lovely to be able to present a single, unified view of all
government spending, and gradually extend it and refine it with new data
until it meets all conceivable needs. We should certainly strive in that
direction, but it is already clear that the previous sentence is not an
accurate description of the project at *any* point in the future. We will
always have multiple data sets, probably many, and we must design for
that.
Non-orthogonal classifiers
--------------------------
Rufus has made an excellent observation about the nature of the data sets
we have: apart from year, the classifiers applied to the government
spending items (e.g. department, region, function) are *not orthogonal*.
It would be nice to be able to make a pivot table with one axis for
department, one axis for region, one axis for function, and one axis for
year. We would then distribute the items of spending into the cells of
this big four-dimensional table, and compute totals along every axis.
However, this is not a model that fits the data well; we cannot
realistically do this.
There are essentially four problems with the pivot-table model:
- Most of the cells and totals would be zero.
- Different data sets require different axes.
- Not all classifiers are defined for all spending items.
- We will want to add axes to existing data sets. Indeed, I think we
should consider crowd-sourcing new classifiers for existing data.
So, if a pivot table is the wrong model, what is right? I think the WDMMG
prototype has the right model: one axis for year, crossed with a tree
structure.
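To illustrate what I mean (my own sketch in Python, not the prototype's
actual code):

    # One axis for year, crossed with a tree structure: leaves hold an
    # amount per year, and internal nodes sum their children.
    def node(name, children=(), amounts=None):
        return {"name": name, "children": list(children),
                "amounts": amounts or {}}   # amounts maps year -> spending

    def total(n, year):
        if n["children"]:
            return sum(total(child, year) for child in n["children"])
        return n["amounts"].get(year, 0.0)

    tree = node("Government", [
        node("Education", amounts={2008: 80.0, 2009: 90.0}),
        node("Welfare",   amounts={2008: 120.0, 2009: 130.0}),
    ])
    assert total(tree, 2009) == 220.0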
Aggregation is subjective
-------------------------
The WDMMG prototype presents the spending for a given year as a tree, and
provides a lovely interface for browsing it. The question arises: what
tree structure to use?
I posit that the answer depends on what you want to use the data for. For
example, if you work for a lobby group that wants to argue that London
gets preferential treatment, then you might want a tree like this:
+- Spending not attached to any region
|
+- Welfare
| |
| +- London
| +- Cornwall
| +- North East
|
+- Civil engineering projects
| |
| +- Motorways
| | |
| | +- London
| | +- Cornwall
| | +- North East
| |
| +- Sports facilities
| | |
| | +- Olympics
| | |
| | +- Other London
| | +- Cornwall
| | +- North East
| |
etc...
However, if you're arguing for more spending on education, you might want
a completely different tree:
+- Education
| |
| +- Primary
| +- Secondary
| +- Higher
|
+- Government debt
|
+- Military
| |
| +- War in Iraq
| +- War in Afghanistan
| +- Trident
|
+- NHS
| |
etc...
The OKF itself might want to provide a view of the data in which the
top-level breakdown is by region. This would serve an audience of local
journalists, campaigners and lobbyists, who care only about their own
region. That would require a different tree structure again.
At the most detailed level, these trees have the same spending items at
their leaves, and as such they are different presentations of the same
data. However, the structures of the trees are quite different.
I think we should consider letting users define their own tree structures.
I think it would be valuable to provide a service where anybody can go to
our site, take our data, and present it the way they want it. They can
then get a URL which they can publish on their own site.
Some will want to link to the OKF site, in order to get the credibility
associated with the OKF branding (i.e. yes, this data is real and we
didn't just make it up). Others might want merely to embed browsable
diagrams generated by us into their own content. However, this is a
presentational distinction; the impact on the data model is that we should
not impose any particular tree, but should instead allow users to define
their own.
Disaggregation is impossible
----------------------------
Unfortunately, many of the data sets we receive will already have been
aggregated. We may also expect to receive the same data aggregated in two
or more different ways. I think Rufus said that the PESA data set is an
example.
It is basically impossible to disaggregate a data set. If you know the
total spending for each region, for example, and also the total spending
for each department, it is not possible to recover the spending of a
particular department in a particular region.
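A toy example shows why. Here are two quite different breakdowns that
nevertheless have identical totals by department and by region:

    # Keys are (department, region) pairs; values are amounts.
    t1 = {("X", "A"): 10, ("X", "B"): 0, ("Y", "A"): 0, ("Y", "B"): 10}
    t2 = {("X", "A"): 5,  ("X", "B"): 5, ("Y", "A"): 5, ("Y", "B"): 5}

    def marginal(table, axis):   # axis 0 = department, axis 1 = region
        out = {}
        for key, amount in table.items():
            out[key[axis]] = out.get(key[axis], 0) + amount
        return out

    assert marginal(t1, 0) == marginal(t2, 0)   # same department totals
    assert marginal(t1, 1) == marginal(t2, 1)   # same region totals
    assert t1 != t2                             # but different spending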
This is another reason why we are going to have to manage multiple data
sets. However, this phenomenon is more subtle. If we have two data sets,
and we know that they are aggregations of the same (unknown) underlying
data set, then the two data sets are partially redundant.
For example, suppose we have one data set that breaks down spending by
department and function, and suppose we have another that breaks the same
spending down by region and function. Then we have two ways of
calculating the spending broken down by function alone. We can either
start from the first data set and sum over department, or we can start
from the second data set and sum over region. The answers should be
identical.
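That check is easy to state in code. A sketch, with invented numbers:

    # Two aggregations of the same underlying spending must agree on any
    # classifier they share -- 'function', in this case.
    by_dept_and_function = {
        ("Health",  "Staff"): 70.0, ("Health", "Buildings"): 30.0,
        ("Defence", "Staff"): 50.0,
    }
    by_region_and_function = {
        ("London",   "Staff"): 80.0, ("London", "Buildings"): 30.0,
        ("Cornwall", "Staff"): 40.0,
    }

    def by_function(table):
        out = {}
        for (_, function), amount in table.items():
            out[function] = out.get(function, 0.0) + amount
        return out

    assert by_function(by_dept_and_function) == by_function(by_region_and_function)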
It feels like the Right Thing To Do to attempt to represent this sort of
redundancy in the data model, when we know it exists. Alternatively, we
could just give up, and model the data sets as completely independent.
Accounting
==========
Rufus has an excellent long-term goal of munging all the data we can get
into a proper set of accounts. To be specific, we want to use double-entry
bookkeeping.
The data model for this kind of accounting is well known, and has been in
use by accountants all over the world for centuries, which is obviously a
plus. However, Rufus, John and I discovered that we all used different
terminology for the same concepts. Therefore, I think it is worth
writing out the data model explicitly, as if it were unfamiliar.
There are three database tables:
- Accounts. An account is a place where money can be. An example of an
account is an ordinary bank account. Another example is your wallet.
Accounts have some sort of ownership. Ownership is broadly interpreted,
and may encompass things like earmarking for a particular purpose. It is
also possible to have an account that "contains" money that is owed to
you, but which you have not yet received. It is a flexible concept.
- Transactions (which John called "events"). A transaction is an event
that causes money to move around between accounts. An example of a
transaction is buying something. Another example is signing a contract
(since that can create a debt). Transactions have a date.
- Postings (which John called "entries"). A posting records the effect of
one transaction on one account. Postings are the items that show up on a
bank statement. A single transaction always has postings on at least two
accounts (e.g. source and destination) and those postings always add to
zero. Postings have an amount.
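In code, the three tables might look like the following sketch (the names
and types are my choices, not anything we agreed on). The later sketches
in this post reuse these classes:

    # The three double-entry tables as Python classes.
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Account:
        name: str            # e.g. "HM Treasury", or your wallet

    @dataclass
    class Posting:
        account: Account
        amount: float        # positive or negative

    @dataclass
    class Transaction:
        date: str            # ISO date, e.g. "2009-04-01"
        postings: list = field(default_factory=list)

        def is_balanced(self):
            # The defining invariant: the postings of one transaction,
            # across at least two accounts, always sum to zero.
            return abs(sum(p.amount for p in self.postings)) < 1e-9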
Statements and balances
-----------------------
A "statement" is a list of all the postings against a particular account.
If the account is closed (e.g. because we are studying historical accounts
of some organisation that no longer exists) then the postings should all
add up to zero. If the account is still open then the total of all the
postings so far is the "current balance" of the account. As an
optimisation, the current balance is often stored as a field in the
account table, but it is formally redundant with the postings.
It is also possible to work out the balance of an account at any earlier
time, by summing the postings up to that time. By working out the balance
at the beginning and end of some period of time (e.g. a year), it is
possible to draw up a statement for just that period. The statement
consists of the initial balance, the postings during the period, and the
final balance. The total of the postings should be equal to the difference
between the final and initial balances.
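A sketch, assuming the classes above (and ISO date strings, so that
string comparison is date comparison):

    # Balances and period statements, derived purely from postings.
    def balance(account, transactions, up_to):
        """The balance of an account just before a given date."""
        return sum(p.amount
                   for t in transactions if t.date < up_to
                   for p in t.postings if p.account == account)

    def statement(account, transactions, start, end):
        opening = balance(account, transactions, start)
        postings = [p for t in transactions if start <= t.date < end
                    for p in t.postings if p.account == account]
        closing = opening + sum(p.amount for p in postings)
        # closing - opening always equals the total of the postings.
        return opening, postings, closing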
We are likely to receive data sets in the form of statements for
particular periods. Also, it would be useful to publish statements in that
form.
Aggregation
-----------
Accounts can be aggregated. To merge two accounts, interleave their
postings in date order. The effect is to add the two balances.
Transactions can also be aggregated. To merge two transactions, form the
union of their sets of postings. Postings to the same account can then be
merged by adding their amounts. Note that it is possible in principle to
aggregate transactions even if they are not between the same accounts;
this is not common.
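Sketches of both operations, again assuming the classes above (the merged
transaction's date is an arbitrary choice of mine):

    # The two kinds of aggregation.
    def merge_accounts(keep, drop, transactions):
        """Redirect drop's postings to keep; the balances simply add."""
        for t in transactions:
            for p in t.postings:
                if p.account == drop:
                    p.account = keep

    def merge_transactions(t1, t2):
        """Union the postings; postings to the same account then merge
        by adding their amounts."""
        combined = {}
        for p in t1.postings + t2.postings:
            combined[p.account] = combined.get(p.account, 0.0) + p.amount
        return Transaction(date=min(t1.date, t2.date),
                           postings=[Posting(a, amt)
                                     for a, amt in combined.items()])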
Thus, for double-entry bookkeeping there are two distinct kinds of
aggregation. We are likely to receive data with aggregated accounts and
transactions, and we will probably want to publish data in that form too.
Mapping our data sets onto double-entry bookkeeping
---------------------------------------------------
The CRA, the PESA and the Spending Review data sets are not in
double-entry bookkeeping form. They can be mapped onto it in at least the
following ways:
1a. We could make a single account for "the government", representing all
the money available for spending. We could then make one account for every
combination of classifiers (other than year), representing all the money
that has already been spent on a particular thing. Then, each item of
spending could become a transaction from the government to somewhere else.
1b. We could choose one of the tree structures in the section "Aggregation
is subjective" above. We could then make an account for each node in the
tree. Then, each item of spending could become a set of transactions each
representing one step along the path down the tree from "the government"
(the root of the tree) to one of the leaves.
2. We could make an account for every combination of classifiers,
representing the money earmarked for spending on a particular thing. We
could then make a single account for "the rest of the world", representing
all the money that has been spent. Then, each item of spending could
become a transaction from one of the earmarking accounts to the rest of
the world. This is the dual of 1a.
3. We could choose one of the classifiers, e.g. department, and make an
account for each department. We could then choose another classifier, e.g.
function, and make an account for each function. Then each item of
spending could become a transaction from a particular department to a
particular function.
However, these mappings are obviously all wrong. 1a would merely construct
an obfuscated version of the original data set, while teaching nobody
anything. 1b would fail due to the subjectivity of the tree structure; we
could not do it in a way that would work for everybody. 2 is the dual of
1a, and has the same problem, in addition to being rather unintuitive. 3 is
doubly subjective, since we have to choose two classifiers from many.
Furthermore, information in the classifiers that we do not choose is
largely destroyed by mapping 3.
The problem here is that the mapping is ill-posed: we are looking at it
the wrong way around.
Mapping double-entry bookkeeping onto our data sets
---------------------------------------------------
If we did manage somehow to construct a full set of accounts for the whole
of government, we should in principle be able to reconstruct the data sets
we have, i.e. the CRA, the PESA and the Spending Review.
To do this for e.g. the CRA, we would have to identify the boundary
between the set of accounts which the CRA treats as "inside" the
government, and the set of accounts which it treats as "outside" the
government. Then, we would aggregate all the "inside" accounts together
into one account called "the government", and aggregate all the
"outside" accounts together into one account called "the rest of the
world". The transactions which cross the boundary would end up with equal
and opposite postings in each of the two accounts. If we then make a
statement for "the government", and erase the intial and final balances,
we would recover the CRA data set.
To spell it out:
- The spending items in data sets are *postings*.
- The data sets have already been severely aggregated over *accounts*,
though they retain good information about transactions.
- The classifiers in the data sets are properties of *transactions*.
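Here is a sketch of that reconstruction, assuming the classes from the
Accounting section and a hypothetical predicate saying which accounts the
data set treats as inside the government:

    # Recover a CRA-like list of spending items from full accounts.
    def reconstruct(transactions, is_inside):
        rows = []
        for t in transactions:
            # Net effect of this transaction on the merged "government"
            # account; purely internal transfers cancel out and vanish.
            net = sum(p.amount for p in t.postings
                      if is_inside(p.account))
            if net != 0:
                # Money leaving government appears as positive spending.
                rows.append({"date": t.date, "amount": -net})
        return rows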
We could then do the same thing for the UK Local Spending Review data.
Again we would have to separate the "inside" accounts from the "outside"
accounts. We might find that the boundary is in a different place. Then,
when we try to reconstruct the PESA data set we might discover yet another
accounting boundary.
Implications for the data model
-------------------------------
I fully expect to discover that the three accounting boundaries (one for
each of our data sets) are all different (though this is something we need
to investigate). In other words, we have available to us three different
aggregations of the same accounts.
This is a pity, because, as explained above, disaggregation is impossible.
Each of these data sets is a *constraint* on the full double-entry
bookkeeping accounts, but no matter how many constraints of this form we
acquire, we will not be able to reconstruct the original accounts.
If we do want to aim for a double-entry bookkeeping model, we need to try
to get data of a fundamentally different form. I'm not sure what it needs
to look like - this stuff is really hard to think about!
I think this reasoning also applies to crowd-sourced accounting
information. I have doubts that we can get any useful information about
transactions and accounts from the crowd at all. I would be delighted to
be persuaded otherwise.
So, while I agree with Rufus that double-entry bookkeeping is a good
target to aim for, I can't currently see any way of attaining it.
Therefore, I think we should aim lower, at least at first. It is only fair
to say that Rufus and I have not yet agreed on this!
A possible data model
=====================
(This is now partly beyond the scope of what we discussed in the meeting,
but only by my filling in some details).
I'm not going to pretend that I have everything worked out, but let me
construct a straw man. Let's represent our (many) data sets pretty much in
the form that we receive them. Let's abandon for the moment the ambition
to relate different data sets to each other. Let us instead aim to provide
the following functionality:
- A user should be able to upload a complete new data set, which must
satisfy the "exhaustive" and "non-overlapping" properties. They should be
able to name it and explain what it is. They should get a (public) URL for
it, and it should be added to an index.
- A user should be able to define new classifiers for an existing data
set. They should be able to go through a (possibly partial) list of
spending items, and for each one say "that goes in such-and-such a
category". They should then be able to save their list, name it, and
explain what it is. They should get a (public) URL for it, and it should
be added to an index.
- A user should be able to define a tree structure for an existing data
set. For each node of the tree, they need to specify which spending items
are aggregated together to make the node. They do this by defining simple
rules based on the classifier data. Having constructed a valid tree, they
should be able to name it and explain what it is. They should get a
(public) URL for it, and it should be added to an index.
- Given a data set and a tree structure, we should be able to make a
customised site that looks a bit like the WDMMG prototype. We should offer
this both as a stand-alone web page with OKF branding, and also as an
embeddable spending browser for inclusion in other web pages.
- Given a data set and a set of classifiers, we should be able to dump
out a CSV file with one column for each classifier, one column for date
and one column for amount. We should probably have a primary key column
too. (There is a sketch of this below, after the list.)
- We should provide whatever other trendy interfaces are easy, e.g. a
SPARQL end point. I'm going to go to this conference next week to get some
ideas: http://www.meetup.com/Web-Of-Data/
I don't know if this feature set is realistic, or if it is consistent with
the goals of the project. Please comment liberally.
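For the CSV dump, I am imagining something like this sketch (the column
names are invented):

    # Dump a data set as CSV: primary key, one column per classifier,
    # then date and amount.
    import csv, sys

    def dump(items, classifiers, out=sys.stdout):
        writer = csv.writer(out)
        writer.writerow(["id"] + classifiers + ["date", "amount"])
        for item in items:
            writer.writerow([item["id"]]
                            + [item.get(c, "") for c in classifiers]
                            + [item["date"], item["amount"]])

    dump([{"id": 1, "region": "London", "function": "Welfare",
           "date": "2009-04-01", "amount": 60.0}],
         ["region", "function"])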
Table structure
---------------
My straw man data model, which is based in part on something Rufus put
together, has the following tables:
- DataSet. Each row represents one whole data set. There are columns for
meta-data about the data set: name, description, origin.
- SpendingItem. Each row represents one item of spending. There are
columns for: DataSet key, date, amount. Alternative name: Posting.
- Classifier. Each row represents a way of categorising the SpendingItems
in a particular data set. For example, for the CRA, there would be three
rows: Region, Department, and Function. There are columns for: DataSet
key, name, description, origin.
- SpendingAttribute. Each row represents the statement that a particular
SpendingItem is placed in a particular class by a particular Classifier.
There are columns for: SpendingItem key, Classifier key, class (i.e.
classifier value). This is conceptually a triple space. We have a
constraint that the Classifier and the SpendingItem belong to the same
DataSet.
- Tree. Each row represents a tree structure for some data set. There are
columns for meta-data about the tree: DataSet key, name, description,
origin.
- Aggregate. Each row represents a node in a tree. There are columns for
meta-data about the node: Tree key, name, description, colour(?). There is
also a column for the parent Aggregate key. Alternative name: TreeNode.
- AggregateAttribute. Each row represents a rule that a SpendingItem must
obey in order to be included in the Aggregate. There are columns for:
Aggregate key, Classifier key, class (i.e. classifier value). We have a
constraint that the Classifier and the Aggregate must belong to the same
DataSet.
The rule for deciding whether a given SpendingItem is included in a given
Aggregate is that (1) for all AggregateAttributes of the Aggregate, the
SpendingItem must have a matching SpendingAttribute, and (2) it must be
included in the Aggregate's parent, if any. (I'm not sure this is the best
table structure, but it's one that works, I think).
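To state that rule as code, a sketch over in-memory rows rather than real
tables:

    # Is a given SpendingItem included in a given Aggregate?
    def included(item_attrs, aggregate):
        """item_attrs: {classifier: class} -- the SpendingAttributes of
        one SpendingItem.  aggregate: {'rules': {classifier: class},
        'parent': aggregate or None} -- its AggregateAttributes."""
        for classifier, cls in aggregate["rules"].items():
            if item_attrs.get(classifier) != cls:
                return False          # (1) every rule must match, and
        if aggregate["parent"] is not None:
            return included(item_attrs, aggregate["parent"])  # (2) the parent's too
        return True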
It is not clear just from the table structure what the user interface
might look like. The users are clearly going to need something rather
higher-level than merely a way of editing the records in the tables. I
will leave this problem open.
Summary
=======
I have tried to explain what kind of data we've got. The three data sets I
have encountered so far all seem to have roughly the same structure, which
is pleasant. However, this structure is not wholly compatible with our
goals. We would like to unify the data sets into a proper set of accounts
but I (personally) don't think it's feasible at the moment.
What we can do relatively easily is to construct a customisable framework
for presenting the data sets. I am imagining a straightforward
generalisation of the WDMMG prototype to support multiple (unrelated) data
sets, crowd-sourced classifiers, and crowd-sourced tree structures. I have
described a data model that could serve for this reduced ambition.
I'm sure Rufus will correct me if I have misrepresented him or if
I have forgotten something.
Alistair