[wdmmg-discuss] A data model

Tue Feb 16 12:59:57 UTC 2010

(thanks for forwarding this to me Lisa, as I wasn't on the list)

This is a good start for discussion.

However, I don't think Alistair's straw model is general enough.
Specifically, I think exhaustive, non-overlapping data sets will be
rare. Maybe you've got them all, or nearly all, already.

I think the model needs to be able to store a broader range of facts
about public finance, with additional separate facts about any known
exhaustiveness/non-overlapping properties.

Some example such datasets:

1. Amount spend on on-shore wind research. The tables at the end of
this PDF show various research projects and amounts of money. I'd like
to be able to make it more structured, and upload it into WDDMG.
    http://www.whatdotheyknow.com/request/23292/response/60175/attach/3/Lisa%20Evans.pdf
That's from an FOI request by Lisa Evans:
    http://www.whatdotheyknow.com/request/yearly_investment_in_uk_on_shore

2. Ministerial salaries. This is probably not the best source, as it
might need more detail as to how much a minister was actually paid
(given they were not minister for integral numbers of years). 
    http://www.parliament.uk/documents/upload/M06.pdf
But I would like to be able to get only slightly more info than this
PDF, and upload facts about it into WDMMG.

3. Cost of the My Cleveland crime map. This is a documented set of
financial facts about one particular contract. I'd like to upload the
existence of that specific contract and its amount as a fact into
WDMMG.
    http://www.whatdotheyknow.com/request/23592/response/70132/attach/html/2/contract.pdf.html
>From an FOI request by Julian Todd:
    http://www.whatdotheyknow.com/request/cost_and_contracts_for_myclevela
You can imagine I might want to tag it, and upload similar facts for
other police authorities. I should also be forced to provide my
source.

4. The historic bank statements of one particular account in one
Government department. Suppose it is the account of a Regional
Development Agency which is used to pay for projects. I should be able
to upload all the transactions on the statements. I should be able to
add facts saying that they together definitely represent all the money
spent in a particular category (RDA project spending). I should be
able to cross link them as corresponding to particular contracts I
have already uploaded facts about.

What I'm describing is a "structured Wikipedia / Open Street Map for
facts about Government money". This is quite different from a way of
visualising national data sets in fancy ways. It is also quite
different from recreating double entry book keeping accounts. It would
allow either, though.

My guess is that complete data sets, and complete details about
accounts will be relatively rare. But nevertheless, we can still
provide and store useful information where we do have it, and let
people search and find and add to that information.

Keen researchers will make the data complete in a particular area. We
will find it doesn't quite add up, and need to investigate why. When
it is understood fully, the metadata would crosslink the low level
transactions to higher level facts (e.g. from CRA).

Hope that helps. It is ambitious and a bit wild, I know. But I don't
think the low level structure need be that bad. I'm guessing something
akin to Open Street Map's nodes and lines (instead, maybe money and
accounts/bodies), with a set of agreed and evolving tags on top of
that will do.

Francis

On Tue, Feb 16, 2010 at 12:09:37PM +0000, Lisa Evans wrote:
> ---------- Forwarded message ----------
> Date: Tue, 16 Feb 2010 04:19:10 +0000 (GMT)
> From: Alistair Turnbull <apt1002 at goose.minworks.co.uk>
> Reply-To: Alistair Turnbull <apt1002 at mupsych.org>
> To: Where does my money go? <wdmmg-discuss at lists.okfn.org>
> Subject: [wdmmg-discuss] A data model
> 
> De-lurk
> =======
> 
> Hello! I am a contractor, and I hope I will be working on the Where
> Does My Money Go project for the next few months. My skills are in
> software engineering.
> 
> 
> 
> Meeting today
> =============
> 
> Earlier today I met Rufus Pollock and John Bywater. Our goal was to
> work out what kind of store we're going to build for the data that
> we have, and for the data that we expect to get in the future. In
> the rest of this post I'm going to try to summarise what we talked
> about.
> 
> There are a lot of open questions, but the big one is what the data
> model should be. This has to balance a lot of competing
> considerations:
> 
>  - What data have we actually got?
> 
>  - What do we want to use it for?
> 
>  - If we crowd-source stuff, what kind of imperfections do we need
> to cope with?
> 
>  - What is technically feasible?
> 
> In a way, the data model summarises the essential features of the
> whole project, so that's what we focused on today.
> 
> 
> 
> Data sets
> =========
> 
> I'm new to this project, so I'm sure the following list is
> incomplete and probably not quite accurate! Please correct and
> append.
> 
>  - Country Regional Analysis (CRA) - All UK government spending,
> broken into about 26,000 rows by about ten years. The rows are
> distinguished by a number of classifiers, including department,
> region, and function, where possible. Apart from year, the
> classifiers do not seem to be orthogonal.
> 
>  - UK Local Spending Review - All spending by UK local authorities.
> The spending is broken down first by the local authority
> responsible, then by function. I'm not sure how many years' data we
> have.
> 
>  - Public Expenditure Statistical Analysis (PESA)  - The highest
> level breakdown of Government spending. Similar to CRA but across
> countries and at a coarser granularity. This data set was used for
> the prototype at http://www.wheredoesmymoneygo.org/prototype/ .
> 
>  - Supplementary Budget Information - Can somebody fill in a description here?
> 
>  - More?
> 
> Exhaustive and non-overlapping
> ------------------------------
> 
> The CRA, the PESA, and the UK Local Spending Review all have the
> following common features:
> 
>  - Each data set, *taken individually*, is exhaustive for some area
> of spending. No money was spent that does not appear somewhere in
> the list.
> 
>  - Each data set, *taken individually*, consists of non-overlapping
> items of expenditure. No item of spending appears twice in the list.
> 
> Therefore, for each data set, if we add up all the items we obtain
> the correct figure for the total spending. Similarly, if we add up
> all the spending for a particular region, department, or function,
> then we obtain the correct total for that region, department, or
> function.
> 
> These properties are important. A data set with these properties is
> useful; you can aggregate the data in whatever way makes the point
> you want to make. A data set without these properties is
> considerably less useful.
> 
> Many data sets
> --------------
> 
> Our data sets are exhaustive and non-overlapping taken individually,
> but taken together they are not. There is no reliable way to combine
> them into a single data set that is exhaustive and non-overlapping.
> 
> I reckon this might be our *primary obstacle*.
> 
> It would be lovely to be able to present a single, unified view of
> all government spending, and gradually extend it and refine it with
> new data until it meets all conceivable needs. We should certainly
> strive in that direction, but it is already clear that the previous
> sentence is not an accurate description of the project at *any*
> point in the future. We will always have multiple data sets,
> probably many, and we must design for that.
> 
> Non-orthogonal classifiers
> --------------------------
> 
> Rufus has made an excellent observation about the nature of the data
> sets we have: apart from year, the classifiers applied to the
> government spending items (e.g. department, region, function) are
> *not orthogonal*.
> 
> It would be nice to be able to make a pivot table with one axis for
> department, one axis for region, one axis for function, and one axis
> for year. We would then distribute the items of spending into the
> cells of this big four-dimensional table, and compute totals along
> every axis. However, this is not a model that fits the data well; we
> cannot realistically do this.
> 
> There are essentially four problems with the pivot-table model:
> 
>  - Most of the cells and totals would be zero.
> 
>  - Different data sets require different axes.
> 
>  - Not all classifiers are defined for all spending items.
> 
>  - We will want to add axes to existing data sets. Indeed, I think
> we should consider crowd sourcing new classifiers for existing data.
> 
> So, if a pivot table is the wrong model, what is right? I think the
> WDMMG prototype has the right model: one axis for year, crossed with
> a tree structure.
> 
> Aggregation is subjective
> -------------------------
> 
> The WDMMG prototype presents the spending for a given year as a
> tree, and provides a lovely interface for browsing it. The question
> arises: what tree structure to use?
> 
> I posit that the answer depends on what you want to use the data
> for. For example, if you work for a lobby group that wants to argue
> that London gets preferential treatment, then you might want a tree
> like this:
> 
> 	 +- Spending not attached to any region
> 	|
> 	 +- Welfare
> 	| |
> 	|   +- London
> 	|   +- Cornwall
> 	|   +- North East
> 	|
> 	 +- Civil engineering projects
> 	| |
> 	|   +- Motorways
> 	| | |
> 	| |   +- London
> 	| |   +- Cornwall
> 	| |   +- North East
> 	| |
> 	|   +- Sports facilities
> 	| | |
> 	| |   +- Olympics
> 	| | |
> 	| |   +- Other London
> 	| |   +- Cornwall
> 	| |   +- North East
> 	| |
> 		 etc...
> 
> However, if you're arguing for more spending on education, you might
> want a completely different tree:
> 
> 	 +- Education
> 	| |
> 	|   +- Primary
> 	|   +- Secondary
> 	|   +- Higher
> 	|
> 	 +- Government debt
> 	|
> 	 +- Military
> 	| |
> 	|   +- War in Iraq
> 	|   +- War in Afghanistan
> 	|   +- Trident
> 	|
> 	 +- NHS
> 	| |
> 		 etc...
> 
> The OKF itself might want to provide a view of the data in which the
> top-level breakdown is by region. This would serve an audience of
> local journalists, campaigners and lobbyists, who care only about
> their own region. That would require a different tree structure
> again.
> 
> At the absolutely most detailed level, these trees have the same
> spending items at the leaves, and as such they are different
> presentations of the same data. However, structures of the trees are
> quite different.
> 
> I think we should consider letting users define their own tree
> structures. I think it would be valuable to provide a service where
> anybody can go to our site, take our data, and present it the way
> they want it. They can then get a URL which they can publish on
> their own site.
> 
> Some will want to link to the OKF site, in order to get the
> credibility associated with the OKF branding (i.e. yes, this data is
> real and we didn't just make it up). Others might want merely to
> embed browsable diagrams generated by us into their own content.
> However, this is a presentational distinction; the impact on the
> data model is that we should not impose any particular tree, but
> should instead allow users to define their own.
> 
> Disaggregation is impossible
> ----------------------------
> 
> Unfortunately, many of the data sets we receive will already have
> been aggregated. We may also expect to receive the same data
> aggregated in two or more different ways. I think Rufus said that
> the PESA data set is an example.
> 
> It is basically impossible to disaggregate a data set. If you know
> the total spending for each region, for example, and also the total
> spending for each department, it is not possible to recover the
> spending of a particular department in a particular region.
> 
> This is another reason why we are going to have to manage multiple
> data sets. However, this phenomenon is more subtle. If we have two
> data sets, and we know that they are aggregations of the same
> (unknown) underlying data set, then the two data sets are partially
> redundant.
> 
> For example, suppose we have one data set that breaks down spending
> by department and function, and suppose we have another that breaks
> the same spending down by region and function. Then we have two ways
> of calculating the spending broken down by function alone. We can
> either start from the first data set and sum over department, or we
> can start from the second data set and sum over region. The answers
> should be identical.
> 
> It feels like the Right Thing To Do to attempt to represent this
> sort of redundancy in the data model, when we know it exists.
> Alternatively, we could just give up, and model the data sets as
> completely independent.
> 
> 
> 
> Accounting
> ==========
> 
> Rufus has an excellent long-term goal of munging all the data we can
> get into a proper set of accounts. To be specific, we want to use
> double-entry bookkeeping.
> 
> The data model for this kind of accounting is well known, and has
> been in use by accountants all over the world for centuries, which
> is obviously a plus. However, Rufus, John and I discovered that we
> all used different terminology for the same concepts. Therefore, I
> think it is worth me writing out the data model explicitly, as if it
> is unfamiliar.
> 
> There are three database tables:
> 
>  - Accounts. An account is a place where money can be. An example of
> an account is an ordinary bank account. Another example is your
> wallet. Accounts have some sort of ownership. Ownership is broadly
> interpreted, and may encompass things like earmarking for a
> particular purpose. It is also possible to have an account that
> "contains" money that is owed to you, but which you have not yet
> received. It is a flexible concept.
> 
>  - Transactions (which John called "events"). A transaction is an
> event that causes money to move around between accounts. An example
> of a transaction is buying something. Another example is signing a
> contract (since that can create a debt). Transactions have a date.
> 
>  - Postings (which John called "entries"). A posting records the
> effect of one transaction on one account. Postings are the items
> that show up on a bank statement. A single transaction always has
> postings on at least two accounts (e.g. source and destination) and
> those postings always add to zero. Postings have an amount.
> 
> Statements and balances
> -----------------------
> 
> A "statement" is a list of all the postings against a particular
> account. If the account is closed (e.g. because we are studying
> historical accounts of some organisation that no longer exists) then
> the postings should all add up to zero. If the account is still open
> then the total of all the postings so far is the "current balance"
> of the account. As an optimisation, the current balance is often
> stored as a field in the account table, but it is formally redundant
> with the postings.
> 
> It is also possible to work out the balance of an account at any
> earlier time, by summing the postings up to that time. By working
> out the balance at the beginning and end of some period of time
> (e.g. a year), it is possible to draw up a statement for just that
> period. The statement consists of the initial balance, the postings
> during the period, and the final balance. The total of the postings
> should be equal to the difference between the final and initial
> balances.
> 
> We are likely to receive data sets in the form of statements for
> particular periods. Also, it would be useful to publish statements
> in that form.
> 
> Aggregation
> -----------
> 
> Accounts can be aggregated. To merge two accounts, interleave their
> postings in date order. The effect is to add the two balances.
> 
> Transactions can also be aggregated. To merge two transactions, form
> the union of their sets of postings. Postings to the same account
> can then be merged by adding their amounts. Note that it is possible
> in principle to aggregate transactions even if they are not between
> the same accounts; this is not common.
> 
> Thus, for double-entry bookkeeping there are two distinct kinds of
> aggregation. We are likely to receive data with aggregated accounts
> and transactions, and we will probably want to publish data in that
> form too.
> 
> Mapping our data sets onto double-entry bookkeeping
> ---------------------------------------------------
> 
> The CRA, the PESA and the Spending Review data sets are not in
> double-entry bookkeeping form. They can be mapped onto it in at
> least the following ways:
> 
> 1a. We could make a single account for "the government",
> representing all the money available for spending. We could then
> make one account for every combination of classifiers (other than
> year), representing all the money that has already been spent on a
> particular thing. Then, each item of spending could become a
> transaction from the government to somewhere else.
> 
> 1b. We could choose one of the tree structures in the section
> "Aggregation is subjective" above. We could then make an account for
> each node in the tree. Then, each item of spending could become a
> set of transactions each representing one step along the path down
> the tree from "the government" (the root of the tree) to one of the
> leaves.
> 
> 2. We could make an account for every combination of classifiers,
> representing the money earmarked for spending on a particular thing.
> We could then make a single account for "the rest of the world",
> representing all the money that has been spent. Then, each item of
> spending could become a transaction from one of the earmarking
> accounts to the rest of the world. This is the dual of 1.
> 
> 3. We could choose one of the classifiers, e.g. department, and make
> an account for each department. We could then choose another
> classifier, e.g. function, and make an account for each function.
> Then each item of spending could become a transaction from a
> particular department to a particular function.
> 
> However, these mappings are obviously all wrong. 1a would merely
> construct an obfuscated version of the original data set, while
> teaching nobody anything. 1b would fail due to the subjectivity of
> the tree structure; we could not do it in a way that would work for
> everybody. 2 is the dual of 1, and has the same problem, in addition
> to being rather unintuitive. 3 is doubly subjective, since we have
> to choose two classifiers from many. Furthermore, information in the
> classifiers that we do not choose is largely destroyed by mapping 3.
> 
> The problem here is that this mapping problem is ill-posed. We are
> looking at it the wrong way around.
> 
> Mapping double-entry bookkeeping onto our data sets
> ---------------------------------------------------
> 
> If we did manage somehow to construct a full set of accounts for the
> whole of government, we should in principle be able to reconstruct
> the data sets we have, i.e. the CRA, the PESA and the Spending
> Review.
> 
> To do this for e.g. the CRA, we would have to identify the boundary
> between the set of accounts which the CRA treats as "inside" the
> government, and the set of accounts which it treats as "outside" the
> government. Then, we would aggregate all the "inside" accounts
> together into one account called "the government", and aggregagate
> all the "outside" accounts together into one account called "the
> rest of the world". The transactions which cross the boundary would
> end up with equal and opposite postings in each of the two accounts.
> If we then make a statement for "the government", and erase the
> intial and final balances, we would recover the CRA data set.
> 
> To spell it out:
> 
>  - The spending items in data sets are *postings*.
> 
>  - The data sets have already been severely aggregated over
> *accounts*, though they retain good information about transactions.
> 
>  - The classifiers in the data sets are properties of *transactions*.
> 
> We could then do the same thing for the UK Local Spending Review
> data. Again we would have to separate the "inside" accounts from the
> "outside" accounts. We might find that the boundary is in a
> different place. Then, when we try to reconstruct the PESA data set
> we might discover yet another accounting boundary.
> 
> Implications for the data model
> -------------------------------
> 
> I fully expect to discover that three accounting boundaries (one for
> each of our data sets) are different (but this is something we need
> to investigate). In other words, we have available to us three
> different aggregations of the same accounts.
> 
> This is a pity, because, as explained above, disaggregation is
> impossible. Each of these data sets is a *constraint* on the full
> double-entry bookkeeping accounts, but no matter how many
> constraints of this form we acquire, we will not be able to
> reconstruct the original accounts.
> 
> If we do want to aim for a double-entry bookkeeping model, we need
> to try to get data of a fundamentally different form. I'm not sure
> what it needs to look like - this stuff is really hard to think
> about!
> 
> I think this reasoning also applies to crowd-sourced accounting
> information. I have doubts that we can get any useful information
> about transactions and accounts from the crowd at all. I would be
> delighted to be persuaded otherwise.
> 
> So, while I agree with Rufus that double-entry bookkeeping is a good
> target to aim for, I can't currently see any way of attaining it.
> Therefore, I think we should aim lower, at least at first. It is
> only fair to say that Rufus and I have not yet agreed on this!
> 
> 
> 
> A possible data model
> =====================
> 
> (This is now partly beyond the scope of what we discussed in the
> meeting, but only by my filling in some details).
> 
> I'm not going to pretend that I have everything worked out, but let
> me construct a straw man. Let's represent our (many) data sets
> pretty much in the form that we receive them. Let's abandon for the
> moment the ambition to relate different data sets to each other. Let
> us instead aim to provide the following functionality:
> 
>  - A user should be able to upload a complete new data set, which
> must satisfy the "exhaustive" and "non-overlapping" properties. They
> should be able to name it and explain what it is. They should get a
> (public) URL for it, and it should be added to an index.
> 
>  - A user should be able to define new classifiers for an existing
> data set. They should be able to go through a (possibly partial)
> list of spending items, and for each one say "that goes in
> such-and-such a category". They should then be able to save their
> list, name it, and explain what it is. They should get a (public)
> URL for it, and it should be added to an index.
> 
>  - A user should be able to define a tree structure for an existing
> data set. For each node of the tree, they need to specify which
> spending items are aggregated together to make the node. They do
> this by defining simple rules based on the classifier data. Having
> constructed a valid tree, they should be able to name it and explain
> what it is. They should get a (public) URL for it, and it should be
> added to an index.
> 
>  - Given a data set and a tree structure, we should be able to make
> a customised site that looks a bit like the WDMMG prototype. We
> should offer this both as a stand-alone web page with OKF branding,
> and also as an embeddable spending browser for inclusion in other
> web pages.
> 
>  - Given a data set and a set of classifiers, we should be able to
> dump out a CSV file with one column for each classifier, one column
> for date and one column for amount. We should probably have a
> primary key column too.
> 
>  - We should provide whatever other trendy interfaces are easy, e.g.
> a SPARQL end point. I'm going to go to this conference next week to
> get some ideas: http://www.meetup.com/Web-Of-Data/
> 
> I don't know if this feature set is realistic, or if it is
> consistent with the goals of the project. Please comment liberally.
> 
> Table structure
> ---------------
> 
> My straw man data model, which is based in part on something Rufus
> put together, has the following tables:
> 
>  - DataSet. Each row represents one whole data set. There are column
> for meta-data about the data set: name, description, origin.
> 
>  - SpendingItem. Each row represents one item of spending. There are
> columns for: DataSet key, date, amount. Alternative name: Posting.
> 
>  - Classifier. Each row represents a way of categorising the
> SpendingItems in a particular data set. For example, for the CRA,
> there would be three rows: Region, Department, and Function. There
> are columns for: DataSet key, name, description, origin.
> 
>  - SpendingAttribute. Each row represents the statement that a
> particular Spending item is placed in a particular class by a
> particular Classifier. There are columns for: SpendingItem key,
> Classifier key, class (i.e. classifier value). This is conceptually
> a triple space. We have a constraint that the Classifier and the
> SpendingItem belong to the same DataSet.
> 
>  - Tree. Each row represents a tree structure for some data set.
> There are columns for meta-data about the tree: DataSet key, name,
> description, origin.
> 
>  - Aggregate. Each row represents a node in a tree. There are
> columns for meta-data about the node: Tree key, name, descrption,
> colour(?). There is also a column for the parent Aggregate key.
> Alternative name: TreeNode.
> 
>  - AggregateAttribute. Each row represents a rule that a
> SpendingItem must obey in order to be included in the Aggregate.
> There are columns for: Aggregate key, Classifier key, class (i.e.
> classifier value). We have a constraint that the Classifier and the
> Aggregate must belong to the same DataSet.
> 
> The rule for deciding whether a given SpendingItem is included in a
> given Aggregate is that (1) for all AggregateAttributes of the
> Aggregate, the SpendingItem must have a matching SpendingAttribute,
> and (2) it must be included in the Aggregate's parent, if any. (I'm
> not sure this is the best table structure, but it's one that works,
> I think).
> 
> It is not clear just from the table structure what the user
> interface might look like. The users are clearly going to need
> something rather higher-level than merely a way of editing the
> records in the tables. I will leave this problem open.
> 
> 
> 
> Summary
> =======
> 
> I have tried to explain what kind of data we've got. The three data
> sets I have encountered so far all seem to have roughly the same
> structure, which is pleasant. However, this structure is not wholly
> compatible with our goals. We would like to unify the data sets into
> a proper set of accounts but I (personally) don't think it's
> feasible at the moment.
> 
> What we can do relatively easily is to construct a customisable
> framework for presenting the data sets. I am imagining a
> straightforward generalisation of the WDMMG prototype to support
> multiple (unrelated) data sets, crowd-sourced classifiers, and
> crowd-sourced tree structures. I have described a data model that
> could serve for this reduced ambition.
> 
> I'm sure Rufus will correct me if I have misrepresented him or if I
> have forgotten something.
> 
> 	Alistair
> 
> _______________________________________________
> wdmmg-discuss mailing list
> wdmmg-discuss at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/wdmmg-discuss
> 
> 

-- 
Help beat party propaganda at the next election - www.democracyclub.org.uk