[wdmmg-discuss] A data model

Rufus Pollock rufus.pollock at okfn.org
Mon Mar 1 10:38:32 UTC 2010


Dear Alistair,

Thank you for such an excellently detailed analysis. Comments
interleaved below with plentiful snipping to save on attention :)

I should note that I've also updated our store spec page [2] while
responding to this email.

[2]: <http://wiki.okfn.org/p/Where_Does_My_Money_Go/Store>

Regards,

Rufus

[...]

> Data sets
> =========
>
> I'm new to this project, so I'm sure the following list is incomplete and
> probably not quite accurate! Please correct and append.

[...]

Thanks for summarizing, Alistair. The "home page" for all data research work is here:

<http://wiki.okfn.org/p/Where_Does_My_Money_Go/Data>

With an overview here:

<http://wiki.okfn.org/p/Where_Does_My_Money_Go/Data/Overview>

> Exhaustive and non-overlapping
> ------------------------------

In my original design document [1] I described these terms as
"complete" and "valid". I think I prefer "complete" to "exhaustive"
but "non-overlapping" to "valid". The two properties together could be
termed a partition -- a partition being a division of a set into
subsets such that each element occurs in one and only one subset.

[1]: http://lists.okfn.org/pipermail/okfn-help/2009-October/000361.html

> The CRA, the PESA, and the UK Local Spending Review all have the following
> common features:
>
>   - Each data set, *taken individually*, is exhaustive for some area of
> spending. No money was spent that does not appear somewhere in the list.
>
>   - Each data set, *taken individually*, consists of non-overlapping items
> of expenditure. No item of spending appears twice in the list.

[...]

> These properties are important. A data set with these properties is
> useful; you can aggregate the data in whatever way makes the point you
> want to make. A data set without these properties is considerably less
> useful.

Agreed.
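
As a concrete aside, both properties are cheap to check mechanically
once a dataset is loaded. A minimal sketch in Python (field names and
figures are hypothetical, assuming each item carries a unique id and
we know the published total for the spending area):

    # Sketch: checking the "complete" (exhaustive) and
    # "non-overlapping" properties. Field names and figures are
    # hypothetical.

    def is_non_overlapping(items):
        """No item of spending appears twice: all ids are distinct."""
        ids = [item["id"] for item in items]
        return len(ids) == len(set(ids))

    def is_complete(items, published_total):
        """No money is missing: the items sum to the known total."""
        return sum(item["amount"] for item in items) == published_total

    items = [
        {"id": "cra-001", "amount": 120},
        {"id": "cra-002", "amount": 80},
    ]

    print(is_non_overlapping(items))   # True
    print(is_complete(items, 200))     # True

Together the two checks say the items partition the spending area.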


> Many data sets
> --------------
>
> Our data sets are exhaustive and non-overlapping taken individually, but
> taken together they are not. There is no reliable way to combine them into
> a single data set that is exhaustive and non-overlapping.
>
> I reckon this might be our *primary obstacle*.

I term this the "mapping problem": given two sets of distinct
"Accounts" and "Transactions/Postings" (each of which is partitioned),
can we map them onto a third set of Accounts and Transactions that is
also partitioned?

Clearly the answer must be no, in general. However, I don't think we
need to confront the general problem. Rather than go into this more,
since this isn't required at stage 1 let's defer it for the time
being ...

[...]


> Aggregation is subjective
> -------------------------
>
> The WDMMG prototype presents the spending for a given year as a tree, and
> provides a lovely interface for browsing it. The question arises: what
> tree structure to use?

[...]

I agree with you that allowing people to aggregate in different ways
is useful. However, it is probably a relatively low priority
requirement at this stage.


> Disaggregation is impossible
> ----------------------------

In [1] I referred to this as the "Reversed Engineering" problem. Put
in its simplest terms it could be phrased as: given the sums of all
columns and all rows of a matrix, reconstruct the matrix.

Clearly this is impossible (you can't get m*n items out of m+n -- at
least for m,n > 2). Of course, one may be able to do *some*
disaggregation. Again, however, I'm not sure we need to spend too much
effort on this problem now -- I don't think it is crucial to the
design questions we have ...
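
To illustrate with toy figures of my own (not real data): two
different matrices can share exactly the same row and column totals,
so the totals alone cannot pin down the entries.

    # Toy illustration of the disaggregation problem: two different
    # matrices with identical row sums and column sums.

    a = [[1, 2, 3],
         [4, 5, 6]]

    b = [[2, 1, 3],
         [3, 6, 6]]

    def row_sums(m):
        return [sum(row) for row in m]

    def col_sums(m):
        return [sum(col) for col in zip(*m)]

    print(row_sums(a), col_sums(a))   # [6, 15] [5, 7, 9]
    print(row_sums(b), col_sums(b))   # [6, 15] [5, 7, 9]
    print(a == b)                     # False: same aggregates, different detail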

[...]

> Accounting
> ==========
>
> Rufus has an excellent long-term goal of munging all the data we can get
> into a proper set of accounts. To be specific, we want to use double-entry
> bookkeeping.

I should emphasize that this isn't a long-term goal. The domain model
of double-entry (Account + Transaction + Posting -- see the picture on
the store page [2]) is very simple but will accommodate both very
simple and quite complex information. As such I think it is a very
good "lowest-common-denominator" for all our work.

[2]: http://wiki.okfn.org/p/Where_Does_My_Money_Go/Store
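
For anyone who hasn't looked at the store page yet, a very rough
sketch of the ATP shape in Python (the field names here are my
shorthand for this email, not the spec -- see [2] for the actual
model):

    # Rough sketch of the Account-Transaction-Posting (ATP) shape.
    # Field names are illustrative shorthand, not the spec.

    from dataclasses import dataclass, field

    @dataclass
    class Account:
        name: str
        metadata: dict = field(default_factory=dict)  # e.g. Department, Region, COFOG

    @dataclass
    class Posting:
        account: Account
        amount: float   # postings within a transaction sum to zero

    @dataclass
    class Transaction:
        date: str
        postings: list
        metadata: dict = field(default_factory=dict)

    # A single spending item becomes one transaction with two postings:
    govt = Account("Central government")
    health = Account("Health spending", {"cofog": "07"})
    t = Transaction("2009-04-01",
                    [Posting(govt, -100.0), Posting(health, 100.0)])
    assert sum(p.amount for p in t.postings) == 0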

[... description of double-entry model]

> Statements and balances
> -----------------------
>
> A "statement" is a list of all the postings against a particular account.
> If the account is closed (e.g. because we are studying historical accounts
> of some organisation that no longer exists) then the postings should all
> add up to zero. If the account is still open then the total of all the
> postings so far is the "current balance" of the account. As an
> optimisation, the current balance is often stored as a field in the
> account table, but it is formally redundant with the postings.

[...]

> We are likely to receive data sets in the form of statements for
> particular periods. Also, it would be useful to publish statements in that
> form.

Note that, when no detail of individual transactions is available and
one knows the source of funds (e.g. central govt!), a statement (or,
more correctly, the difference between its start and end balances) can
be seen as equivalent to a single transaction/posting from the source
account into the statement account.
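
Put as a sketch (balances made up, account names invented):

    # Sketch: a period statement with no transaction detail collapsed
    # to a single transfer from a known source of funds. Balances and
    # account names are made up.

    def statement_as_single_transfer(start_balance, end_balance,
                                     source="Central government",
                                     target="Local authority X"):
        """Return the balancing pair of postings implied by the change."""
        delta = end_balance - start_balance
        return [(source, -delta), (target, +delta)]

    print(statement_as_single_transfer(1000.0, 1250.0))
    # [('Central government', -250.0), ('Local authority X', 250.0)]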

> Aggregation
> -----------

[...]

> Mapping our data sets onto double-entry bookkeeping
> ---------------------------------------------------
>
> The CRA, the PESA and the Spending Review data sets are not in
> double-entry bookkeeping form. They can be mapped onto it in at least the
> following ways:
>
> 1a. We could make a single account for "the government", representing all
> the money available for spending. We could then make one account for every
> combination of classifiers (other than year), representing all the money
> that has already been spent on a particular thing. Then, each item of
> spending could become a transaction from the government to somewhere else.
>
> 1b. We could choose one of the tree structures in the section "Aggregation
> is subjective" above. We could then make an account for each node in the
> tree. Then, each item of spending could become a set of transactions each
> representing one step along the path down the tree from "the government"
> (the root of the tree) to one of the leaves.

[...]

> However, these mappings are obviously all wrong. 1a would merely construct
> an obfuscated version of the original data set, while teaching nobody
> anything. 1b would fail due to the subjectivity of the tree structure; we

For any *given* aggregate dataset (such as PESA or CRA) this is
undoubtedly true. For each such dataset we can construct a better
mapping (and have done so already, in fact [3]). However, the real
question is whether there are benefits in the double-entry model when
we are storing many such datasets in a common domain model. I think
there are, since the double-entry model is a lowest common
denominator. Furthermore, I think it will scale up to the situation
where we get "real" (or "pseudo-real") double-entry accounts data, as
with e.g. COINS.

[...]

> The problem here is that this mapping problem is ill-posed. We are looking
> at it the wrong way around.

See previous comment :)

> Mapping double-entry bookkeeping onto our data sets
> ---------------------------------------------------
>
> If we did manage somehow to construct a full set of accounts for the whole
> of government, we should in principle be able to reconstruct the data sets
> we have, i.e. the CRA, the PESA and the Spending Review.

This is exactly what COINS is, it turns out. CRA and PESA are directly
derived from it, and for budget and out-turn COINS is a "book of prime
entry" and is used in a double-entry type manner.

> To do this for e.g. the CRA, we would have to identify the boundary
> between the set of accounts which the CRA treats as "inside" the
> government, and the set of accounts which it treats as "outside" the
> government. Then, we would aggregate all the "inside" accounts together
> into one account called "the government", and aggregate all the
> "outside" accounts together into one account called "the rest of the
> world". The transactions which cross the boundary would end up with equal
> and opposite postings in each of the two accounts. If we then make a
> statement for "the government", and erase the initial and final balances,
> we would recover the CRA data set.
>
> To spell it out:
>
>   - The spending items in data sets are *postings*.
>
>   - The data sets have already been severely aggregated over *accounts*,
> though they retain good information about transactions.

Right, but aggregate account objects which aggregate the accounts
beneath them are standard stuff, so this fits well with the
double-entry model.
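
By "standard stuff" I mean something like this (names and numbers
invented for illustration): an aggregate account's figure is just the
sum over the accounts beneath it.

    # Sketch: an aggregate account whose figure is the sum of the
    # accounts beneath it. Names and numbers are invented.

    tree = {
        "The government": ["Dept of Health", "Dept for Transport"],
        "Dept of Health": [],
        "Dept for Transport": [],
    }

    balances = {"Dept of Health": 110.0, "Dept for Transport": 20.0}

    def aggregate_balance(account):
        children = tree.get(account, [])
        if not children:
            return balances.get(account, 0.0)
        return sum(aggregate_balance(child) for child in children)

    print(aggregate_balance("The government"))   # 130.0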

>   - The classifiers in the data sets are properties of *transactions*.

No, I'd say they were generally properties of accounts. E.g.
Department, Region and COFOG are all properties of the accounts, not
the transactions, IMO.

[...]

> Implications for the data model
> -------------------------------

[...]


> A possible data model
> =====================
>
> (This is now partly beyond the scope of what we discussed in the meeting,
> but only by my filling in some details).
>
> I'm not going to pretend that I have everything worked out, but let me
> construct a straw man. Let's represent our (many) data sets pretty much in
> the form that we receive them. Let's abandon for the moment the ambition

This is where I disagree. I think it is worth the (small) amount of
effort to convert into double-entry form. We should still be able to
reconstruct any dataset we enter from our store -- the "no-data-loss"
condition.

> to relate different data sets to each other. Let us instead aim to provide

Agreed. No attempt to relate different datasets for the time being
(other than for aggregation if that is possible i.e. if one dataset is
completely derivable from another).

> the following functionality:

Note we already have:

<http://wiki.okfn.org/p/Where_Does_My_Money_Go/Use_Cases#Store>

It may be worth integrating some of these there ...

>   - A user should be able to upload a complete new data set, which must
> satisfy the "exhaustive" and "non-overlapping" properties. They should be
> able to name it and explain what it is. They should get a (public) URL for
> it, and it should be added to an index.

Agreed.

>   - A user should be able to define new classifiers for an existing data
> set. They should be able to go through a (possibly partial) list of
> spending items, and for each one say "that goes in such-and-such a
> category". They should then be able to save their list, name it, and
> explain what it is. They should get a (public) URL for it, and it should
> be added to an index.

Agreed, though both here and below I'd like to clarify the meaning of
"spending items".

>   - A user should be able to define a tree structure for an existing data
> set. For each node of the tree, they need to specify which spending items
> are aggregated together to make the node. They do this by defining simple
> rules based on the classifier data. Having constructed a valid tree, they
> should be able to name it and explain what it is. They should get a
> (public) URL for it, and it should be added to an index.

Agreed, though I would say this has low priority.
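
For what it's worth, the rule-based nodes could be as simple as this
sketch (classifier names and rules invented for illustration):

    # Sketch: tree nodes defined as simple rules over classifier
    # values. Classifier names and rules are invented.

    items = [
        {"amount": 50, "region": "North East", "function": "Health"},
        {"amount": 30, "region": "North East", "function": "Transport"},
        {"amount": 70, "region": "London",     "function": "Health"},
    ]

    nodes = {
        "Health spending":     lambda i: i["function"] == "Health",
        "North East spending": lambda i: i["region"] == "North East",
    }

    for name, rule in nodes.items():
        total = sum(i["amount"] for i in items if rule(i))
        print(name, total)
    # Health spending 120
    # North East spending 80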

>   - Given a data set and a tree structure, we should be able to make a
> customised site that looks a bit like the WDMMG prototype. We should offer
> this both as a stand-alone web page with OKF branding, and also as an
> embeddable spending browser for inclusion in other web pages.

I think we need to ask the front-end people what they want here before
deciding on this.

>   - Given a data set and a set of classifiers, we should be able to dump
> out a CSV file with one column for each classifier, one column for date
> and one column for amount. We should probably have a primary key column
> too.

Low priority IMO.
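
Though if/when we want it, the dump itself is near-trivial -- a sketch
with Python's csv module (column and classifier names invented):

    # Sketch: dumping a dataset as CSV with one column per classifier
    # plus id, date and amount. Column and classifier names invented.

    import csv, sys

    classifiers = ["department", "region", "function"]
    items = [
        {"id": 1, "date": "2009-04-01", "amount": 100.0,
         "department": "DoH", "region": "London", "function": "Health"},
    ]

    writer = csv.DictWriter(sys.stdout,
                            fieldnames=["id", "date", "amount"] + classifiers)
    writer.writeheader()
    for item in items:
        writer.writerow(item)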

>   - We should provide whatever other trendy interfaces are easy, e.g. a
> SPARQL end point. I'm going to go to this conference next week to get some
> ideas: http://www.meetup.com/Web-Of-Data/

I think it is key that we provide JSON and whatever API the front-end team wants.

[...]

> Table structure
> ---------------

See for comparison the domain model laid out in [2] (which I've
updated and overhauled today).

[2]: <http://wiki.okfn.org/p/Where_Does_My_Money_Go/Store>

> My straw man data model, which is based in part on something Rufus put
> together, has the following tables:

I'd suggest starting with the basic double-entry model as detailed on
[2]: i.e. Account-Transaction-Posting (ATP).

>   - DataSet. Each row represents one whole data set. There are columns for
> meta-data about the data set: name, description, origin.

On [2] I call this a Slice. The reason is that we may wish to
construct Slices which don't come from an original DataSet, and
DataSet has a strong existing meaning. However, I'm not too bothered
either way.

>   - SpendingItem. Each row represents one item of spending. There are
> columns for: DataSet key, date, amount. Alternative name: Posting.

So this is Posting in ATP.

>   - Classifier. Each row represents a way of categorising the SpendingItems
> in a particular data set. For example, for the CRA, there would be three
> rows: Region, Department, and Function. There are columns for: DataSet
> key, name, description, origin.
>
>   - SpendingAttribute. Each row represents the statement that a particular
> Spending item is placed in a particular class by a particular Classifier.
> There are columns for: SpendingItem key, Classifier key, class (i.e.
> classifier value). This is conceptually a triple space. We have a
> constraint that the Classifier and the SpendingItem belong to the same
> DataSet.

This is just arbitrary metadata which I'd allow for Accounts and Transactions.
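
That is, something like a free-form key/value bag on each Account (and
Transaction), rather than a separate SpendingAttribute table -- a
sketch with invented keys:

    # Sketch: classifier values held as arbitrary key/value metadata
    # on an account. Keys and values are invented.

    account = {
        "name": "Dept of Health / North East / Health",
        "metadata": {
            "department": "Department of Health",
            "region": "North East",
            "cofog": "07",   # COFOG function code
        },
    }

    # Looking up a classifier value is then just a dictionary lookup:
    print(account["metadata"].get("region"))   # North East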

[...]

You mention Tree, Aggregate and AggregateAttribute. I don't think we
need these as yet, if at all (they are rather front-end oriented ...)

[...]

> Summary
> =======

[...]

> I'm sure Rufus will correct me if I have misrepresented him or if
> I have forgotten something.

See above :)



