[okfn-labs] Help clean up the UK spending data!

Friedrich Lindenberg friedrich.lindenberg at okfn.org
Tue Jul 31 14:30:08 BST 2012

Hi all,

I've been trying to update the UK departmental spending data over the
weekend. We're offering access to this data to people at
http://data.gov.uk/openspending - but it hasn't been updated since
March last year. The main problem with this data is that each public
body in the UK reports each month's spending in a (manually composed)
spreadsheet. While they are supposed to contain the same data, each
spreadsheet is formatted a bit differently from all the others. At
this moment, we have 5800 of them.

In order to get these reports into a common format, we need to match
each column title in these reports to a pre-defined column heading in
the cleaned dataset. Most of the time, this is really straightforward
- but in some cases it isn't: a report may have ambiguous column
names. Because of these hidden caveats, the only clean option to do
this is to do all mappings manually. I've made most of the mappings
over the weekend but now my wrist is pretty swollen and it's time for
the crowd :)

So, here's what you do:

1) Go to http://nomenklatura.okfnlabs.org/uk25k-column-names and sign
in with your GitHub account.

2) Click reconcile and begin matching. Here's the game rules:

a) The format for the candidates is Name-of-Column @

b) The preferred mappings are Amount, Date, DepartmentFamilyName,
EntityName, SupplierName, ExpenseArea and ExpenseType.

c) If you're not sure how to map a value, either press "Try Another"
or add a link to the "NONSTANDARD" pseudo-column.

d) All descriptions go into "Narrative" (only one per sheet, though).

e) Expense Type Code != Expense Type - try to keep code fields and
text fields separate.

f) If something isn't a header, mark it "Invalid".

I appreciate that this is an awkward process, but have come to the
conclusion that using more automation will just give us bad data. At
the moment, we've go around 4.3mio records extracted - with your help
we can bring this up to 6mio.

Many thanks for any help,

 - Friedrich

More information about the okfn-labs mailing list