[OpenSpending] Getting Greater London Authority spending into OpenSpending - an Update

Rufus Pollock rufus.pollock at okfn.org
Wed Apr 3 20:53:07 UTC 2013


Hi All (apologies if you just got a similar email - it may have been sent
early by accident!),

TL;DR: http://openspending.org/gb-local-gla and the
README<https://github.com/rgrp/dataset-gla#readme>

I've been working to get Greater London Authority data into OpenSpending
(as mentioned in the mail last week [1]). My motivation is a basic
question:

*Which companies got paid the most (and for doing what)?* (OS thingstodo
issue <https://github.com/openspending/thingstodo/issues/5>)

I wanted to share where I'm up to and some of the experience so far, as I
think it can inform our wider efforts - and illustrate the challenges of
just getting and cleaning up data.

*## Data Quality Issues*

First off, I'm keeping the code and
README<https://github.com/rgrp/dataset-gla#readme> for this work in a repo
on GitHub:
https://github.com/rgrp/dataset-gla

There are 61 CSV files as of March 2013 (a list can be found in
scrape.json<https://github.com/rgrp/dataset-gla/blob/master/scrape.json>
).

Unfortunately the "format" varies substantially across files (even though
they are all CSV!) which makes using this data a real pain. Some examples:

* The number of fields and their names vary across files (e.g. "SAP
Document no" vs "Document no")
* The number of blank columns and blank lines varies (some files have no
blank lines (good!), many have blank lines plus some metadata etc.)
* There is also at least one "bad" file which looks to be an Excel file
saved as CSV
* Amounts are frequently formatted with "," making them appear as strings
to computers
* Dates vary substantially in format, e.g. "16 Mar 2011", "21.01.2011" etc.
* There is no unique transaction number (the document number is a possible
candidate)
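
To make the issues above concrete, here is a minimal sketch of the kind of normalisation involved. It is written in Python purely for illustration and is not the code in the repo. The helper names (normalise_header, parse_amount, parse_date, clean_rows) and the canonical field name "document_no" are hypothetical. It also assumes that the "," in amounts is a thousands separator, and that any leading metadata lines have already been removed so that the first non-blank row is the header.

```python
import csv
import io
from datetime import datetime
from decimal import Decimal

# Hypothetical alias table: only the "SAP Document no" / "Document no"
# pair comes from the examples above; real files will need more entries.
HEADER_ALIASES = {
    "sap document no": "document_no",
    "document no": "document_no",
}

# The two date formats quoted above. "%b" matches English month
# abbreviations under the default C locale.
DATE_FORMATS = ("%d %b %Y", "%d.%m.%Y")


def normalise_header(name):
    """Lower-case a header, collapse whitespace and apply known aliases."""
    key = " ".join(name.strip().lower().split())
    return HEADER_ALIASES.get(key, key.replace(" ", "_"))


def parse_amount(text):
    """Strip "," (assumed to be a thousands separator) and return a Decimal."""
    return Decimal(text.strip().replace(",", ""))


def parse_date(text):
    """Try each known format in turn and return an ISO 8601 date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError("unrecognised date format: %r" % text)


def clean_rows(fileobj):
    """Yield one dict per data row, skipping blank lines and blank columns.

    Assumes the first non-blank row is the header.
    """
    header = None
    for row in csv.reader(fileobj):
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # blank line, or a line containing only separators
        if header is None:
            header = [normalise_header(cell) for cell in cells]
            continue
        # Columns whose header is empty are treated as blank and dropped.
        yield {h: c for h, c in zip(header, cells) if h}
```

Each row from clean_rows still holds raw strings, so parse_amount and parse_date would then be applied to whichever columns hold amounts and dates in a given file. Using Decimal rather than float avoids rounding surprises when the amounts are summed later.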

The GLA also switched from monthly reporting to period reporting (where
there are 13 periods of approx. 28 days each).

*## Progress so far*

I do have one month loaded (Jan 2013) with a nice breakdown by "Expenditure
Account":

http://openspending.org/gb-local-gla

Interestingly, after some fairly standard grants to other bodies, "Claim
Settlements<http://openspending.org/gb-local-gla/expenditure-account/542420>"
comes in as the biggest item at £2.3m.

- Data is being archived at
http://data.openspending.org/datasets/gb-local-gla/
- Clean up script<https://github.com/rgrp/dataset-gla/blob/master/scripts/process.js>

Regards,

Rufus

[1]: http://lists.okfn.org/pipermail/openspending/2013-March/001664.html