[OpenSpending] Getting Greater London Authority spending into OpenSpending - an Update

Rufus Pollock rufus.pollock at okfn.org
Wed Apr 3 20:44:14 UTC 2013


Hi All,

I've been working to get Greater London Authority data into OpenSpending
(as mentioned in the mail last week [1]). The motivation is a basic
question:

*Which companies got paid the most (and for doing what)?* (OS thingstodo
issue: <https://github.com/openspending/thingstodo/issues/5>)

I wanted to share where I've got to and some of the experience so far, as I
think it can inform our wider efforts and illustrate why this is
challenging.

First off, I'm keeping the code and README for this work in a repo on
GitHub: https://github.com/rgrp/dataset-gla

*## Data Quality Issues*

This will be a familiar lament to many (more on all of this in the
readme <https://github.com/rgrp/dataset-gla#readme>).

There are 61 CSV files as of March 2013 (a list can be found in
scrape.json <https://github.com/rgrp/dataset-gla/blob/master/scrape.json>).

Unfortunately the "format" varies substantially across files (even though
they are all CSV!), which makes using this data a real pain. Some examples:

* The number of fields and their names vary across files (e.g. "SAP
Document no" vs "Document no")
* The number of blank columns and blank lines varies (some files have no
blank lines (good!), many have blank lines plus some metadata etc.)
* There is also at least one "bad" file which looks to be an Excel file
saved as CSV
* Amounts are frequently formatted with "," as a thousands separator, so
computers read them as strings rather than numbers
* Dates vary substantially in format, e.g. "16 Mar 2011", "21.01.2011" etc.
* There is no unique transaction number (possibly the document number
could serve as one)
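To make the clean-up concrete, here is a minimal Python sketch of the kind of normalisation these issues call for. It is not the code in the repo: the function names, the alias table, the canonical name "document_no" and the assumption that every file has a column called "Amount" are all hypothetical, and the two date formats are just the ones quoted above.

```python
import csv
import io
from datetime import date, datetime
from decimal import Decimal

# Hypothetical alias table: maps header variants to one canonical name.
HEADER_ALIASES = {
    "sap document no": "document_no",
    "document no": "document_no",
}

# Only the two formats quoted above; other files may well need more.
# Note that %b (month abbreviation) depends on the current locale.
DATE_FORMATS = ("%d %b %Y", "%d.%m.%Y")


def normalize_header(name):
    """Lower-case, collapse whitespace and apply the alias table."""
    key = " ".join(name.strip().lower().split())
    return HEADER_ALIASES.get(key, key.replace(" ", "_"))


def parse_amount(text):
    """Drop thousands separators so '1,234.50' becomes a number."""
    return Decimal(text.strip().replace(",", ""))


def parse_date(text):
    """Try each known format in turn and return a datetime.date."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError("unrecognised date format: %r" % text)


def read_rows(fileobj, required="amount"):
    """Yield one dict per data row, skipping blank lines, leading
    metadata rows and blank columns. The header row is assumed to be
    the first row containing a column named `required`."""
    header = None
    for row in csv.reader(fileobj):
        cells = [c.strip() for c in row]
        if not any(cells):
            continue  # blank line
        if header is None:
            names = [normalize_header(c) for c in cells]
            if required in names:
                header = names
            continue  # metadata before the header, or the header itself
        yield {k: v for k, v in zip(header, cells) if k}
```

Usage would be along the lines of `for row in read_rows(open(path, newline=""))`, followed by `parse_amount(row["amount"])` and `parse_date(...)` on whichever date column a given file has. The "bad" Excel-style file would still need handling by hand.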

They also switched from monthly reporting to period reporting (where there
are 13 periods of approx. 28 days each).
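For loading, period-based files need mapping back to calendar dates. A rough sketch of that mapping follows, with two assumptions that I have not checked against the GLA data: the financial year starts on 1 April, and every period is exactly 28 days except the last, which runs to 31 March. The function name is hypothetical.

```python
from datetime import date, timedelta


def period_range(fy_start_year, period, days=28, periods=13):
    """Return (start, end) dates for a reporting period.

    Assumes the financial year starts on 1 April and that the final
    period absorbs the leftover day(s) up to 31 March.
    """
    if not 1 <= period <= periods:
        raise ValueError("period must be between 1 and %d" % periods)
    fy_start = date(fy_start_year, 4, 1)
    start = fy_start + timedelta(days=days * (period - 1))
    if period == periods:
        end = date(fy_start_year + 1, 3, 31)
    else:
        end = start + timedelta(days=days - 1)
    return start, end
```

If the real period boundaries are published somewhere, a lookup table built from those would be safer than this arithmetic.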

*## Progress so far*

I do have one month loaded (Jan 2013) with a nice breakdown by "Expenditure
Account":


Because of the data wrangling issues, I have not yet got all the data
loaded. What I have done is:

- Archived all the data here (in case it gets moved)
-

[1]: http://lists.okfn.org/pipermail/openspending/2013-March/001664.html