[openspending-dev] Diff for spending CSVs

Friedrich Lindenberg friedrich at pudo.org
Thu Jun 27 16:55:55 UTC 2013


This is really cool, David!

After a quick look, it looks to me like there's nothing really
spend-specific in there: have you considered pinging @onyxfish about
pushing this into csvkit? Would make a valuable contribution!

- Friedrich



On Thu, Jun 27, 2013 at 6:50 PM, David Read
<david.read at hackneyworkshop.com> wrote:

> I've written a tool to run in OpenSpending ETL for discarding the
> parts of the CSV of spending transactions that are already loaded.
> This is useful for the data.gov.uk work, where the CSV is 4 GB and is
> updated daily from source data, but only a tiny number of new or
> changed rows need loading into the OpenSpending database each day.
>
> Making this was a suggestion of Pudo's:
>
> > find out how to make diff emit only the lines that have been added and
> > use that to generate incremental spending source files.
>
> The code is in our ETL here:
> https://github.com/openspending/dpkg-uk25k/blob/master/spend_diff.py -
> feel free to put it into the core OpenSpending code if that makes
> sense.
>
> David
>
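
For anyone skimming the thread, the idea David describes (keep only the
rows of the new CSV that were not in the previously loaded one) can be
sketched roughly as below. This is a minimal illustration, not the code
in spend_diff.py: the function names, the SHA-1 fingerprinting and the
sample data are all assumptions made for the example. Storing a fixed-size
digest per old row, rather than the row itself, is one way to keep memory
use down on a multi-gigabyte file.

```python
import csv
import hashlib
import io


def row_digest(row):
    """Return a compact fingerprint of one parsed CSV row (hypothetical helper)."""
    # Join with the ASCII unit separator so ["a", "bc"] and ["ab", "c"] differ.
    joined = "\x1f".join(row)
    return hashlib.sha1(joined.encode("utf-8")).digest()


def new_or_changed_rows(old_file, new_file):
    """Yield the header of new_file, then every row not present in old_file."""
    old_reader = csv.reader(old_file)
    next(old_reader, None)  # skip the old header
    seen = {row_digest(row) for row in old_reader}

    new_reader = csv.reader(new_file)
    header = next(new_reader, None)
    if header is None:
        return  # empty new file: nothing to emit
    yield header
    for row in new_reader:
        # A changed row has a different digest, so it is emitted like a new one.
        if row_digest(row) not in seen:
            yield row


# Made-up sample data: row 1 is unchanged, row 2 changed, row 3 is new.
old_csv = io.StringIO("id,amount\n1,10\n2,20\n")
new_csv = io.StringIO("id,amount\n1,10\n2,25\n3,30\n")
incremental = list(new_or_changed_rows(old_csv, new_csv))
```

With real files you would pass handles opened with newline="" instead of
the StringIO objects and write the result out with csv.writer. Note that
this sketch only reports additions and changes; rows deleted from the
source are not detected.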