[openspending-dev] [okfn-labs] Quick and dirty analytics on large CSVs - what's the best approach?

Martin Keegan martin.keegan at okfn.org
Tue Apr 16 01:40:24 UTC 2013


On Fri, Apr 12, 2013 at 11:47 AM, Rufus Pollock <rufus.pollock at okfn.org> wrote:
> Just want to say a thanks to all the folks with suggestions so far -
> this has been really useful. Given my existing experience I think I
> will be trying out the Postgres and simple python routes to start with
> I think :-)

Once the data is too large to fit in RAM, a lot of options become
unavailable (including R, and Python's CSV module if writing files
under Windows).
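
For the "simple python" route, a single streaming pass with the csv
module is often all you need, since only the running aggregates have to
fit in memory. A minimal sketch, assuming a hypothetical spending.csv
with "payee" and "amount" columns:

import csv
from collections import defaultdict

totals = defaultdict(float)
with open("spending.csv", newline="") as f:
    for row in csv.DictReader(f):
        # column names are assumptions; adjust to your header
        totals[row["payee"]] += float(row["amount"] or 0)

# ten biggest payees by total spend
for payee, total in sorted(totals.items(), key=lambda kv: -kv[1])[:10]:
    print(payee, total)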

For datasets with, say, over 100 million rows, if you're using SQL you
need to be looking at the bulk-load code paths (LOAD DATA in MySQL,
COPY in Postgres). MySQL lets you do arbitrary transformations on the
data as you load it in, but forces you to use an actual file on disk;
Postgres can stream data into the bulk loader, so you can transform it
on the fly.
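
To make the Postgres side concrete, here's a minimal sketch using
psycopg2's copy_expert; the database name ("spending_db"), the table
("spending"), its column layout, and the currency clean-up are all
assumptions, and the rows go through COPY in bounded-memory batches
rather than via an intermediate file:

import csv
import io
import psycopg2

BATCH = 100000  # rows per COPY round trip; keeps memory use bounded

def batches(path):
    """Yield StringIO buffers of cleaned-up CSV rows, BATCH rows at a time."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        for n, row in enumerate(reader, 1):
            # hypothetical clean-up: strip currency formatting from column 2
            row[1] = row[1].replace("$", "").replace(",", "")
            writer.writerow(row)
            if n % BATCH == 0:
                buf.seek(0)
                yield buf
                buf = io.StringIO()
                writer = csv.writer(buf, lineterminator="\n")
        buf.seek(0)
        yield buf

conn = psycopg2.connect("dbname=spending_db")
with conn, conn.cursor() as cur:
    for buf in batches("spending.csv"):
        cur.copy_expert("COPY spending FROM STDIN WITH (FORMAT csv)", buf)
conn.close()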

You may be better off using MySQL if you're not sure how well-formed
your data is (and of course checking this means the same full pass over
the data that the bulk load itself requires): MySQL treats a lot of
badly-formed data as warnings rather than errors, whereas Postgres will
abort the bulk load at the first error, which may not be what you want.
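
For comparison, a minimal sketch of the MySQL side with the MySQLdb
driver; the database, table, column order and clean-up are assumptions,
and LOCAL INFILE has to be enabled on both client and server. LOAD DATA
rewrites the amount column as it loads and reports malformed rows as
warnings you can inspect afterwards:

import MySQLdb

conn = MySQLdb.connect(db="spending", local_infile=1)
cur = conn.cursor()

# LOAD DATA wants a real file, but lets you rewrite columns on the way in;
# badly-formed values become warnings rather than a fatal error
cur.execute("""
    LOAD DATA LOCAL INFILE 'spending.csv'
    INTO TABLE spending
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    IGNORE 1 LINES
    (date, @amount, payee)
    SET amount = REPLACE(REPLACE(@amount, '$', ''), ',', '')
""")

cur.execute("SHOW WARNINGS")
for level, code, message in cur.fetchall():
    print(level, code, message)

conn.commit()
conn.close()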

A lot of the motivation for CSVDDF and csv2sql came from my having to
deal with very large badly formatted US Census data files last year.

Mk



