[openspending-dev] Quick and dirty analytics on large CSVs - what's the best approach?

Rufus Pollock rufus.pollock at okfn.org
Thu Apr 11 17:04:00 UTC 2013


Hi folks,

I'm playing around with some largish CSV files as part of a data
investigation for OpenSpending to look at which companies got paid the
most by (central) government in the UK last year. (More details can be
found in this issue:
<https://github.com/openspending/thingstodo/issues/5>)

The dataset I'm working with is the UK departmental spending data
which, thanks to Friedrich's efforts, is already nicely ETL'd into one
big 3.7 GB CSV file [1].

My question is: do folks have any thoughts on how best to do quick and
dirty analytics on this? In particular, I was considering options
like:

* PostgreSQL - load, build indexes, and then sum, avg etc. (already
started on this; rough sketch after this list)
* Elastic MapReduce (AWS Hadoop)
* Google BigQuery
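
To make the PostgreSQL option concrete, here is a minimal sketch in
Python with psycopg2 of the load-and-aggregate step. The database
name, the filename and the column names ("supplier", "amount") are
placeholders, and the real file has more columns, so the table
definition and the COPY column list would need to match it:

    import psycopg2

    conn = psycopg2.connect("dbname=ukspending")  # placeholder DB name
    cur = conn.cursor()

    # One flat table is enough for quick and dirty aggregation;
    # mirror the actual CSV columns here.
    cur.execute("""
        CREATE TABLE spending (
            supplier text,
            amount   numeric  -- assumes plain numeric values in the CSV
        )
    """)

    # COPY streams the whole CSV in, much faster than row-by-row INSERTs.
    with open("uk-departmental-spending.csv") as f:  # placeholder filename
        cur.copy_expert(
            "COPY spending (supplier, amount) FROM STDIN WITH CSV HEADER", f)

    # Helps per-supplier drill-downs; the GROUP BY below still scans the table.
    cur.execute("CREATE INDEX spending_supplier_idx ON spending (supplier)")
    conn.commit()

    # Which companies got paid the most?
    cur.execute("""
        SELECT supplier, sum(amount) AS total
        FROM spending
        GROUP BY supplier
        ORDER BY total DESC
        LIMIT 25
    """)
    for supplier, total in cur.fetchall():
        print("%s\t%s" % (supplier, total))

    conn.close()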

Let me know your thoughts!

Regards,

Rufus

[1]: Details of the file
https://github.com/openspending/thingstodo/issues/5#issuecomment-16222168



