[openspending-dev] Quick and dirty analytics on large CSVs - what's the best approach?
Rufus Pollock
rufus.pollock at okfn.org
Thu Apr 11 17:04:00 UTC 2013
Hi folks,
I'm playing around with some largish CSV files as part of a data
investigation for OpenSpending to look at which companies got paid the
most by (central) government in the UK last year. (More details can be
found in this issue:
<https://github.com/openspending/thingstodo/issues/5>)
The dataset I'm working with is the UK departmental spending which,
thanks to Friedrich's efforts, is already nicely ETL'd into one big
3.7 GB file [1].
My question is: do folks have any thoughts on how best to do quick
and dirty analytics on this? In particular, I was considering options
like:
* PostgreSQL - load, build indexes and then sum, avg etc (already
started on this; rough sketch below)
* Elastic MapReduce (AWS Hadoop)
* Google BigQuery
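
For the PostgreSQL option, here's a minimal sketch (Python with
psycopg2) of what I mean by load/index/aggregate. The table and column
names (spend, supplier, amount) and the filename are placeholders -
the real column list would have to match the header of Friedrich's
file:

    # Load the CSV with COPY, index, then aggregate. Table/column
    # names are placeholders - adjust to the actual header.
    import psycopg2

    conn = psycopg2.connect("dbname=openspending")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS spend (
            supplier   text,
            department text,
            amount     numeric
        )
    """)

    # COPY is far faster than row-by-row INSERTs for a 3.7 GB file.
    with open("uk_departmental_spend.csv") as f:
        cur.copy_expert(
            "COPY spend FROM STDIN WITH (FORMAT csv, HEADER true)", f)
    conn.commit()

    # Build the index after loading rather than before; it helps
    # point lookups on a supplier, though a full GROUP BY will still
    # scan the whole table.
    cur.execute("CREATE INDEX idx_spend_supplier ON spend (supplier)")
    conn.commit()

    # Top 20 suppliers by total amount paid.
    cur.execute("""
        SELECT supplier, sum(amount) AS total
        FROM spend
        GROUP BY supplier
        ORDER BY total DESC
        LIMIT 20
    """)
    for supplier, total in cur.fetchall():
        print(supplier, total)
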
Let me know your thoughts!
Regards,
Rufus
[1]: Details of the file
https://github.com/openspending/thingstodo/issues/5#issuecomment-16222168