[okfn-labs] Quick and dirty analytics on large CSVs - what's the best approach?

Tom Morris tfmorris at gmail.com
Thu Apr 11 17:44:00 UTC 2013


So 60-odd columns by how many rows?  It appears to be a quoted CSV with
embedded commas which is a little unfortunate because if you had a unique
delimiter, my recommendation would be to start with:

zcat spending-latest.tsv.gz | cut -f 62 | sort | uniq -c | sort -r -n | head

and poke around to what are high frequency items, etc. (zcat because unless
you've got a very beefy I/O subsystem you're going to be I/O constrained
and you're better off paying the CPU penalty to decompress).

3.7 GB is microscopic in the BigQuery scheme of things, but it would allow
for extremely fast iterative analysis.

Hadoop isn't going to be quick and dirty at all.  I'd avoid it for a task
this small.

With sufficient memory, you should also be able to use Python or R to do a
lot of work, particularly if you're only working with a few columns at a
time and discarding the rest on input.

Tom



On Thu, Apr 11, 2013 at 1:04 PM, Rufus Pollock <rufus.pollock at okfn.org>wrote:

> Hi folks,
>
> I'm playing around with some largish CSV files as part of a data
> investigation for OpenSpending to look at which companies got paid the
> most by (central) government in the UK last year. (More details can be
> found in this issue:
> <https://github.com/openspending/thingstodo/issues/5>)
>
> The dataset i'm working with is the UK departmental spending which,
> thanks to Friedrich's efforts, is already nicely ETL'd into one big
> 3.7 Gb file [1].
>
> My question is do folks have any thoughts on how best to do quick and
> dirty analytics on this? In particular, I was considering options
> like:
>
> * Postgresql - load, build indexes and then sum, avg etc (already
> started on this)
> * Elastic MapReduce (AWS Hadoop)
> * Google BigQuery
>
> Let me know your thoughts!
>
> Regards,
>
> Rufus
>
> [1]: Details of the file
> https://github.com/openspending/thingstodo/issues/5#issuecomment-16222168
>
> _______________________________________________
> okfn-labs mailing list
> okfn-labs at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/okfn-labs
> Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.okfn.org/pipermail/okfn-labs/attachments/20130411/5b39c92d/attachment-0002.html>


More information about the okfn-labs mailing list