[openspending-dev] [okfn-labs] Quick and dirty analytics on large CSVs - what's the best approach?

Thu Apr 11 17:17:37 UTC 2013

Hi Rufus,

It's probably impossible to upload it through the web interface, but it
should be possible to import it into OpenSpending with some magic
incantations over SSH on the server. Then you could use OS itself :P

Also, it's not the best solution, but if you have enough RAM, R should be
able to handle it. It won't be fast, but it might be bearable, and it's
probably easier than loading it into PostgreSQL.
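
For a one-off sum over a file of this size, another route that avoids
holding everything in RAM is to stream the CSV row by row. A minimal Python
sketch, assuming hypothetical column names `supplier` and `amount` (the real
headers are not given in this thread), shown here on a tiny in-memory sample:

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal, InvalidOperation


def total_by_key(rows, key_col, amount_col):
    """Sum amount_col per distinct key_col value, one row at a time."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for row in rows:
        try:
            # Drop thousands separators before parsing the amount.
            amount = Decimal(row[amount_col].replace(",", ""))
        except (InvalidOperation, KeyError, AttributeError):
            continue  # skip rows with a missing or non-numeric amount
        if not amount.is_finite():
            continue  # skip NaN/Infinity so they cannot poison a total
        totals[row.get(key_col, "")] += amount
    return totals


def top_n(totals, n=10):
    """Return the n largest (key, total) pairs, biggest first."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Stand-in for the real file; column names are assumptions.
sample = io.StringIO(
    "supplier,amount\n"
    "Acme,100.50\n"
    "Bolt,20\n"
    "Acme,9.50\n"
    "Bolt,n/a\n"
)
totals = total_by_key(csv.DictReader(sample), "supplier", "amount")
print(top_n(totals))
```

For the real file, pass `csv.DictReader(open(path, newline=""))` instead of
the sample. Memory use then grows with the number of distinct suppliers, not
with the number of rows.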

Cheers,
Vítor Baptista.

2013/4/11 Rufus Pollock <rufus.pollock at okfn.org>

> Hi folks,
>
> I'm playing around with some largish CSV files as part of a data
> investigation for OpenSpending to look at which companies got paid the
> most by (central) government in the UK last year. (More details can be
> found in this issue:
> <https://github.com/openspending/thingstodo/issues/5>)
>
> The dataset I'm working with is the UK departmental spending data which,
> thanks to Friedrich's efforts, is already nicely ETL'd into one big
> 3.7 GB file [1].
>
> My question is: do folks have any thoughts on how best to do quick and
> dirty analytics on this? In particular, I was considering options
> like:
>
> * PostgreSQL - load, build indexes and then sum, avg etc. (already
> started on this)
> * Elastic MapReduce (AWS Hadoop)
> * Google BigQuery
>
> Let me know your thoughts!
>
> Regards,
>
> Rufus
>
> [1]: Details of the file
> https://github.com/openspending/thingstodo/issues/5#issuecomment-16222168
>
> _______________________________________________
> okfn-labs mailing list
> okfn-labs at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/okfn-labs
> Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
>