[okfn-labs] Quick and dirty analytics on large CSVs - what's the best approach?

Friedrich Lindenberg friedrich.lindenberg at okfn.org
Thu Apr 11 17:42:49 UTC 2013


Hey,

So technically the Postgres we have for the uk25k ETL is absolutely fine for
doing analysis on this. But for the nerdy hours, I'd recommend Apache Pig:
it's really ugly (thus the name, I presume) and runs off S3 buckets in EMR,
but I like the semi-SQL processing language it has:
http://wiki.apache.org/pig/PigLatin

Would be cool to do a few scripts with this...
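
For comparison, here is a minimal Python sketch of the kind of query under
discussion (total paid per supplier, then the top few), streaming the CSV so
the whole file never has to sit in memory. The column names `supplier_name`
and `amount` and the function name `top_suppliers` are assumptions for
illustration, not taken from the actual file, so adjust them to the real
header. Rows with a missing or unparseable amount are simply skipped.

```python
import csv
import heapq
from collections import defaultdict
from decimal import Decimal, InvalidOperation


def top_suppliers(rows, supplier_col="supplier_name", amount_col="amount", n=10):
    """Return the n suppliers with the largest summed amount.

    `rows` is any iterable of CSV lines (e.g. an open file). The column
    names are hypothetical defaults and must match the real header.
    """
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for row in csv.DictReader(rows):
        supplier = row.get(supplier_col)
        raw = row.get(amount_col)
        if not supplier or raw is None:
            continue  # skip rows missing either field
        try:
            # Strip thousands separators such as "1,000" before parsing.
            amount = Decimal(raw.replace(",", "").strip())
        except InvalidOperation:
            continue  # skip amounts that are not numbers
        totals[supplier] += amount
    # Only the per-supplier totals are kept in memory, not the rows.
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])
```

Usage would look roughly like this, with `spending.csv` standing in for
wherever the file actually lives:

```python
with open("spending.csv", newline="", encoding="utf-8") as f:
    for supplier, total in top_suppliers(f, n=20):
        print(supplier, total)
```

This is only good for one-off sums; as soon as you want several different
cuts of the data, loading it into Postgres once is probably less work.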

- Friedrich


On Thu, Apr 11, 2013 at 7:04 PM, Rufus Pollock <rufus.pollock at okfn.org> wrote:

> Hi folks,
>
> I'm playing around with some largish CSV files as part of a data
> investigation for OpenSpending to look at which companies got paid the
> most by (central) government in the UK last year. (More details can be
> found in this issue:
> <https://github.com/openspending/thingstodo/issues/5>)
>
> The dataset I'm working with is the UK departmental spending which,
> thanks to Friedrich's efforts, is already nicely ETL'd into one big
> 3.7 GB file [1].
>
> My question: do folks have any thoughts on how best to do quick and
> dirty analytics on this? In particular, I was considering options
> like:
>
> * PostgreSQL - load, build indexes and then sum, avg etc. (already
> started on this)
> * Elastic MapReduce (AWS Hadoop)
> * Google BigQuery
>
> Let me know your thoughts!
>
> Regards,
>
> Rufus
>
> [1]: Details of the file
> https://github.com/openspending/thingstodo/issues/5#issuecomment-16222168