[openspending-dev] [okfn-labs] Quick and dirty analytics on large CSVs - what's the best approach?
rufus.pollock at okfn.org
Thu Apr 11 17:44:32 UTC 2013
Good suggestion, though it is already there ;-)
The point here is that I want to play around with the data in a quick and
dirty way (I should emphasize that the question I provided was just an
example, and not the only question I'd want to ask ;-). Think of this data
as a generic example for playing around with CSVs of this size.
I've also put up a quick post
so people can more easily add comments ...
On 11 April 2013 18:17, Vitor Baptista <vitor at vitorbaptista.com> wrote:
> Hi Rufus,
> It's probably impossible to upload it through the web interface, but it
> should be possible to import it into OpenSpending with some magic
> incantations over SSH on the machine. Then you could use OS itself :P
> Also, it's not the best solution, but if you have enough RAM, R should be
> able to handle it. It won't be fast, but it might be bearable, and probably
> easier than loading it into Postgres.
> Vítor Baptista.
> 2013/4/11 Rufus Pollock <rufus.pollock at okfn.org>
>> Hi folks,
>> I'm playing around with some largish CSV files as part of a data
>> investigation for OpenSpending to look at which companies got paid the
>> most by (central) government in the UK last year. (More details can be
>> found in this issue:
>> The dataset I'm working with is the UK departmental spending which,
>> thanks to Friedrich's efforts, is already nicely ETL'd into one big
>> 3.7 GB file.
>> My question is: do folks have any thoughts on how best to do quick and
>> dirty analytics on this? In particular, I was considering these options:
>> * PostgreSQL - load, build indexes, and then sum, avg, etc. (already
>> started on this)
>> * Elastic MapReduce (AWS Hadoop)
>> * Google BigQuery
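(A fourth quick-and-dirty option worth comparing against the three above: a single streaming pass in plain Python, which sums a few-GB CSV on one machine without loading it all into RAM. This is only a sketch; the `supplier` and `amount` column names are assumptions for illustration, not the real file's schema.)

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows standing in for the real ~3.7 GB file;
# with the real file you would use open(path, newline="") instead.
sample = io.StringIO(
    "supplier,amount\n"
    "Acme Ltd,1000.00\n"
    "Widgets plc,250.50\n"
    "Acme Ltd,499.50\n"
)

# Stream the file row by row, accumulating total spend per supplier.
totals = defaultdict(float)
for row in csv.DictReader(sample):
    totals[row["supplier"]] += float(row["amount"])

# Rank payees by total spend, largest first.
top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(top)
```

Memory use stays proportional to the number of distinct suppliers rather than the file size, which is usually fine for a one-off sum/avg question like this one.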
>> Let me know your thoughts!
>> okfn-labs mailing list
>> okfn-labs at lists.okfn.org
>> Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
More information about the openspending-dev mailing list