[okfn-labs] Quick and dirty analytics on large CSVs - what's the best approach?

Rufus Pollock rufus.pollock at okfn.org
Thu Apr 11 17:44:32 UTC 2013


Hi Vitor,

Good suggestion though it is already there ;-)

http://openspending.org/ukgov-25k-spending/

The point here is that I want to play around with the data in a quick and
dirty way. I should emphasize that the question I provided was an example,
and not the only question I'd want to ask ;-) Think of this data as a
generic example for playing around with CSVs of this size.
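
As a rough illustration of that kind of quick and dirty query, here is a
minimal sketch that streams a CSV row by row and totals the amount paid per
supplier, so the whole file never has to fit in RAM. The column names
"supplier" and "amount" and both function names are assumptions made for the
example; they are not taken from the actual file.

```python
import csv
from collections import defaultdict
from decimal import Decimal, InvalidOperation


def top_suppliers(rows, n=10, name_col="supplier", amount_col="amount"):
    """Sum amounts per supplier one row at a time and return the n largest.

    name_col and amount_col are hypothetical column names; adjust them to
    whatever headers the real file uses.
    """
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for row in rows:
        try:
            # Parse the amount first so a bad row never creates an entry.
            amount = Decimal(row[amount_col].replace(",", ""))
            name = row[name_col]
        except (InvalidOperation, KeyError, AttributeError):
            continue  # skip rows with a missing or unparseable amount
        totals[name] += amount
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


def top_suppliers_from_file(path, **kwargs):
    """Run top_suppliers over a CSV file, reading it lazily."""
    with open(path, newline="", encoding="utf-8") as f:
        return top_suppliers(csv.DictReader(f), **kwargs)
```

Memory use grows with the number of distinct suppliers rather than with the
number of rows, which is what makes a single pass over a multi-gigabyte file
feasible. It is still one full scan per question, so for repeated queries a
database with indexes would likely pay off.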

I've also put up a quick post
http://okfnlabs.org/blog/2013/04/11/quick-and-dirty-analysis-on-large-csv.html
so people can more easily add comments ...

Rufus

On 11 April 2013 18:17, Vitor Baptista <vitor at vitorbaptista.com> wrote:
> Hi Rufus,
>
> It's probably impossible to upload it through the web interface, but it
> should be possible to import it into OpenSpending with some magic incantations
> over SSH on the machine. Then you could use OS itself :P
>
> Also, it's not the best solution, but if you have enough RAM, R should be
> able to handle it. It won't be fast, but it might be bearable, and is probably
> easier than loading it into Postgres.
>
> Cheers,
> Vítor Baptista.
>
> 2013/4/11 Rufus Pollock <rufus.pollock at okfn.org>
>>
>> Hi folks,
>>
>> I'm playing around with some largish CSV files as part of a data
>> investigation for OpenSpending to look at which companies got paid the
>> most by (central) government in the UK last year. (More details can be
>> found in this issue:
>> <https://github.com/openspending/thingstodo/issues/5>)
>>
>> The dataset I'm working with is the UK departmental spending which,
>> thanks to Friedrich's efforts, is already nicely ETL'd into one big
>> 3.7 GB file [1].
>>
>> My question is: do folks have any thoughts on how best to do quick and
>> dirty analytics on this? In particular, I was considering options
>> like:
>>
>> * PostgreSQL - load, build indexes and then sum, avg etc. (already
>> started on this)
>> * Elastic MapReduce (AWS Hadoop)
>> * Google BigQuery
>>
>> Let me know your thoughts!
>>
>> Regards,
>>
>> Rufus
>>
>> [1]: Details of the file
>> https://github.com/openspending/thingstodo/issues/5#issuecomment-16222168
>>
>> _______________________________________________
>> okfn-labs mailing list
>> okfn-labs at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/okfn-labs
>> Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
>
>



