[okfn-labs] Quick and dirty analytics on large CSVs - what's the best approach?

Rufus Pollock rufus.pollock at okfn.org
Fri Apr 12 10:47:08 UTC 2013


Just want to say a thanks to all the folks with suggestions so far -
this has been really useful. Given my existing experience I think I'll
be trying out the Postgres and simple Python routes to start with :-)

I've now booted a google doc for this "data investigation" and will
try and provide some updates on different approaches there:

https://docs.google.com/a/okfn.org/document/d/1LOWlROCSBEP2VmR8NsNum0zL7t8-EN2g9QjSfC5f7cM/edit#

I should also say that collaborators are obviously very welcome! (And
if you do try something out please just drop a link in the doc!)

Rufus

On 12 April 2013 09:53, David Raznick <david.raznick at okfn.org> wrote:
> Hello,
>
> Postgresql for me on this one. If you can guess or have the column
> definitions, use COPY.
>
> http://www.postgresql.org/docs/9.2/static/sql-copy.html
>
> It's *much* faster than doing inserts.
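For reference, a bulk load with COPY might look something like the following (the table name and file path are placeholders; the target table must already exist with columns matching the CSV):

```sql
-- Assumes a pre-created table "spending" whose columns match the CSV.
-- Server-side COPY needs superuser; from psql, \copy reads the file client-side instead.
COPY spending FROM '/path/to/spending.csv' WITH (FORMAT csv, HEADER true);
```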
>
> When exploring anything of about that size, if you know most of your
> columns have repeated values, I normally do something like:
>
> select field1, field2 ... fieldm, sum(value1) sum_value1, count(value1)
> count_value1 ... into agg_table from table group by field1, field2 ... fieldm
>
> i.e. make a cheap aggregated version of some kind (a basic cube of sorts), to
> lower the row count, hopefully dramatically. There is no point in indexing
> before doing this, as it has to go through all the rows anyway.
>
> The new aggregate table is much nicer to query, to stuff into a pivot
> table, or to use in something like pandas or R.
>
> David
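The same cheap-aggregate idea (sum and count per group, in one pass) can be sketched in plain Python with the stdlib csv module; the column names here (dept, supplier, amount) are made up for illustration, not taken from the actual file:

```python
import csv
import io
from collections import defaultdict

def aggregate(lines, group_cols, value_col):
    """One streaming pass over CSV lines: sum and count of value_col per group key."""
    agg = defaultdict(lambda: [0.0, 0])  # key tuple -> [sum, count]
    for row in csv.DictReader(lines):
        key = tuple(row[c] for c in group_cols)
        cell = agg[key]
        cell[0] += float(row[value_col])
        cell[1] += 1
    return dict(agg)

# A tiny inline sample standing in for the 3.7 GB file
sample = io.StringIO(
    "dept,supplier,amount\n"
    "HMRC,Acme,100\n"
    "HMRC,Acme,50\n"
    "DfT,Widget Co,200\n"
)
totals = aggregate(sample, ["dept", "supplier"], "amount")
print(totals[("HMRC", "Acme")])  # [150.0, 2]
```

Because it streams row by row, memory use is bounded by the number of distinct group keys, not the number of rows.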
>
>
>
>
>
> On Fri, Apr 12, 2013 at 1:39 AM, Stefan Wehrmeyer
> <stefan.wehrmeyer at okfn.org> wrote:
>>
>> Hi Rufus,
>>
>> a (Python) script that streams the file through should run in constant
>> space and linear time, which should be good enough for offline analytics
>> purposes (and small enough sizes).
>>
>> If you need it faster, run it in parallel. Hadoop (for sure) and BigQuery
>> (I assume) also come with their own engineering overhead though. Hadoop is
>> more difficult to build, set up and deploy (even on EC2) for quick-and-dirty work.
>>
>> If you're really only going to run it once, Postgres seems too much of a
>> hassle and is also not that fast (index building takes time).
>>
>> To keep it quick and dirty, a streaming M/R-style Python script is the way
>> to go, maybe coupled with an in-memory datastore like Redis.
>> If it gets too slow for one machine, split it up with e.g.
>> http://www.picloud.com/
>>
>> Cheers
>> Stefan
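The M/R style mentioned above can be sketched as a map over chunks of the file producing partial totals, merged by a reduce step; the chunking and column layout below are invented for illustration, and the sequential `map()` could be swapped for `multiprocessing.Pool.map` (or a service like PiCloud) to parallelise:

```python
import csv
from collections import Counter
from functools import reduce

def map_chunk(lines):
    """Map step: turn a chunk of CSV lines into partial per-supplier totals."""
    partial = Counter()
    for row in csv.DictReader(lines, fieldnames=["dept", "supplier", "amount"]):
        partial[row["supplier"]] += float(row["amount"])
    return partial

def merge(a, b):
    """Reduce step: combine two partial results (Counter.update adds values)."""
    a.update(b)
    return a

# Chunks stand in for splits of the big file, processed independently
chunks = [
    ["HMRC,Acme,100", "DfT,Widget Co,200"],
    ["HMRC,Acme,50"],
]
totals = reduce(merge, map(map_chunk, chunks), Counter())
print(totals["Acme"])  # 150.0
```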
>>
>> On 11.04.2013, at 19:04 , Rufus Pollock <rufus.pollock at okfn.org> wrote:
>>
>> > Hi folks,
>> >
>> > I'm playing around with some largish CSV files as part of a data
>> > investigation for OpenSpending to look at which companies got paid the
>> > most by (central) government in the UK last year. (More details can be
>> > found in this issue:
>> > <https://github.com/openspending/thingstodo/issues/5>)
>> >
>> > The dataset I'm working with is the UK departmental spending which,
>> > thanks to Friedrich's efforts, is already nicely ETL'd into one big
>> > 3.7 GB file [1].
>> >
>> > My question is do folks have any thoughts on how best to do quick and
>> > dirty analytics on this? In particular, I was considering options
>> > like:
>> >
>> > * Postgresql - load, build indexes and then sum, avg etc (already
>> > started on this)
>> > * Elastic MapReduce (AWS Hadoop)
>> > * Google BigQuery
>> >
>> > Let me know your thoughts!
>> >
>> > Regards,
>> >
>> > Rufus
>> >
>> > [1]: Details of the file
>> >
>> > https://github.com/openspending/thingstodo/issues/5#issuecomment-16222168
>> >
>> > _______________________________________________
>> > okfn-labs mailing list
>> > okfn-labs at lists.okfn.org
>> > http://lists.okfn.org/mailman/listinfo/okfn-labs
>> > Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
>>
>>
>
>



