[okfn-labs] Quick and dirty analytics on large CSVs - what's the best approach?

David Raznick david.raznick at okfn.org
Fri Apr 12 08:53:13 UTC 2013


Hello,

PostgreSQL for me on this one. If you can guess at, or already have, the column
definitions, use COPY.

http://www.postgresql.org/docs/9.2/static/sql-copy.html

It's *much* faster than doing inserts.
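For example, a minimal sketch (the table name, column names and file path
below are made up for illustration; match them to the actual CSV):

    -- hypothetical table; guess generous types (text/numeric) if unsure
    CREATE TABLE spending (
        dept         text,
        supplier     text,
        expense_type text,
        amount       numeric,
        paid_date    date
    );

    -- bulk-load the CSV in a single pass
    COPY spending FROM '/path/to/spending.csv'
        WITH (FORMAT csv, HEADER true);

(Server-side COPY needs file access on the database server; psql's \copy does
the same thing from the client.)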

When exploring anything around that size, if you know that most of your
columns have repeated values, I normally do something like:

select field1, field2, ..., fieldm,
       sum(value1) as sum_value1, count(value1) as count_value1, ...
into agg_table
from table
group by field1, field2, ..., fieldm;

i.e. make a cheap aggregated version of some kind (a basic cube of sorts)
to lower the row count, hopefully dramatically. There is no point indexing
before doing this, as the query has to scan all the rows anyway.
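Concretely, with the hypothetical columns from the COPY sketch above, that
might look like:

    -- roll the raw rows up to one row per department/supplier pair;
    -- no index needed, since this is a single sequential scan anyway
    select dept,
           supplier,
           sum(amount)   as sum_amount,
           count(amount) as count_amount
    into agg_spending
    from spending
    group by dept, supplier;

SELECT ... INTO creates agg_spending as a new table (CREATE TABLE
agg_spending AS ... is equivalent).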

The new aggregate table is much nicer to query, to put into a pivot
table, or to use in something like pandas or R.

David





On Fri, Apr 12, 2013 at 1:39 AM, Stefan Wehrmeyer <stefan.wehrmeyer at okfn.org
> wrote:

> Hi Rufus,
>
> a (Python) script that streams the file through should run in constant
> space and linear time, which should be good enough for offline analytics
> purposes (and for small enough file sizes).
>
> If you need it faster, run it in parallel. Hadoop (for sure) and BigQuery
> (I assume) also come with their own engineering overhead though. Hadoop is
> more difficult to build, set up and deploy (even on EC2) for quick-and-dirty work.
>
> If you are really only ever going to run it once, Postgres seems like too
> much of a hassle and is also not that fast (index building takes time).
>
> To keep it quick and dirty, a streaming M/R-style Python script is the way
> to go, maybe coupled with an in-memory datastore like Redis.
> If it gets too slow for one machine, split it up with e.g.
> http://www.picloud.com/
>
> Cheers
> Stefan
>
> On 11.04.2013, at 19:04 , Rufus Pollock <rufus.pollock at okfn.org> wrote:
>
> > Hi folks,
> >
> > I'm playing around with some largish CSV files as part of a data
> > investigation for OpenSpending to look at which companies got paid the
> > most by (central) government in the UK last year. (More details can be
> > found in this issue:
> > <https://github.com/openspending/thingstodo/issues/5>)
> >
> > The dataset I'm working with is the UK departmental spending data which,
> > thanks to Friedrich's efforts, is already nicely ETL'd into one big
> > 3.7 GB file [1].
> >
> > My question is do folks have any thoughts on how best to do quick and
> > dirty analytics on this? In particular, I was considering options
> > like:
> >
> > * Postgresql - load, build indexes and then sum, avg etc (already
> > started on this)
> > * Elastic MapReduce (AWS Hadoop)
> > * Google BigQuery
> >
> > Let me know your thoughts!
> >
> > Regards,
> >
> > Rufus
> >
> > [1]: Details of the file
> >
> https://github.com/openspending/thingstodo/issues/5#issuecomment-16222168
> >
>

