[openspending-dev] [okfn-labs] Quick and dirty analytics on large CSVs - what's the best approach?
rufus.pollock at okfn.org
Fri Apr 12 10:47:08 UTC 2013
Just want to say thanks to all the folks with suggestions so far -
this has been really useful. Given my existing experience I think I
will try out the Postgres and simple Python routes to start with :-)
I've now set up a Google Doc for this "data investigation" and will
try to post some updates on the different approaches there:
I should also say that collaborators are obviously very welcome! (And
if you do try something out please just drop a link in the doc!)
On 12 April 2013 09:53, David Raznick <david.raznick at okfn.org> wrote:
> PostgreSQL for me on this one. If you can guess or have the column
> definitions, use COPY; it's *much* faster than doing inserts.
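> A rough sketch of the load step (psycopg2 here, and the table/column
> definitions are just guesses to adapt to the real CSV header):
>
> import psycopg2
>
> conn = psycopg2.connect("dbname=spending")
> cur = conn.cursor()
> # columns guessed from the CSV header; loose types are fine for exploration
> cur.execute("CREATE TABLE spending (dept text, supplier text, date text, value numeric)")
> # COPY streams the whole file in one shot instead of row-by-row INSERTs
> with open("spending.csv") as f:
>     cur.copy_expert("COPY spending FROM STDIN WITH CSV HEADER", f)
> conn.commit()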
> When doing exploration on anything around that size, where you know most of
> your columns have repeated values, I normally do something like:
> SELECT field1, field2, ..., fieldm,
>        sum(value1) AS sum_value1, count(value1) AS count_value1, ...
> INTO agg_table FROM table GROUP BY field1, field2, ..., fieldm;
> i.e. make a cheap aggregated version of some kind (a basic cube of sorts) to
> lower the row count, hopefully dramatically. There is no point in indexing
> before doing this, as it has to go through all the rows anyway.
> The new aggregate table is much nicer to query, to stuff into a pivot
> table, or to use in something like pandas or R.
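> Pulling the aggregate back out into pandas is then a one-liner (sketch only;
> assumes SQLAlchemy and the agg_table above, connection string is a placeholder):
>
> import pandas as pd
> from sqlalchemy import create_engine
>
> engine = create_engine("postgresql:///spending")
> agg = pd.read_sql("SELECT * FROM agg_table", engine)
> # e.g. top 20 values of field1 by summed value1
> print(agg.groupby("field1")["sum_value1"].sum().sort_values(ascending=False).head(20))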
> On Fri, Apr 12, 2013 at 1:39 AM, Stefan Wehrmeyer
> <stefan.wehrmeyer at okfn.org> wrote:
>> Hi Rufus,
>> a (Python) script that streams the file through should run in constant
>> space and linear time, which should be good enough for offline analytics
>> purposes (and small enough sizes).
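>> Something along these lines would do it (untested sketch; the file and
>> column names are placeholders to adapt):
>>
>> import csv
>> from collections import defaultdict
>>
>> totals = defaultdict(float)
>> with open("spending.csv", newline="") as f:
>>     for row in csv.DictReader(f):
>>         # one counter per supplier: memory stays bounded as long as the
>>         # number of distinct suppliers does
>>         totals[row["supplier"]] += float(row["amount"] or 0)
>>
>> # top 20 suppliers by total spend
>> for supplier, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:20]:
>>     print(supplier, total)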
>> If you need it faster, run it in parallel. Hadoop (for sure) and BigQuery
>> (I assume) also come with their own engineering overhead though. Hadoop is
>> more difficult to build, set up and deploy (even on EC2) for quick-and-dirty work.
>> If you're really only going to run it once, Postgres seems too much of a
>> hassle and is also not that fast (index building takes time).
>> To keep it quick and dirty, a streaming M/R-style Python script is the way
>> to go, maybe coupled with an in-memory datastore like Redis.
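>> With Redis the counters live outside the worker process, so several workers
>> can chew through chunks of the file at once (rough sketch; redis-py and the
>> column names are assumptions):
>>
>> import csv
>> import redis
>>
>> r = redis.Redis()
>> with open("spending_chunk_00.csv", newline="") as f:
>>     for row in csv.DictReader(f):
>>         # INCRBYFLOAT is atomic, so concurrent workers can't clobber each other
>>         r.incrbyfloat("total:" + row["supplier"], float(row["amount"] or 0))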
>> If it gets too slow for one machine, split it up with e.g.
>> On 11.04.2013, at 19:04 , Rufus Pollock <rufus.pollock at okfn.org> wrote:
>> > Hi folks,
>> > I'm playing around with some largish CSV files as part of a data
>> > investigation for OpenSpending to look at which companies got paid the
>> > most by (central) government in the UK last year. (More details can be
>> > found in this issue:
>> > <https://github.com/openspending/thingstodo/issues/5>)
>> > The dataset I'm working with is the UK departmental spending which,
>> > thanks to Friedrich's efforts, is already nicely ETL'd into one big
>> > 3.7 GB file.
>> > My question is do folks have any thoughts on how best to do quick and
>> > dirty analytics on this? In particular, I was considering options
>> > like:
>> > * Postgresql - load, build indexes and then sum, avg etc (already
>> > started on this)
>> > * Elastic MapReduce (AWS Hadoop)
>> > * Google BigQuery
>> > Let me know your thoughts!
>> > Regards,
>> > Rufus
>> > Details of the file:
>> > https://github.com/openspending/thingstodo/issues/5#issuecomment-16222168