[okfn-labs] Quick and dirty analytics on large CSVs - what's the best approach?

Emanuil Tolev emanuil at cottagelabs.com
Thu Apr 11 18:35:18 UTC 2013


Would OpenRefine ( http://openrefine.org/ ) be of any use at all? I've
used it on tens of megabytes, but certainly not thousands... it will also
load the whole thing into memory, I believe, and you need to have the file
locally.
On the other hand, it provides a nice UI for faceting and mass-editing, and
has always been blazingly fast at applying most of its operations, even
similar-string matching via edit distance or more complicated algorithms
(e.g. when trying to find misspellings of strings, such as organisation
names). It'd be a nice test to see how it scales.
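The edit-distance matching mentioned above can be sketched in plain Python with the stdlib difflib module - a rough stand-in for OpenRefine's clustering, not what OpenRefine actually runs, and the organisation names here are invented for illustration:

```python
import difflib

# Invented organisation names, including a likely misspelling (made-up data).
names = ["Acme Ltd", "Acme Ltd.", "Acme Limited", "Widgets plc", "Widgts plc"]

def close_matches(name, candidates, cutoff=0.8):
    """Return candidates whose similarity ratio to `name` is >= cutoff."""
    return difflib.get_close_matches(name, candidates, n=5, cutoff=cutoff)

# "Widgts plc" matches "Widgets plc" with a high similarity ratio,
# so it surfaces as a probable misspelling.
others = [n for n in names if n != "Widgts plc"]
print(close_matches("Widgts plc", others))  # ['Widgets plc']
```

OpenRefine's clustering offers fancier algorithms (key collision, n-gram fingerprints), but this is the same basic idea: group strings that are close under some similarity measure.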

Greetings,
Emanuil


On 11 April 2013 18:44, Rufus Pollock <rufus.pollock at okfn.org> wrote:

> Hi Vitor,
>
> Good suggestion though it is already there ;-)
>
> http://openspending.org/ukgov-25k-spending/
>
> The point here is I want to play around with the data in a quick and
> dirty way (I should emphasize that the question I provided was an example,
> and not the only question I'd want to ask ;-). Think of this data as a
> generic example for playing around with CSVs of this size.
>
> I've also put up a quick post
>
> http://okfnlabs.org/blog/2013/04/11/quick-and-dirty-analysis-on-large-csv.html
> so people can more easily add comments ...
>
> Rufus
>
> On 11 April 2013 18:17, Vitor Baptista <vitor at vitorbaptista.com> wrote:
> > Hi Rufus,
> >
> > It's probably impossible to upload it through the web interface, but it
> > should be possible to import it into OpenSpending with some magic
> > incantations over SSH on the machine. Then you could use OS itself :P
> >
> > Also, it's not the best solution, but if you have enough RAM, R should be
> > able to handle it. It won't be fast, but it might be bearable, and probably
> > easier than loading it into Postgres.
> >
> > Cheers,
> > Vítor Baptista.
> >
> > 2013/4/11 Rufus Pollock <rufus.pollock at okfn.org>
> >>
> >> Hi folks,
> >>
> >> I'm playing around with some largish CSV files as part of a data
> >> investigation for OpenSpending to look at which companies got paid the
> >> most by (central) government in the UK last year. (More details can be
> >> found in this issue:
> >> <https://github.com/openspending/thingstodo/issues/5>)
> >>
> >> The dataset I'm working with is the UK departmental spending which,
> >> thanks to Friedrich's efforts, is already nicely ETL'd into one big
> >> 3.7 GB file [1].
> >>
> >> My question is: do folks have any thoughts on how best to do quick and
> >> dirty analytics on this? In particular, I was considering options
> >> like:
> >>
> >> * PostgreSQL - load, build indexes, and then sum, avg, etc. (already
> >> started on this)
> >> * Elastic MapReduce (AWS Hadoop)
> >> * Google BigQuery
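The load-index-aggregate pattern behind the first option can be sketched in Python using the stdlib sqlite3 module as a lightweight stand-in for Postgres (the column names and data below are invented, not taken from the actual spending file):

```python
import csv
import io
import sqlite3

# Tiny invented sample standing in for the 3.7 GB spending CSV.
sample = io.StringIO(
    "supplier,amount\n"
    "Acme Ltd,100.0\n"
    "Widgets plc,250.0\n"
    "Acme Ltd,50.0\n"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spending (supplier TEXT, amount REAL)")

# Bulk-load the rows (with Postgres you would use COPY for speed).
reader = csv.DictReader(sample)
conn.executemany(
    "INSERT INTO spending VALUES (?, ?)",
    ((row["supplier"], float(row["amount"])) for row in reader),
)

# An index on the grouping column helps with repeated aggregate queries.
conn.execute("CREATE INDEX idx_supplier ON spending (supplier)")

# Who got paid the most?
top = conn.execute(
    "SELECT supplier, SUM(amount) AS total FROM spending "
    "GROUP BY supplier ORDER BY total DESC LIMIT 1"
).fetchone()
print(top)  # ('Widgets plc', 250.0)
```

The same CREATE TABLE / bulk load / CREATE INDEX / GROUP BY sequence carries over to Postgres more or less verbatim; the main practical difference at 3.7 GB is using COPY rather than row-by-row INSERTs.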
> >>
> >> Let me know your thoughts!
> >>
> >> Regards,
> >>
> >> Rufus
> >>
> >> [1]: Details of the file
> >>
> https://github.com/openspending/thingstodo/issues/5#issuecomment-16222168
> >>
> >> _______________________________________________
> >> okfn-labs mailing list
> >> okfn-labs at lists.okfn.org
> >> http://lists.okfn.org/mailman/listinfo/okfn-labs
> >> Unsubscribe: http://lists.okfn.org/mailman/options/okfn-labs
> >
> >
>
>