[wdmmg-dev] Preaggregations in Mongo collection

Friedrich Lindenberg friedrich at pudo.org
Fri Jan 7 09:01:46 UTC 2011


Hi Stiivi,

this is fantastic stuff! I have to admit to some lack of coordination:
I assumed you were working on datautil for now, then realized that
pre-aggregation is a prerequisite for some of the things I wanted to
do this week, so I've sketched out a pre-agg mechanism as well
(https://bitbucket.org/pudo/wdmmg/changeset/39cec5179f06 - a very slow
sketch!). I believe the two are different yet compatible, so here
goes:

On Fri, Jan 7, 2011 at 4:08 AM, Stefan Urbanek <stefan.urbanek at gmail.com> wrote:
> I've been analysing wdmmg data and mongo possibilities for a while to find a suitable solution for aggregated browsing. It looks like it can be done very nicely using map-reduce. The result will be a pre-aggregated collection where each record is a pre-computed cuboid (an aggregation at any multi-dimensional point).

I agree that this should be done either via map/reduce or using
MongoDB's built-in aggregation mechanism
(http://www.mongodb.org/display/DOCS/Aggregation), which is a slight
spin on a normal reduce.
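
For reference, the built-in mechanism is essentially a group() call; a
minimal sketch, assuming entries carry a top-level 'amount' field:

    // one result document per distinct 'time' value, summing amounts
    db.entry.group({
        key: { time: true },
        initial: { amount_sum: 0, record_count: 0 },
        reduce: function (doc, out) {
            out.amount_sum += doc.amount;
            out.record_count += 1;
        }
    });

It holds its results in memory, though, so map/reduce is probably the
safer bet for a full pre-aggregation.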

> P1. Denormalised dataset/collection
>
> I've learned from Friedrich that the current wdmmg 'entry' MongoDB collection contains denormalised(*) data and is self-sufficient for most of the data browsing.
>
> (*) dictionaries are not expanded into a flat structure, but that is not a problem since mongo allows 'dot-separated path' field references, such as 'cofog2.name'.

This is actually a bit of a problem, since at least with my current
implementation we need to look up the keys by which a query was made,
and thus need to transform the generated JSON into dotted notation as
well (or split the key by dots and do a "chained" lookup). But this
should be easy to solve; in JS it could just be eval'd.
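
A chained lookup is simple enough that we probably don't even need
eval - an untested sketch:

    // resolve a dotted path like 'cofog2.name' against a document
    function lookup(doc, path) {
        var parts = path.split('.');
        for (var i = 0; i < parts.length; i++) {
            if (doc == null) return null;
            doc = doc[parts[i]];
        }
        return doc;
    }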

> P2. Logical model
>
> For aggregated browsing through a multidimensional structure, where some of the dimensions might be hierarchical (like COFOG, later maybe region), a logical model has to be created. Very simply said, it is a mapping between the actual database and the way users see/browse the data. An example of the (not yet complete) WDMMG model:
>
>        http://democracyfarm.org/f/wdmmg/wdmmg_model.json

Looking at this, a few questions:

(cubes):

* Is the difference between a dimension and an attribute that
dimensions are being pre-aggregated while attributes are not? If not,
why is the distinction useful?
* Why do we need mappings? "time" seems like a special case (and I
want to work on that over the weekend), but elsewhere we can name the
mongo keys whatever we like.

(dimensions):

* Do we really want to spec out all the attributes of dimensions? I'm
a big fan of duck-typing in this whole affair, and it's only come back
to bite me once so far (trying to find all keys in a collection).
* Regarding hierarchies, I still think that there can be many
hierarchies attached to a single classifier/entity and would like to
propose my alternative in a separate thread (the idea is currently
sketched out in http://okfnpad.org/wdmmg, "Views").

> PRE-AGGREGATION
>
> Pre-aggregation can be done by a map-reduce function with all combinations(*) of dimensions and their levels:
>
>        { dimension1: [id_key_d1], dimension2: [id_key_d2] }
>
> Therefore the first argument to emit would be:
>
>        { date: [this.time]}
>        { date: [this.time], cofog: [this.cofog1._id] }
>        { date: [this.time], cofog: [this.cofog1._id, this.cofog2._id] }
>        { cofog: [this.cofog1._id] }
>        { cofog: [this.cofog1._id, this.cofog2._id] }
>
> I've tested it and it works.
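
Just to check my understanding: I'd expect the map/reduce pair for
e.g. the (date, cofog level 2) cuboid to look roughly like the sketch
below - my own reading, not your actual code, with 'cuboids' as a
made-up output collection:

    var map = function () {
        // one emit per cuboid this entry falls into; only the
        // (date, cofog level 2) combination is shown here
        emit({ date: [this.time],
               cofog: [this.cofog1._id, this.cofog2._id] },
             { amount_sum: this.amount, record_count: 1 });
    };

    var reduce = function (key, values) {
        var out = { amount_sum: 0, record_count: 0 };
        values.forEach(function (v) {
            out.amount_sum += v.amount_sum;
            out.record_count += v.record_count;
        });
        return out;
    };

    db.entry.mapReduce(map, reduce, { out: "cuboids" });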

Cool! A few questions:

* Why did you opt for the separate collection? I'm using the entry
collection with a boolean flag to signify aggregates, which seems more
consistent.
* Is it possible to run queries on the cuboids you're generating? I.e.
do they maintain all properties but the dimensions specified for the
aggregation? I'm using a highly inefficient approach to doing this,
but I think that it is generally needed.

> (*) For the beginning we can just do brute-force computation: compute all combinations from scratch. Two dimensions took ~9s on my laptop, therefore I expect just a couple of minutes for computing every combination. Later, if we consider the existing pre-computation to be too slow, there are a couple of algorithms for optimising cube computation.

I was playing with a full pre-computation yesterday by using all keys
in a flat representation and drilling down on sorted key lists but
this quickly grew to an enormous number of computations. I assume you
are being more clever here, i.e. by making the keys you want to
aggregate over explicit by naming them as dimensions.
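
To make that concrete: with explicit dimensions the search space is
just the subsets of the dimension list - 2^n for n dimensions (times
hierarchy levels) - which stays tiny, whereas doing the same over
every flat key explodes. A toy enumerator:

    // all subsets of an explicit dimension list: 2^n combinations,
    // manageable for a handful of dimensions, hopeless over all keys
    function combinations(dims) {
        var result = [[]];
        dims.forEach(function (d) {
            result = result.concat(result.map(function (c) {
                return c.concat([d]);
            }));
        });
        return result;
    }
    // combinations(['date', 'cofog'])
    // -> [[], ['date'], ['cofog'], ['date', 'cofog']]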

> PROPOSAL
>
> The result will be a collection with records:
>
>        _id = {dim1 = [values], dim2 = [values]}

I've named it _aggregation_axes, but it's the same thing.
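
Either way, fetching a single cuboid then becomes an exact match on
that composite key, e.g. (made-up cofog values, 'cuboids' as the
collection name):

    // note: matching an embedded document is order-sensitive, so the
    // key fields need to be stored in a canonical order
    db.cuboids.findOne({ _id: { cofog: ["01", "0111"] } });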

>        value = {amount_sum = ..., record_count, ... other measures }

Why not make these first-class citizens? This is what a synthetic
aggregation record could look like:

http://pastebin.com/Ff2L9j7h (shortened out the labels + descriptions,
but they're there.)

> I am currently working on a map-reduce query generator based on the logical data model (metadata). Hope to have it ready soon.
>
> What do you think about this solution?

Let's talk about this today if you have some time - I'm really excited
about all of this and think we can build a flexible yet powerful
system!

Friedrich



