[wdmmg-dev] Preaggregations in Mongo collection

Stefan Urbanek stefan.urbanek at gmail.com
Fri Jan 7 10:51:27 UTC 2011

On 7.1.2011, at 10:01, Friedrich Lindenberg wrote:

> Hi Stiivi,
> this is fantastic stuff! I have to admit to some lack of coordination
> since I assumed you were working on datautil for now and realized that
> pre-aggregation is a prerequisite for some of the things that I wanted
> to do this week, so I've sketched out a pre-agg mechanism as well
> (https://bitbucket.org/pudo/wdmmg/changeset/39cec5179f06 very slow
> sketch!). I believe the two are different yet compatible, so here
> goes:
> On Fri, Jan 7, 2011 at 4:08 AM, Stefan Urbanek <stefan.urbanek at gmail.com> wrote:
>> I've ben analysing wdmmg data and mongo possibilities for a while to find out suitable solution for aggregated browsing. It looks like it can be very nicely done using map-reduce. Result will be pre-aggregated collection where each record would be pre-computed cuboid (aggregation at any multi-dimensional point).
> I agree that this should be done either via map/reduce or using
> MongoDBs built-in aggregation mechanism
> (http://www.mongodb.org/display/DOCS/Aggregation, which is a slight
> spin on a normal reduce).

Sure, the group command can be used as well.

>> P1. Denormalised dataset/collection
>> I've learned from Friedrich, that the current wdmmg 'entry' MongoDB collection contains denormalised(*) data and is self-sufficient most of the data browsing.
>> (*) dictionaries are not expanded into flat structure, but that is not a problem as mongo allows 'dot separated path' field references, such as 'cofog2.name'.
> This is actually a bit of a problem, since at least with my current
> implementation we need to look up the keys by which a query was made
> and thus need to transform the generated JSON into a dotted notation
> as well (or split the key by dots and do a "chained" lookup). But this
> should be easy to solve, in JS it could just be eval'd.

Can you give me an example?

>> P2. Logical model
>> For aggregated browsing through multidimensional structure, where some of the dimensions might be hierarchical (like COFOG, later maybe region), logical model has to be created. Very siply said it is mapping between actual database to the way how users see/browse data. Example of WDMMG (not complete yet) model:
>>        http://democracyfarm.org/f/wdmmg/wdmmg_model.json
> Looking at this, a few questions:
> (cubes):
> * Is the difference between a dimension and an attribute that
> dimensions are being pre-aggregated while attributes are not? If not,
> why is the distinction useful?

If you have any aggregation algorithm, it has to know which is the key attribute for the dimension, because you do not want to have redundancy - aggregate by different keys which have same meaning. In mongo if you have dictionary objects from other collections, they you can simply use _id, however you might have situations where there will be attributes forming a dimension which are not a separate objects. It would be pointless to aggregate by category name and category code as both refer to the same thing.

> * Why do we need mappings? "time" seems like a special case (and I
> want to work on that over the weekend), but elsewhere we can name the
> mongo keys whatever we like.

Mappings are abstraction over physical model. Reason for mappings:

* make you application independent of physical model
* If you decide to change simple top-level attribute to a more structured dimension, you do not need to change your application, just add mapping.
* allows you to use attribute names from user's point of view, in current wdmmg implementation you are lucky to have nice key names therefore in most cases you do not need mappings (default is dimension.attribute), however you might run into a situation where you will need if.

Logical model is meant to be used by analytical applications.

> (dimensions):
> * Do we really want to spec out all the attributes of dimensions? I'm
> a big fan of duck-typing in this whole affair and its only come back
> and bitten me once yet (trying to find all keys in a collection).

No, all non-key attributes should be included within precomputed record.

> * Regarding hierarchies I still think that there can be many
> hierarchies attached to a single classifier/entity and would like to
> propose my alternative in a seperate thread (the idea is currently
> sketched out in http://okfnpad.org/wdmmg, "Views").

This is already possible in the logical data model.

>> Pre-aggregation can be done by map-reduce function with all combinations(*) of dimensions and their levels:
>>        { dimension1: [id_key_d1], dimension2: [id_key_d2] }
>> Therefore the first argument to emit would be:
>>        { date: [this.time]}
>>        { date: [this.time], cofog: [this.cofog1._id] }
>>        { date: [this.time], cofog: [this.cofog1._id, this.cofog2._id] }
>>        { cofog: [this.cofog1._id] }
>>        { cofog: [this.cofog1._id, this.cofog2._id] }
>> I've tested it and it works.
> Cool! A few questions:
> * Why did you opt for the separate collection? I'm using the entry
> collection with a boolean flag to signify aggregates, which seems more
> consistent.

Different nature of data, I think it is more appropriate and cleaner. Also mongo map-reduce output is little bit different (see below).

> * Is it possible to run queries on the cuboids you're generating? I.e.
> do they maintain all properties but the dimensions specified for the
> aggregation? I'm using a highly inefficient approach to doing this,
> but I think that it is generally needed.

Should be, I was planning to include all necessary attributes.

>> (*) For the beginning we can just do brute-force computation: compute all combinations from scratch. Two dimensions took ~9s on my laptop, therefore i expect just couple of minutes for every combination. Later, if we consider existing pre-computation to be too slow, there are couple of algorithms how to optimise cube computation.
> I was playing with a full pre-computation yesterday by using all keys
> in a flat representation and drilling down on sorted key lists but
> this quickly grew to an enormous number of computations. I assume you
> are being more clever here, i.e. by making the keys you want to
> aggregate over explicit by naming them as dimensions.

See above: that is why you need distinction between key and other ("decorative") attributes in a dimensions, if you didn't had _id in mongo, for company it would be some company ID as key and name, address as the other attributes.

>> Result will be a collection with records:
>>        _id = {dim1 = [values], dim2 = [values]}
> If named it _aggregation_axes but it's the same thing.
>>        value = {amount_sum = ..., record_count, ... other measures }
> Why not make these first-class citizens? This is what a synthetic
> aggregation record could look like:
> http://pastebin.com/Ff2L9j7h (shortened out the labels + descriptions,
> but they're there.

_id + value are mongo map-reduce results. I would recommend going "the mongo" way, as there might be database optimisations in the future.

>> I am currently working on map-reduce query generator based on the logical data model (metadata). Hope to have it ready soon.
>> What do you think about this solution?
> Lets talf about this today if you have some time - I'm really excited
> about all of this and think we can build a flexible yet powerful
> system!

Sure. I think we are converging to common solution :-)


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.okfn.org/pipermail/openspending-dev/attachments/20110107/d3551d44/attachment.html>

More information about the openspending-dev mailing list