[wdmmg-dev] Working mongo preaggregations (different approach)

Sun Jan 16 02:48:31 UTC 2011

Hi,

I have implemented a wdmmg MongoDB aggregator, inspired by the way Friedrich does his aggregation. Here is example code showing how the computation is run:

	http://databrewery.org/cubes/doc/computing.html

Basically, it computes aggregations for all combinations of dimension/level keys, reading from the fact mongo collection ('entry') and writing into the cube mongo collection (which might be the same as the fact collection; a distinct key is used). More about the cube builder and how the computation is done:

	http://databrewery.org/cubes/doc/api/builders.html#cubes.builders.MongoSimpleCubeBuilder

Source:
	https://bitbucket.org/Stiivi/cubes/src/c1e72314359d/cubes/builders/mongo.py

A simplified example of the logical model description is here (it does not cover all dimensions/hierarchies):

	https://bitbucket.org/Stiivi/cubes/src/c1e72314359d/doc/wdmmg_model.json

Pre-aggregation for all possible combinations of 5 dimensions and their hierarchies (3 for cofog) takes ~37 seconds.
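
To make the idea of "all combinations of dimension/level keys" concrete, here is a rough in-memory sketch. It is not the code from MongoSimpleCubeBuilder: it uses plain Python dicts instead of mongo collections, and the dimension names, level keys, sample facts and function names are all made up for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical dimensions: each maps to its ordered hierarchy level keys.
DIMENSIONS = {
    "time": ["year"],              # flat dimension, single level
    "cofog": ["cofog1", "cofog2"], # two-level hierarchy
}


def level_prefixes(levels):
    """All drill-down depths of one hierarchy, including 'dimension not used'."""
    return [tuple(levels[:depth]) for depth in range(len(levels) + 1)]


def cuboid_keys(dimensions):
    """Yield every combination of dimension/level keys as one flat tuple."""
    choices = [level_prefixes(levels) for levels in dimensions.values()]
    for combo in product(*choices):
        yield tuple(key for prefix in combo for key in prefix)


def preaggregate(facts, dimensions, measure="amount"):
    """Aggregate sum and record count of `measure` for every key combination."""
    cube = {}
    for keys in cuboid_keys(dimensions):
        cells = defaultdict(lambda: {"sum": 0, "count": 0})
        for fact in facts:
            point = tuple(fact[key] for key in keys)
            cells[point]["sum"] += fact[measure]
            cells[point]["count"] += 1
        cube[keys] = dict(cells)
    return cube


# Made-up facts standing in for documents of the 'entry' collection.
facts = [
    {"year": 2009, "cofog1": "07", "cofog2": "07.1", "amount": 100},
    {"year": 2009, "cofog1": "07", "cofog2": "07.2", "amount": 50},
    {"year": 2010, "cofog1": "09", "cofog2": "09.1", "amount": 30},
]

cube = preaggregate(facts, DIMENSIONS)
print(cube[("year",)])           # totals per year
print(cube[("year", "cofog1")])  # totals per year and top cofog level
```

The empty key tuple `()` holds the grand total. The number of combinations is the product of (levels + 1) over all dimensions, which is why the pre-aggregation time grows quickly as dimensions and hierarchy levels are added.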

To align our terminology:

- dimensions are what you call tags/tag-like attributes; in fact, a dimension is any attribute that is interesting to measure facts by
- do not get too confused by hierarchies: any single unrelated attribute can be a dimension, i.e. a flat dimension without a hierarchy; for example, "time" is currently such an attribute

Note that the aggregator is general and can be reused for other models, or to create other kinds of views of the same data. For example, if you have all wdmmg datasets in the 'entry' collection, you can add a 'dataset' dimension and have:
- one logical model per country/dataset - with dataset specialities
- one global logical model for all datasets with "lowest common denominator" of dimensions + measures for comparisons between countries/datasets
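
As a small sketch of the "lowest common denominator" idea: the shared dimensions of the global model are simply the intersection of the per-dataset dimension sets, plus the 'dataset' dimension itself. The dataset names and dimension lists below are invented for illustration and are not taken from the real wdmmg models.

```python
# Hypothetical per-dataset models: dataset name -> set of dimension names.
DATASET_DIMENSIONS = {
    "dataset_a": {"time", "cofog", "region", "from", "to"},
    "dataset_b": {"time", "cofog", "from", "to"},
    "dataset_c": {"time", "cofog", "to", "programme"},
}


def common_dimensions(per_dataset):
    """Dimensions present in every dataset (the lowest common denominator)."""
    return set.intersection(*per_dataset.values())


def global_model_dimensions(per_dataset):
    """Shared dimensions plus 'dataset', so results can be compared by dataset."""
    return sorted(common_dimensions(per_dataset) | {"dataset"})


print(global_model_dimensions(DATASET_DIMENSIONS))
```

Each per-dataset model keeps its full dimension set, while the global model only offers what can be compared across all of them.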

Last but not least, a recommendation for measures: I've noticed that you are applying some coefficient to the amount measure. It is recommended to keep all versions of a measure if possible: raw, computed, with VAT, without VAT, with discount, excluding discount, log scale, rounded to 1000's, ... you name it. It is up to the user to choose the one they want to use. And it is much easier to flip one attribute name in an application configuration than to change code.
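
A minimal sketch of what "flipping one attribute name" could look like, assuming the facts store several versions of the amount side by side. The field names, values and the CONFIG dict are hypothetical and do not come from the wdmmg code.

```python
# Made-up facts keeping several versions of the same measure.
facts = [
    {"amount_raw": 1234.5, "amount_adjusted": 1100.0, "amount_rounded": 1000},
    {"amount_raw": 765.5, "amount_adjusted": 700.0, "amount_rounded": 1000},
]

# Hypothetical application configuration: the only place the measure is named.
CONFIG = {"measure": "amount_raw"}


def total(facts, config):
    """Sum whichever measure version the configuration selects."""
    return sum(fact[config["measure"]] for fact in facts)


print(total(facts, CONFIG))                          # uses the raw amount
print(total(facts, {"measure": "amount_adjusted"}))  # same code, other measure
```

The aggregation code never mentions a particular version of the amount, so switching versions is a configuration change only.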

Good night,

Stefan

p.s.: Another advantage of this approach is that it can be (and in most cases already is) unit tested.