[wdmmg-dev] Preaggregations in Mongo collection

Stefan Urbanek stefan.urbanek at gmail.com
Fri Jan 7 03:08:53 UTC 2011


I've been analysing the wdmmg data and Mongo's possibilities for a while to find a suitable solution for aggregated browsing. It looks like it can be done very nicely using map-reduce. The result will be a pre-aggregated collection where each record is a pre-computed cuboid (an aggregation at one multi-dimensional point).


P1. Denormalised dataset/collection

I've learned from Friedrich that the current wdmmg 'entry' MongoDB collection contains denormalised(*) data and is self-sufficient for most of the data browsing.

(*) dictionaries are not expanded into a flat structure, but that is not a problem, as Mongo allows 'dot-separated path' field references, such as 'cofog2.name'.
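For example, a dot-separated path resolves through the nested structure roughly like this hypothetical helper does (Mongo performs the equivalent lookup internally):

```javascript
// Hypothetical helper mirroring Mongo's dot-path lookup: walk a nested
// document one field at a time, returning undefined if a step is missing.
function resolvePath(doc, path) {
  return path.split(".").reduce(function (obj, field) {
    return obj == null ? undefined : obj[field];
  }, doc);
}

// Illustrative entry fragment (values are made up for the example):
var entry = { cofog2: { _id: "02.1", name: "Military defence" } };
resolvePath(entry, "cofog2.name");  // -> "Military defence"
```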

P2. Logical model

For aggregated browsing through a multidimensional structure, where some of the dimensions might be hierarchical (like COFOG, later maybe region), a logical model has to be created. Very simply said, it is a mapping between the actual database and the way users see/browse the data. An example of the WDMMG model (not complete yet):


The model entities are described here (with ER diagram):


I'm not going into details now.

We need logical model to know:

	1. all dimensions and their hierarchies(*) --> for map
	2. all measures (we have one: amount) --> for reduce

(*) some of them will be flat (year, "from"), but a one-level hierarchy is still a kind of hierarchy
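As a sketch, the metadata needed for points 1 and 2 might look like this (the structure and field names are my assumptions for illustration, not the actual wdmmg model):

```javascript
// Illustrative logical-model metadata: each dimension lists its levels
// in hierarchy order, each level naming the source field that holds its
// key; measures name the fields to aggregate in reduce.
var model = {
  dimensions: {
    date:  { levels: ["time"] },                     // flat: single level
    cofog: { levels: ["cofog1._id", "cofog2._id"] }, // two-level hierarchy
    from:  { levels: ["from._id"] }                  // flat
  },
  measures: ["amount"]
};
```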


Pre-aggregation can be done by a map-reduce function over all combinations(*) of dimensions and their levels:

	{ dimension1: [id_key_d1], dimension2: [id_key_d2] }

Therefore the first argument to emit would be:

	{ date: [this.time]}
	{ date: [this.time], cofog: [this.cofog1._id] }
	{ date: [this.time], cofog: [this.cofog1._id, this.cofog2._id] }
	{ cofog: [this.cofog1._id] }
	{ cofog: [this.cofog1._id, this.cofog2._id] }

I've tested it and it works.
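A sketch of one such generated map function, for the cuboid combining the date dimension with both cofog levels (field names taken from the emit examples above). In MongoDB's shell the document is `this` and `emit` is built in; here both are explicit parameters so the sketch runs on its own:

```javascript
// Map sketch for the { date, cofog level 2 } cuboid: the emitted key is
// the multi-dimensional point, the emitted value carries the measures
// that the reduce step will sum.
function mapDateCofog2(doc, emit) {
  emit(
    { date: [doc.time], cofog: [doc.cofog1._id, doc.cofog2._id] },
    { amount_sum: doc.amount, record_count: 1 }
  );
}
```

One such function would be generated per dimension/level combination.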

(*) To begin with we can just do brute-force computation: compute all combinations from scratch. Two dimensions took ~9s on my laptop, so I expect just a couple of minutes for all the combinations. Later, if we consider this pre-computation too slow, there are a couple of algorithms for optimising cube computation.


Result will be a collection with records:

	_id = { dim1: [values], dim2: [values] }
	value = { amount_sum: ..., record_count: ..., other measures ... }
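A matching reduce sketch (the measure names follow the record layout above; the function itself is my illustration):

```javascript
// Reduce sketch: MongoDB calls it with a cuboid key and the array of
// values emitted for that key; its return value becomes the `value`
// field of the pre-aggregated record. Summing keeps it re-reduce safe,
// since partial results have the same shape as emitted values.
function reduce(key, values) {
  var result = { amount_sum: 0, record_count: 0 };
  values.forEach(function (v) {
    result.amount_sum += v.amount_sum;
    result.record_count += v.record_count;
  });
  return result;
}
```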

Therefore an aggregated query will be quite fast:

All 2010:
	query: {_id: {date:[2010]}}

All 2010 for some cofog1:
	query: {_id: {date: [2010], cofog: ["some_cofog1_code"]}}

If we had months (date dimension with year-month hierarchy) and/or quarters in other datasets:

	query: {_id: {date_ym: [2010, 12], cofog: ["some_cofog1_code"]}}
	query: {_id: {date_yqm: [2010, "Q3"], cofog: ["some_cofog1_code"]}}
	query: {_id: {date_yqm: [2010, "Q1", "1"], cofog: ["some_cofog1_code"]}}

Each of the queries will return only one record.

I am currently working on a map-reduce query generator based on the logical data model (metadata). I hope to have it ready soon.
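As a sketch, the part of such a generator that enumerates the cuboid combinations could look like this (dimension names and level paths are illustrative):

```javascript
// Enumerate every cuboid specification { dimension: [level key paths] }:
// each non-empty subset of dimensions, crossed with every hierarchy
// depth of each chosen dimension.
function cuboids(dims) {
  var names = Object.keys(dims);
  var result = [];
  // iterate over all non-empty subsets of dimensions via a bitmask
  for (var mask = 1; mask < (1 << names.length); mask++) {
    var chosen = names.filter(function (n, i) { return mask & (1 << i); });
    // for each chosen dimension, expand every hierarchy depth 1..levels
    var specs = [{}];
    chosen.forEach(function (name) {
      var next = [];
      dims[name].forEach(function (level, depth) {
        specs.forEach(function (spec) {
          var copy = Object.assign({}, spec);
          copy[name] = dims[name].slice(0, depth + 1);
          next.push(copy);
        });
      });
      specs = next;
    });
    result = result.concat(specs);
  }
  return result;
}

// Example: a flat date dimension and a two-level cofog hierarchy
var all = cuboids({
  date:  ["time"],
  cofog: ["cofog1._id", "cofog2._id"]
});
```

For this two-dimensional example it yields exactly the five cuboid key shapes listed in the emit examples above.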

What do you think about this solution?


Stefan Urbanek
