[wdmmg-dev] Classifiers and OLAP (Was: Re: Struggling with new classifier system... )

Stefan Urbanek stefan.urbanek at gmail.com
Mon Dec 27 00:25:43 UTC 2010


I suggest we have a talk (Skype/IRC) about classification. Here are a couple of observations about classification, plus notes and examples of how OLAP can be used for browsing through classification hierarchies. The mail is quite long; there is a summary at the end.


There is a generic container (a mongo collection) for all classifiers for all datasets (wdmmg, bmf, ...); however, it looks like the only purpose of the collection is to store hierarchical data. It might seem that the collection contains "recursive hierarchies", that is, hierarchies whose levels (their number and their names) are unknown. Despite this recursive nature, the hierarchies in the collection are well known from the perspective of each dataset. Correct me if I am wrong. Therefore, for each dataset it should be possible to list the hierarchy levels in use and describe them. It would be great if the hierarchies were documented somewhere. For example, in BMF it looks like there are two hierarchies (fkz - funktion?, gpl - gruppe?) with three known levels each (I do not know German, but it looks like some levels are named kapitel, einzelplan, ...). Recursive hierarchies are very tricky to analyse and aggregate, and they should be avoided where possible. For aggregations we would still need to "unwind" those recursions.
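To illustrate what "unwinding" could mean here, a minimal Python sketch follows. The record layout (`parent`, `label`), the ids and the level names are assumptions invented for this example; they are not the actual structure of the classifier collection.

```python
# Hypothetical classifier records stored as a recursive (parent-pointer) hierarchy.
classifiers = {
    "1":   {"parent": None, "label": "Level A item 1"},
    "11":  {"parent": "1",  "label": "Level B item 11"},
    "111": {"parent": "11", "label": "Level C item 111"},
}

# Level names as they would be documented for one dataset (assumed here).
LEVELS = ["level_a", "level_b", "level_c"]


def unwind(classifier_id, classifiers, levels):
    """Flatten the chain from the root down to classifier_id into level columns."""
    chain = []
    current = classifier_id
    while current is not None:
        chain.append(current)
        current = classifiers[current]["parent"]
    chain.reverse()  # root first

    if len(chain) > len(levels):
        raise ValueError("hierarchy is deeper than the documented levels")

    row = {}
    for level, key in zip(levels, chain):
        row[level + "_id"] = key
        row[level + "_name"] = classifiers[key]["label"]
    return row


flat = unwind("111", classifiers, LEVELS)
```

Once every entry carries such flat level columns, aggregation is a plain group-by on a known column instead of a walk through parent pointers.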


In our case, it looks like it is possible to avoid them from the application's perspective. If the application uses a logical model [1] to navigate aggregated data, then the application is independent of the hierarchy, because the hierarchy is described in the model metadata.

Diagram with logical model example and dimension metadata example:


Note that the model represents the denormalised form, which is the form we need for fast aggregations.
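As a rough idea of what such metadata and a denormalised record might look like, here is a small Python sketch. The field names (`levels`, `key`, `label`) are illustrative assumptions and not necessarily the format described in [1]; only two of the levels are shown, using the level names guessed above.

```python
# Hypothetical metadata for one dimension; the field names are illustrative only.
funktion = {
    "name": "funktion",
    "levels": [
        {"name": "kapitel", "key": "kapitel_id", "label": "kapitel_name"},
        {"name": "einzelplan", "key": "einzelplan_id", "label": "einzelplan_name"},
    ],
}

# One fact row in denormalised form: every hierarchy level is a plain column.
row = {
    "kapitel_id": 2, "kapitel_name": "Kapitel 2 Name",
    "einzelplan_id": 1, "einzelplan_name": "Einzelplan 1 Name",
    "amount": 1234,
}


def next_level(dimension, path):
    """Return the level a drill-down at `path` would group by, or None at the bottom."""
    depth = len(path)
    levels = dimension["levels"]
    return levels[depth] if depth < len(levels) else None
```

The application only ever asks the metadata "which level comes next for this path?", so adding or renaming a level changes the metadata and not the application code.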

We would browse through dimensions and dimension levels like this:


EXAMPLE 1 - drill-down

The metadata would be used in an application like this (a proposal for code/API, which does not exist in Python yet):

	# Assume we have model as JSON somewhere
	modelfile = "...../models/bmf_model.json"
	model = model_from_file(modelfile)

	# we have only one cube in this case
	cube = model.cube_with_name("bmf")
	browser = AggregationBrowser(cube = cube, connection = some_db_connection)

And now we want to drill down through some dimension.

Let's assume that we are at the top level, and therefore the path is:
	path = []

Then you aggregate:

	# Get current cut from URL
	# If we are at top level, then cut is empty
	cut = cut_from_url(parameters["cut"])
	# Get current slice
	# If we are at top level, then slice = whole cube
	slice = browser.slice(cut)

	# This will drill down through dimension:
	result = slice.aggregate(measure = "amount", dimension = "funktion", path = path)

If you want a drill-down through another dimension on the same page within the same cut, then just do:

	result_gruppe = slice.aggregate(measure = "amount", dimension = "gruppe", path = path)

Now result has records with keys like: kapitel_id, kapitel_name, amount_sum, record_count. You can construct URL links for browsing deeper by adding kapitel_id to the path. The values needed for constructing the HTML/chart would therefore be:
	path: [1], label: "Kapitel 1 Name", value: 1234 --> URL would contain: cut=funktion:1
	path: [2], label: "Kapitel 2 Name", value: 2345
	path: [3], label: "Kapitel 3 Name", value: 5234

You construct the link for drilling down from the path. If the user clicks on "Kapitel 2", the path will be [2]; you can use the same code as above to get the drill-down values, and you will get:
	path: [2,1], label: "Einzelplan 1 Name", value: 1234 --> URL would contain: cut=funktion:2,1
	path: [2,2], label: "Einzelplan 2 Name", value: 2345
	path: [2,3], label: "Einzelplan 3 Name", value: 5234
and so on...

As you can see, the application does not have to know the number of levels; the logical model does. The model also knows the key fields (used for aggregation) and the description fields (used for display to the user) for each level.
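To make the drill-down mechanics concrete, here is a toy in-memory Python version of the aggregation step. It is only a sketch: the `aggregate` function, the `LEVELS` table and the sample rows are made up for illustration and are not the proposed AggregationBrowser, which would run against the database.

```python
# Hypothetical level metadata: (key field, label field) per level of each dimension.
LEVELS = {
    "funktion": [("kapitel_id", "kapitel_name"),
                 ("einzelplan_id", "einzelplan_name")],
}

# Made-up denormalised rows.
ROWS = [
    {"kapitel_id": 1, "kapitel_name": "Kapitel 1 Name",
     "einzelplan_id": 1, "einzelplan_name": "Einzelplan 1 Name", "amount": 100},
    {"kapitel_id": 2, "kapitel_name": "Kapitel 2 Name",
     "einzelplan_id": 1, "einzelplan_name": "Einzelplan 1 Name", "amount": 200},
    {"kapitel_id": 2, "kapitel_name": "Kapitel 2 Name",
     "einzelplan_id": 1, "einzelplan_name": "Einzelplan 1 Name", "amount": 50},
    {"kapitel_id": 2, "kapitel_name": "Kapitel 2 Name",
     "einzelplan_id": 2, "einzelplan_name": "Einzelplan 2 Name", "amount": 300},
]


def aggregate(rows, measure, dimension, path):
    """Sum `measure` one level below `path`, grouped by that level's key field."""
    levels = LEVELS[dimension]
    if len(path) >= len(levels):
        raise ValueError("path already points at the deepest level")

    # Keep only the rows that lie under the current path.
    for depth, value in enumerate(path):
        key_field = levels[depth][0]
        rows = [r for r in rows if r[key_field] == value]

    # Group by the key of the next level; the label is carried along for display.
    key_field, label_field = levels[len(path)]
    groups = {}
    for r in rows:
        group = groups.setdefault(r[key_field], {
            "path": path + [r[key_field]],
            "label": r[label_field],
            measure + "_sum": 0,
            "record_count": 0,
        })
        group[measure + "_sum"] += r[measure]
        group["record_count"] += 1
    return [groups[key] for key in sorted(groups)]


top = aggregate(ROWS, "amount", "funktion", [])
below_kapitel_2 = aggregate(ROWS, "amount", "funktion", [2])
```

The caller never mentions kapitel or einzelplan; it passes a dimension name and a path, and each result record already contains the path to use for the next drill-down link.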

EXAMPLE 2 - more cuts

If you add this to the previous code:
	slice = slice.cut(dimension = "date", path = [2010])

Then all aggregations will be within year 2010.

You can use *the same code* for browsing through a time hierarchy: drill-down at the top level: path = [], drill-down through all months of 2010: path = [2010], ... OK, maybe not in a year-based budget with a flat time dimension, but in other apps you can.
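A small Python sketch of how cuts on several dimensions could be combined and carried in a URL parameter. Everything here is an assumption for illustration: the helper names, the `dimension:key,key` encoding (modelled on the `cut=funktion:1` example above) and the integer keys.

```python
# Hypothetical key fields per level for each dimension.
LEVEL_KEYS = {
    "funktion": ["kapitel_id", "einzelplan_id"],
    "date": ["year", "month"],
}

ROWS = [
    {"year": 2009, "month": 1, "kapitel_id": 2, "einzelplan_id": 1, "amount": 10},
    {"year": 2010, "month": 1, "kapitel_id": 2, "einzelplan_id": 1, "amount": 20},
    {"year": 2010, "month": 2, "kapitel_id": 1, "einzelplan_id": 1, "amount": 30},
]


def cut(rows, dimension, path):
    """Return only the rows that lie under `path` in `dimension`."""
    keys = LEVEL_KEYS[dimension]
    return [r for r in rows if all(r[k] == v for k, v in zip(keys, path))]


def cut_to_param(dimension, path):
    """Encode a cut as 'dimension:key,key,...' for use in a URL."""
    return dimension + ":" + ",".join(str(p) for p in path)


def cut_from_param(param):
    """Decode the string produced by cut_to_param (integer keys assumed)."""
    dimension, _, rest = param.partition(":")
    path = [int(p) for p in rest.split(",")] if rest else []
    return dimension, path


# Cuts compose: first restrict to 2010, then to Kapitel 2.
rows_2010 = cut(ROWS, "date", [2010])
rows_2010_kapitel_2 = cut(rows_2010, "funktion", [2])
```

The date dimension goes through exactly the same `cut` function as funktion; only the entry in `LEVEL_KEYS` differs.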

EXAMPLE 3 - comparison of reporting (constructing clickable table/chart on a page) flat dimension and deep dimension drill-downs:

See image:


EXAMPLE 4 - get dimension values

To get all available years within the dataset:

	slice = browser.whole()
	slice.values_for_dimension(dimension = "date", path = [])

To get all level 2 values for Kapitel 1:

	values = slice.values_for_dimension(dimension = "funktion", path = [1])

To get all level 2 values for Kapitel 1 in 2010 (assume that there might be differences between years, such as new stuff/old stuff):

	slice = browser.whole().cut(dimension = "date", path = [2010])
	# Note that this is same code as above:
	values = slice.values_for_dimension(dimension = "funktion", path = [1])
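For illustration, a toy in-memory Python version of this lookup. It is a sketch only: the function body, the `LEVELS` table and the sample rows are assumptions and not the proposed browser implementation.

```python
# Hypothetical (key field, label field) pairs per level of each dimension.
LEVELS = {
    "date": [("year", "year")],
    "funktion": [("kapitel_id", "kapitel_name"),
                 ("einzelplan_id", "einzelplan_name")],
}

# Made-up rows: Einzelplan 3 under Kapitel 1 exists only in 2010.
ROWS = [
    {"year": 2009, "kapitel_id": 1, "kapitel_name": "Kapitel 1 Name",
     "einzelplan_id": 1, "einzelplan_name": "Einzelplan 1 Name"},
    {"year": 2010, "kapitel_id": 1, "kapitel_name": "Kapitel 1 Name",
     "einzelplan_id": 1, "einzelplan_name": "Einzelplan 1 Name"},
    {"year": 2010, "kapitel_id": 1, "kapitel_name": "Kapitel 1 Name",
     "einzelplan_id": 3, "einzelplan_name": "Einzelplan 3 Name"},
    {"year": 2010, "kapitel_id": 2, "kapitel_name": "Kapitel 2 Name",
     "einzelplan_id": 1, "einzelplan_name": "Einzelplan 1 Name"},
]


def values_for_dimension(rows, dimension, path):
    """List the distinct (key, label) pairs one level below `path`."""
    levels = LEVELS[dimension]
    for depth, value in enumerate(path):
        rows = [r for r in rows if r[levels[depth][0]] == value]
    key_field, label_field = levels[len(path)]
    seen = {}
    for r in rows:
        seen[r[key_field]] = r[label_field]
    return sorted(seen.items())


years = values_for_dimension(ROWS, "date", [])
rows_2009 = [r for r in ROWS if r["year"] == 2009]
in_2009 = values_for_dimension(rows_2009, "funktion", [1])
overall = values_for_dimension(ROWS, "funktion", [1])
```

The call for 2009 is identical to the call for the whole dataset; only the rows passed in (the slice) differ, which is why the two lists can differ.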


For the aggregation API, we would just expose the metadata, that is, the logical model; perhaps simply serve a JSON version of model.to_dict() from a single HTTP GET call. Then we would expose the Aggregation Browser to allow third parties to get data.


Summary:

* The classifier collection contains recursive hierarchies only because it is a generic container (if this statement is wrong, please give examples)
* Classifiers with a known hierarchy need to be documented (and if we decide to go the OLAP way, then a model has to be created [1])
* An application based on an aggregation browser is abstracted from the physical data model, and therefore from classifier hierarchies as well
* From the perspective of aggregated browsing in an end-user application, there is no difference between date/time and any other dimension
* If our data structure changes, we will not need to change the API, the aggregations or the browser, only the metadata in the logical model description.

Let's talk about this later on IRC/Skype.



[1] http://databrewery.org/doc/cubes.html#dimension-descriptions-in-dim-json

On 24.12.2010, at 18:33, Anna Powell-Smith wrote:

> Hi Friedrich
> Yes, sorry to disrupt your Christmas - please don't reply until next week!
> So I can handle Entries and Keys in mongo just fine, but I'm struggling a bit with the concepts behind Classifiers, particularly how to create a taxonomy. The issue is following the logic back through the example in the BMF loader [1] .
> Could I have a very simple example of how to use create_classifier, get_classifier and classify_entry? Or even better talk this through over Skype, whenever suits you?
> thanks!
> Anna
> [1] https://bitbucket.org/okfn/wdmmg-ext/src/b40a18c9f3c0/wdmmgext/load/bmf/load.py
> On 24 December 2010 16:49, Rufus Pollock <rufus.pollock at okfn.org> wrote:
> Hi Friedrich,
> We've very nearly got the main WDMMG site completely switched to mongo
> but Anna seems to have run into an issue with understanding how the
> classifiers work (see below).
> Reading the mail I'm not entirely sure exactly what the issue is
> (Anna: perhaps you can give a bit more detail) but a small bit of help
> from you would be really useful :)
> Perhaps we can catch up early next week (don't want to disrupt your
> xmas with WDMMG questions!).
> Rufus
> ---------- Forwarded message ----------
> From: Anna Powell-Smith <annapowellsmith at gmail.com>
> Date: 23 December 2010 22:09
> Subject: Struggling with new classifier system...
> To: Rufus Pollock <rufus.pollock at okfn.org>
> Hi Rufus
> I've added tests for all the data loaders, got them passing, and
> checked them in.
> However I've hit a blocker. Since I wrote the loaders, Friedrich has
> changed the way classifiers work, and my loaders are no longer
> compatible with it. They still run OK & do everything needed to
> support the Flash - it's just the Entry pages don't have any links to
> Classifiers. This is OK for testing Flash etc, but not for live site.
> [...]
> _______________________________________________
> wdmmg-dev mailing list
> wdmmg-dev at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/wdmmg-dev

Stefan Urbanek
freelance consultant, analyst

