[openspending-dev] OS to Cubes mapping

Wed May 30 18:14:08 UTC 2012

On 30.5.2012, at 19:37, Friedrich Lindenberg wrote:

> Hey,
> 
> On Tue, May 29, 2012 at 7:38 PM, Stefan Urbanek
> <stefan.urbanek at gmail.com> wrote:
>> On Tue, May 29, 2012 at 12:46 AM, Stefan Urbanek
>> <stefan.urbanek at gmail.com> wrote:
>> 
>> Was not able to go to sleep before I have finished this:
>> 
>> 
>> Generate model: https://gist.github.com/2821496
>> 
>> Get slicer.ini: https://gist.github.com/2821497
>> 
>> 
>> run:
>> 
>> 
>> slicer serve slicer.ini
>> 
>> 
>> and:
>> 
>> 
>> curl "http://localhost/aggregate"
>> 
>> 
>> Notes:
>> 
>> 
>> * model looks overcomplicated because OS and Cubes have different naming
>> 
>> schema (Cubes uses prefix_<cube_name> vs. OS <cube_name>__entries and
>> 
>> prefix_<dim_name> vs. <cube_name>__<dim_name>. In Cubes dimensions are
>> 
>> shared within model. That is why currently there has to be explicit mapping
>> 
>> for every single attribute. See [1]
> 
> So basically all cubes share the dimensions?

Yes. Basically - one model represents a "datamart" - a common domain with common dimensions/taxonomy/meaning.

> Do you think this would
> have advantages for us as well?

I am not yet 100% sure, will need discussion. there are advantages and dis-advantages for that... 

In the long term, I think it has more advantages. In the short term, the major disadvantage is the namespace - you have one namespace per dataset, which would require some work when building models in cubes.

I have an idea in mind, will draw something later and post it.

> It seemed a bit hard to do this from
> an access control point of view, but it may make sense - we would
> otherwise have to introduce some high-level mapping some time to link
> up the same dimension in different datasets.

here we go ... Master Data ;-)

> 
>> * "date" is not yet converted (is ignored), as it is too late, fingers are
>> 
>> heavy, keyboard soft and screen is dimming in front of me.
> 
> :)

Speaking about this, how can I get information about what levels the date in the dataset has?

> 
>> * cubes introduced new checking for string-cut conversions, that every
>> 
>> dimension key has to have form of identifier (alphanumeric and underscore).
>> 
>> might work if requested as /report with POST json (see [2]), not tested. I
>> 
>> should probably introduce URL encoding.
> 
> I don't get this, can you clarify it a bit?
> 

Sure. Currently you can use in URLs such as ?cut=category:foo,bar only alnum+ '_' for the 'foo' and 'bar'. I had a discussion about this couple of weeks ago and we decided to be a good idea, as most of the keys in well-behaved warehouses are supposed to be in that form anyway and the details are not used for browsing. So if you have a dimension with values that contain other characters, such as spaces, then it will not work and will result either in undefined behaviour or error.

However, if you do request using /report query as described in [2] then you can specify the cut with any character. Say you have a dimension with countries and you have country_code and country_name and say we have "Czech Republic". If both were considered dimensions, then you would not be able to form URL using country name, because it contains space. Therefore instead of: "/aggregate?cut=country_name:Czech Republic" you would have to do POST /report with:

{
	"cell": [
		{ "type": "point", "path": ["Czech Republic"] }
	],
	"queries": {
		"my_query": { "query": "aggregate" }
		}
	}
}

Reasons for the key-only selection are:
* "cz" and "Czech Republic" represent the same point in the country dimension
* will result in redundant pre-aggregation (later, when implemented)
* preferred way of handling that is to always get the key. If you have UI you get dimension values anyway, so you present "Czech Republic" to the user, but you know, that it means "cz". In cubes you can get values for current cell with /dimension (respects cut=):

http://tender.sme.sk/api/dimension/date?cut=date:2012&prettyprint=yes

if you want to get it on particular depth, then you can do that with:

http://tender.sme.sk/api/dimension/date?cut=date:2012&prettyprint=yes&depth=2

you know which dimension attribute is the key from the model: level.key. So the top level key is:

hier = dimension.hierarchy()
level = hier[0]
key_attribute = level.key
label_attribute = level.label

values = browser.values(cell)
key = values[key_attribute.full_name()]
label = values[label_attribute.full_name()]

or something like that (might be simplified later...).

> 
>> [1] http://packages.python.org/cubes/mapping.html
>> 
>> [2] http://packages.python.org/cubes/server.html#reports
>> 
>> 
>> just wanted to try it :-)
> 
> Very cool hack :)
> 

Thanks :-)

Btw. with 0.9.1 release there is denormalisation implemented in slicer (no python coding needed) ... with the view you get nice speed improvement, even better with materialized view and even much better with mat.view with indexes. See http://packages.python.org/cubes/slicer.html#denormalize

> - Friedrich

Stefan Urbanek
data analyst and data brewmaster

Twitter: @Stiivi
Home: http://stiivi.com
Brewery: http://databrewery.org
Github: https://github.com/Stiivi