[openspending-dev] Experiment: flat-file aggregator "API"

Thu Apr 30 09:59:54 UTC 2015

I'm stuck in a very boring airport, my apologies for getting back to you
quickly.

On Thu, Apr 30, 2015 at 11:30 AM, Paul Walsh <paulywalsh at gmail.com> wrote:
>
> For (b), I have a bigger problem. Looking at the UK and DE budgets, I can
> quickly figure out that I would like to compare Cofog1 with Hauptfunktion,
> Cofog2 with Oberfunktion, Cofog3 with Funktion. That's the bit which the
> standard would help me with.
>
> But now I'm still left with the need to actually align these
> classification schemes. So I want to 1) decide what my spine is, 2) map
> specific local spend taxonomies to that spine, e.g. by throwing it into
> PyBossa and getting a bunch of pol-sci students to work through it and 3)
> generate aggregates in which the spine is used for aggregation instead of
> the local classification.
>
> To my knowledge, the only member of our community that is pulling this off
> is Mark Brough (http://data.aidonbudget.org/SN/), and from his quiet
> curses in our office I can tell that it's not an easy thing to do.
>
> So I think that saying "if you want to perform comparative analysis, give
> us your data aligned towards GFSM and COFOG spines" is pretty much shipping
> around the problem. I think it would be much more fun if OpenSpending
> actually provided the tools to do this. Imagine having a web service ( :D )
> that uses annotations on the OpenSpending OLAP logical model to determine
> which dimensions in a given dataset are supposed to be aligned with which
> spine. It would then download all dimension members for the local dimension
> and feed them to a PyBossa app, wait for reconciliation towards the spine
> to complete and finally load the mapping back into our warehouse, where
> they become global dimensions (
> https://pythonhosted.org/cubes/model.html#dimension-visibility).
>
> The cool thing about this is that it is iterative and community-driven,
> i.e. the alignment does not need to exist before the data is stored and
> loaded, but instead can be added dynamically. These things are political
> and I would love to see fights break out about how to classify a certain
> budget category - rather than implicitly hiding these mappings in the
> source data, produced by whoever made the dataset.
>
>
> That all sounds awesome, and again, there is no reason that such a service
> could not feed data back into a Data Package (principle of progressive
> enhancement in the specification):
>
> To be clear, OSDP also doesn’t expect such metadata to be there *before*
> it is loaded, but would ideally provide a way to declare such annotations
> (eg: GFSM to local classification), and data packages can be updated
> (enhanced?) over time. If we think about this in terms of OpenSpending v2,
> let’s say that initially, all data packages basically provide metadata on
> amount/time/location. Over time, a microservice (:)) that provides something
> as you describe could be used to help bring the entire datastore forward in
> terms of expanding use cases, etc. (just thinking out loud).
>

The issue with this is that OSDP doesn't have the notion of a
"Hauptfunktion" (to stick with my example) which I could annotate to say
"map this up with COFOG". Instead, OSDP will see some columns (let's say
hauptfunktionID, hauptfunktionLabel, hauptfunktionDesc) and not understand
that they form a common thing, so I would have to annotate any or all of
them with the spine mapping info. In either case, it ends up being
ambiguous.

The OSDP solution to this is naming conventions: I rename my columns from
hauptfunktionID to functionalID etc. and by convention this gets picked up.
The problem I have with this is that it constitutes a loss of information
(i.e. the term "hauptfunktion" has an actual legal meaning beyond
functional classification), and, as an aside, it also doesn't seem to
support hierarchies (i.e. hauptfunktion, oberfunktion, funktion would have
to be reduced to one column set).

The alternative is to define an explicit mapping in which I say that
hauptfunktionID, hauptfunktionLabel, hauptfunktionDesc all form different
attributes of the same dimension. Then I can say that this dimension should
be mapped out to COFOG. That's what the OLAPpians call a logical model,
which I keep hammering on about. If you want to see a data standard
focussed around such modelling, I would point you at Google's DSPL (
https://developers.google.com/public-data/docs/tutorial).
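
To make the explicit mapping a bit more concrete, here is a minimal Python
sketch. Everything in it is an assumption for illustration: the Dimension
class, the "spine" annotation, the sample rows and the local-ID-to-spine
mapping are all made up, and none of it is part of OSDP, cubes or DSPL. It
only shows the idea: group the three columns into one dimension, annotate
that dimension with a spine, then aggregate by the spine instead of the
local classification.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Dimension:
    """Hypothetical logical-model entry: several columns, one dimension."""
    name: str
    key: str                     # column holding the member identifier
    attributes: Dict[str, str]   # attribute name -> physical column name
    spine: Optional[str] = None  # hypothetical annotation, e.g. "cofog1"


# The three physical columns become attributes of a single dimension,
# and the spine annotation is attached once, to the dimension itself.
hauptfunktion = Dimension(
    name="hauptfunktion",
    key="hauptfunktionID",
    attributes={
        "label": "hauptfunktionLabel",
        "description": "hauptfunktionDesc",
    },
    spine="cofog1",
)

# Made-up fact rows; the amounts and IDs are purely illustrative.
rows = [
    {"hauptfunktionID": "0", "amount": 100},
    {"hauptfunktionID": "1", "amount": 50},
    {"hauptfunktionID": "1", "amount": 25},
    {"hauptfunktionID": "8", "amount": 10},
]

# Made-up reconciliation result (local ID -> spine code), standing in for
# what a crowd-sourced mapping step might return. Not a real alignment.
mapping = {"0": "01", "1": "09"}


def aggregate_by_spine(rows, dimension, mapping, measure="amount"):
    """Sum the measure per spine code; unreconciled members are kept visible."""
    totals = defaultdict(int)
    for row in rows:
        local_id = row[dimension.key]
        totals[mapping.get(local_id, "unmapped")] += row[measure]
    return dict(totals)


print(aggregate_by_spine(rows, hauptfunktion, mapping))
```

Note that the original column names survive untouched, so nothing about
"hauptfunktion" is lost, and oberfunktion and funktion could be declared as
further dimensions (or levels) in the same way.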

If you include this in OSDP, then the information in the datapackage.json
would actually be sufficient to construct meaningful OLAP cubes. We are
actually down to storage media at this point: instead of storing the model
in the database (like OSv1/SpenDB), you'd just be keeping it in a file on
S3. We can play rock, paper, scissors about this, or I could argue access
latency. Whatever :)
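
As a rough sketch of the "model in a file" idea: assume datapackage.json
carried a "model" section (a hypothetical key of my own invention here, not
something OSDP defines, and the dataset name is made up too). A service
could then read the descriptor and find out which dimensions want aligning
with a given spine without knowing anything about a database:

```python
import json

# Hypothetical descriptor; the "model", "dimensions" and "spine" keys are
# assumptions for illustration and are not defined by OSDP.
descriptor_text = """
{
  "name": "example-budget",
  "model": {
    "dimensions": {
      "hauptfunktion": {
        "key": "hauptfunktionID",
        "attributes": {
          "label": "hauptfunktionLabel",
          "description": "hauptfunktionDesc"
        },
        "spine": "cofog1"
      }
    }
  }
}
"""


def dimensions_for_spine(descriptor, spine):
    """Return the names of dimensions annotated with the given spine."""
    dims = descriptor.get("model", {}).get("dimensions", {})
    return sorted(
        name for name, dim in dims.items() if dim.get("spine") == spine
    )


descriptor = json.loads(descriptor_text)
print(dimensions_for_spine(descriptor, "cofog1"))
```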

> I think it is relevant (example: having annotations for functional
> classification mapping as part of the descriptor - this centralizes the
> information for use elsewhere, meaning, in an OLAP cube, but possibly in
> other services that might not need to *know* about an OLAP cube).
>

I would argue that the logical model is a useful increment in information
to have, independently of whether you're doing cubes, triangles or
expression dance with the data :)

Cheers,

- Friedrich