[openspending-dev] Experiment: flat-file aggregator "API"

Friedrich Lindenberg friedrich at pudo.org
Wed May 20 06:24:39 UTC 2015

Sounds good, let's move it over there!

- Fr.

On Wed, May 20, 2015 at 7:55 AM, Paul Walsh <paulywalsh at gmail.com> wrote:

> On 30 Apr 2015, at 12:59, Friedrich Lindenberg <friedrich at pudo.org> wrote:
> I'm stuck in a very boring airport, my apologies for getting back to you
> quickly.
> Well, sorry for the late reply ;).
> I see what you are saying here, but we now have multiple threads on
> different platforms about this. I think you addressed most of your points
> below in your metadata structure here:
> https://gist.github.com/pudo/d810d91778e73e991b48
> Do you mind if I centralize the discussion around this in a single thread
> on https://discuss.okfn.org/c/openspending? It will definitely be easier
> for me to track, and it will be easier for others to follow and jump in, if
> we have a single entry point to the discussions around new metadata and
> how it is structured for Open Spending.
> On Thu, Apr 30, 2015 at 11:30 AM, Paul Walsh <paulywalsh at gmail.com> wrote:
>> For (b), I have a bigger problem. Looking at the UK and DE budgets, I can
>> quickly figure out that I would like to compare Cofog1 with Hauptfunktion,
>> Cofog2 with Oberfunktion, Cofog3 with Funktion. That's the bit which the
>> standard would help me with.
>> But now I'm still left with the need to actually align these
>> classification schemes. So I want to 1) decide what my spine is, 2) map
>> specific local spend taxonomies to that spine, e.g. by throwing it into
>> PyBossa and getting a bunch of pol-sci students to work through it and 3)
>> generate aggregates in which the spine is used for aggregation instead of
>> the local classification.
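
The three steps above can be sketched in a few lines of Python. This is only an illustration, not OpenSpending code: the dataset names, local codes, spine identifiers and amounts are all invented, and `aggregate_by_spine` is a hypothetical helper. The mapping stands in for whatever a reconciliation exercise (e.g. in PyBossa) would produce.

```python
from collections import defaultdict

# Steps 1 and 2 (assumed data): a spine and a mapping from
# (dataset, local code) to a spine code. All values are made up.
LOCAL_TO_SPINE = {
    ("de", "0"): "cofog:01",
    ("de", "1"): "cofog:09",
    ("uk", "1"): "cofog:01",
    ("uk", "9"): "cofog:09",
}

# Invented spending entries classified with local codes.
ENTRIES = [
    {"dataset": "de", "local_code": "0", "amount": 100.0},
    {"dataset": "de", "local_code": "1", "amount": 50.0},
    {"dataset": "uk", "local_code": "1", "amount": 70.0},
    {"dataset": "uk", "local_code": "9", "amount": 30.0},
    {"dataset": "uk", "local_code": "7", "amount": 5.0},  # not mapped yet
]


def aggregate_by_spine(entries, mapping):
    """Step 3: sum amounts per spine code instead of per local code."""
    totals = defaultdict(float)
    for entry in entries:
        key = (entry["dataset"], entry["local_code"])
        # Entries without a mapping stay visible instead of being dropped.
        spine_code = mapping.get(key, "unmapped")
        totals[spine_code] += entry["amount"]
    return dict(totals)


if __name__ == "__main__":
    print(aggregate_by_spine(ENTRIES, LOCAL_TO_SPINE))
```

Keeping the mapping separate from the entries means it can be revised later without touching the source data.
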
>> To my knowledge, the only member of our community that is pulling this
>> off is Mark Brough (http://data.aidonbudget.org/SN/), and from his quiet
>> curses in our office I can tell that it's not an easy thing to do.
>> So I think that saying "if you want to perform comparative analysis, give
>> us your data aligned towards GFSM and COFOG spines" is pretty much shipping
>> around the problem. I think it would be much more fun if OpenSpending
>> actually provided the tools to do this. Imagine having a web service ( :D )
>> that uses annotations on the OpenSpending OLAP logical model to determine
>> which dimensions in a given dataset are supposed to be aligned with which
>> spine. It would then download all dimension members for the local dimension
>> and feed them to a PyBossa app, wait for reconciliation towards the spine
>> to complete and finally load the mapping back into our warehouse, where
>> they become global dimensions (
>> https://pythonhosted.org/cubes/model.html#dimension-visibility).
>> The cool thing about this is that it is iterative and community-driven,
>> i.e. the alignment does not need to exist before the data is stored and
>> loaded, but instead can be added dynamically. These things are political
>> and I would love to see fights break out about how to classify a certain
>> budget category - rather than implicitly hiding these mappings in the
>> source data, produced by whoever made the dataset.
>> That all sounds awesome, and again, there is no reason that such a
>> service could not feed data back into a Data Package (principle of
>> progressive enhancement in the specification):
>> to be clear, OSDP also doesn’t expect such metadata to be there *before*
>> it is loaded, but would ideally provide a way to declare such annotations
>> (e.g. GFSM to local classification), and data packages can be updated
>> (enhanced?) over time. If we think about this in terms of OpenSpending v2,
>> let’s say that initially, all data packages basically provide metadata on
>> amount/time/location. Over time, a microservice (:)) that provides something
>> as you describe could be used to help bring the entire datastore forward in
>> terms of expanding use cases, etc. (just thinking out loud).
> The issue with this is that OSDP doesn't have the notion of a
> "Hauptfunktion" (to stick with my example) which I could annotate to say
> "map this up with COFOG". Instead, OSDP will see some columns (let's say
> hauptfunktionID, hauptfunktionLabel, hauptfunktionDesc) and not understand
> that they form a common thing, so I would have to annotate any or all of
> them with the spine mapping info. In either case, it ends up being
> ambiguous.
> The OSDP solution to this is naming conventions: I rename my columns from
> hauptfunktionID to functionalID etc. and by convention this gets picked up.
> The problem I have with this is that it constitutes a loss of information
> (i.e. the term "hauptfunktion" has an actual legal meaning beyond
> functional classification), and, as an aside, it also doesn't seem to
> support hierarchies (i.e. hauptfunktion, oberfunktion, funktion would have
> to be reduced to one column set).
> The alternative is to define an explicit mapping in which I say that
> hauptfunktionID, hauptfunktionLabel, hauptfunktionDesc all form different
> attributes of the same dimension. Then I can say that this dimension should
> be mapped out to COFOG. That's what the OLAPpians call a logical model,
> which I keep hammering on about. If you want to see a data standard
> focussed around such modelling, I would point you at Google's DSPL (
> https://developers.google.com/public-data/docs/tutorial).
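
A minimal sketch of such an explicit mapping, in Python. The structure of `MODEL`, the `spine` key and the `dimension_members` helper are assumptions made for illustration; they are not part of OSDP, DSPL or cubes. Only the three column names come from the example above, and the row values are invented.

```python
# Hypothetical logical model: three physical columns are declared as
# attributes of one dimension, and the dimension as a whole (not each
# column) carries the spine annotation.
MODEL = {
    "dimensions": {
        "hauptfunktion": {
            "attributes": {
                "id": "hauptfunktionID",
                "label": "hauptfunktionLabel",
                "description": "hauptfunktionDesc",
            },
            "spine": "cofog",  # assumed annotation key
        }
    }
}


def dimension_members(rows, model, name):
    """Collect the distinct members of a dimension from flat rows."""
    attributes = model["dimensions"][name]["attributes"]
    members = {}
    for row in rows:
        # Translate physical column names into logical attribute names.
        member = {attr: row[column] for attr, column in attributes.items()}
        members[member["id"]] = member  # de-duplicate on the id attribute
    return list(members.values())


if __name__ == "__main__":
    rows = [  # invented sample rows
        {"hauptfunktionID": "0", "hauptfunktionLabel": "A",
         "hauptfunktionDesc": "first", "amount": 1.0},
        {"hauptfunktionID": "0", "hauptfunktionLabel": "A",
         "hauptfunktionDesc": "first", "amount": 2.0},
        {"hauptfunktionID": "1", "hauptfunktionLabel": "B",
         "hauptfunktionDesc": "second", "amount": 3.0},
    ]
    print(dimension_members(rows, MODEL, "hauptfunktion"))
```

The member list is the kind of thing a reconciliation step could take as input, and the original column names survive untouched.
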
> If you include this in OSDP, then the information in the datapackage.json
> would actually be sufficient to construct meaningful OLAP cubes. We are
> actually down to storage media at this point: instead of storing the model
> in the database (like OSv1/SpenDB), you'd just be keeping it in a file on
> S3. We can play rock, paper, scissors about this, or I could argue access
> latency. Whatever :)
>> I think it is relevant (example: having annotations for functional
>> classification mapping as part of the descriptor - this centralizes the
>> information for use elsewhere, meaning, in an OLAP cube, but possibly in
>> other services that might not need to *know* about an OLAP cube).
> I would argue that the logical model is a useful increment in information
> to have, independently of whether you're doing cubes, triangles or
> expression dance with the data :)
> Cheers,
> - Friedrich