[openspending-dev] Experiment: flat-file aggregator "API"

Wed May 20 07:07:31 UTC 2015

I started the thread: https://discuss.okfn.org/t/open-spending-data-structure-ideas-and-suggestions/300/1

I didn’t want to copy your points in - I thought you might prefer to do that yourself (but I’m happy to).

> On 20 May 2015, at 09:24, Friedrich Lindenberg <friedrich at pudo.org> wrote:
> 
> Sounds good, let's move it over there!
> 
> - Fr. 
> 
> On Wed, May 20, 2015 at 7:55 AM, Paul Walsh <paulywalsh at gmail.com> wrote:
> 
>> On 30 Apr 2015, at 12:59, Friedrich Lindenberg <friedrich at pudo.org> wrote:
>> 
>> I'm stuck in a very boring airport, my apologies for getting back to you quickly. 
> 
> Well, sorry for the late reply ;).
> 
> I see what you are saying here, but we now have multiple threads on different platforms about this. I think you addressed most of your points below in your metadata structure here: https://gist.github.com/pudo/d810d91778e73e991b48
> 
> Do you mind if I centralize the discussion around this in a single thread on https://discuss.okfn.org/c/openspending? It will definitely be easier for me to track, and it will be easier for others to follow and jump in, if we have a single entry point to the discussions around new metadata and how it is structured for Open Spending.
> 
>> 
>> On Thu, Apr 30, 2015 at 11:30 AM, Paul Walsh <paulywalsh at gmail.com> wrote:
>>> For (b), I have a bigger problem. Looking at the UK and DE budgets, I can quickly figure out that I would like to compare Cofog1 with Hauptfunktion, Cofog2 with Oberfunktion, Cofog3 with Funktion. That's the bit which the standard would help me with.
>>> 
>>> But now I'm still left with the need to actually align these classification schemes. So I want to 1) decide what my spine is, 2) map specific local spend taxonomies to that spine, e.g. by throwing it into PyBossa and getting a bunch of pol-sci students to work through it and 3) generate aggregates in which the spine is used for aggregation instead of the local classification.
>>> 
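The three steps above can be sketched roughly as follows. This is only a minimal illustration: the local codes, spine codes and amounts are invented placeholders (not real COFOG or German budget codes), and `aggregate_by_spine` is a hypothetical helper, not part of OpenSpending.

```python
from collections import defaultdict

# Hypothetical local budget lines: (local classification code, amount).
local_lines = [
    ("LOCAL-A", 120.0),
    ("LOCAL-B", 80.0),
    ("LOCAL-C", 50.0),
    ("LOCAL-Z", 10.0),  # not reconciled yet
]

# Steps 1 and 2: choose a spine and map local codes onto it.
# In practice this mapping would come out of a manual reconciliation effort.
local_to_spine = {
    "LOCAL-A": "SPINE-01",
    "LOCAL-B": "SPINE-01",
    "LOCAL-C": "SPINE-02",
}

UNMAPPED = "UNMAPPED"


def aggregate_by_spine(lines, mapping):
    """Step 3: sum amounts per spine code instead of per local code."""
    totals = defaultdict(float)
    for code, amount in lines:
        # Keep unreconciled codes visible rather than silently dropping them.
        totals[mapping.get(code, UNMAPPED)] += amount
    return dict(totals)


totals = aggregate_by_spine(local_lines, local_to_spine)
```

Keeping an explicit `UNMAPPED` bucket means the aggregates stay usable while the mapping is still incomplete.
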
>>> To my knowledge, the only member of our community that is pulling this off is Mark Brough (http://data.aidonbudget.org/SN/), and from his quiet curses in our office I can tell that it's not an easy thing to do.
>>> 
>>> So I think that saying "if you want to perform comparative analysis, give us your data aligned towards GFSM and COFOG spines" is pretty much sidestepping the problem. I think it would be much more fun if OpenSpending actually provided the tools to do this. Imagine having a web service ( :D ) that uses annotations on the OpenSpending OLAP logical model to determine which dimensions in a given dataset are supposed to be aligned with which spine. It would then download all dimension members for the local dimension and feed them to a PyBossa app, wait for reconciliation towards the spine to complete and finally load the mappings back into our warehouse, where they become global dimensions (https://pythonhosted.org/cubes/model.html#dimension-visibility).
>>> 
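As a rough sketch of the first half of that workflow, the snippet below reads a spine annotation off a model and collects the distinct members of each annotated dimension as reconciliation tasks. Everything here is assumed for illustration: the `key` and `spine` annotation names, the sample rows, and the task layout are hypothetical, and no PyBossa call is made.

```python
# Hypothetical model: each dimension names its key column and, optionally,
# the spine it should be aligned with.
model = {
    "hauptfunktion": {"key": "hauptfunktionID", "spine": "cofog1"},
    "year": {"key": "year"},  # no spine annotation, so nothing to reconcile
}

# Invented sample rows standing in for a loaded dataset.
rows = [
    {"hauptfunktionID": "0", "year": "2014", "amount": 10.0},
    {"hauptfunktionID": "1", "year": "2014", "amount": 20.0},
    {"hauptfunktionID": "0", "year": "2015", "amount": 30.0},
]


def reconciliation_tasks(model, rows):
    """Build one task per distinct member of every spine-annotated dimension."""
    tasks = []
    for name, spec in model.items():
        spine = spec.get("spine")
        if spine is None:
            continue
        # Sorted so the task list is stable between runs.
        members = sorted({row[spec["key"]] for row in rows})
        for member in members:
            tasks.append({"dimension": name, "member": member, "spine": spine})
    return tasks


tasks = reconciliation_tasks(model, rows)
```

The resulting list is what a crowdsourcing tool would be fed; the answers that come back would form the local-to-spine mapping.
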
>>> The cool thing about this is that it is iterative and community-driven, i.e. the alignment does not need to exist before the data is stored and loaded, but instead can be added dynamically. These things are political and I would love to see fights break out about how to classify a certain budget category - rather than implicitly hiding these mappings in the source data, produced by whoever made the dataset.
>> 
>> That all sounds awesome, and again, there is no reason that such a service could not feed data back into a Data Package (principle of progressive enhancement in the specification): 
>> 
>> To be clear, OSDP also doesn’t expect such metadata to be there *before* it is loaded, but would ideally provide a way to declare such annotations (e.g. GFSM to local classification), and data packages can be updated (enhanced?) over time. If we think about this in terms of OpenSpending v2, let’s say that initially, all data packages basically provide metadata on amount/time/location. Over time, a microservice (:)) that provides something as you describe could be used to help bring the entire datastore forward in terms of expanding use cases, etc. (just thinking out loud). 
>> 
>> The issue with this is that OSDP doesn't have the notion of a "Hauptfunktion" (to stick with my example) which I could annotate to say "map this up with COFOG". Instead, OSDP will see some columns (let's say hauptfunktionID, hauptfunktionLabel, hauptfunktionDesc) and not understand that they form a common thing, so I would have to annotate any or all of them with the spine mapping info. In either case, it ends up being ambiguous.
>> 
>> The OSDP solution to this is naming conventions: I rename my columns from hauptfunktionID to functionalID etc. and by convention this gets picked up. The problem I have with this is that it constitutes a loss of information (i.e. the term "hauptfunktion" has an actual legal meaning beyond functional classification), and, as an aside, it also doesn't seem to support hierarchies (i.e. hauptfunktion, oberfunktion, funktion would have to be reduced to one column set).
>> 
>> The alternative is to define an explicit mapping in which I say that hauptfunktionID, hauptfunktionLabel, hauptfunktionDesc all form different attributes of the same dimension. Then I can say that this dimension should be mapped out to COFOG. That's what the OLAPpians call a logical model, which I keep hammering on about. If you want to see a data standard focussed around such modelling, I would point you at Google's DSPL (https://developers.google.com/public-data/docs/tutorial).
>> 
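A minimal sketch of such an explicit mapping, assuming a hypothetical descriptor layout (the `attributes` and `maps_to` keys are invented here and are not part of OSDP or any Data Package specification): the three columns are declared as attributes of one dimension, which keeps its local name and carries the spine annotation once.

```python
# Hypothetical logical model: physical columns grouped into one dimension.
logical_model = {
    "hauptfunktion": {
        "attributes": {
            "id": "hauptfunktionID",
            "label": "hauptfunktionLabel",
            "description": "hauptfunktionDesc",
        },
        "maps_to": "COFOG",  # assumed annotation key, stated once per dimension
    },
}


def apply_model(row, model):
    """Turn a flat row into one nested object per dimension."""
    claimed = set()
    structured = {}
    for dimension, spec in model.items():
        structured[dimension] = {
            attr: row[column] for attr, column in spec["attributes"].items()
        }
        claimed.update(spec["attributes"].values())
    # Columns not claimed by any dimension (e.g. the amount) pass through.
    for column, value in row.items():
        if column not in claimed:
            structured[column] = value
    return structured


# Invented sample row for illustration.
row = {
    "hauptfunktionID": "0",
    "hauptfunktionLabel": "Example label",
    "hauptfunktionDesc": "Example description",
    "amount": 100.0,
}
structured = apply_model(row, logical_model)
```

Because the annotation sits on the dimension rather than on individual columns, the original column names and the term "hauptfunktion" survive untouched.
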
>> If you include this in OSDP, then the information in the datapackage.json would actually be sufficient to construct meaningful OLAP cubes. We are actually down to storage media at this point: instead of storing the model in the database (like OSv1/SpenDB), you'd just be keeping it in a file on S3. We can play rock, paper, scissors about this, or I could argue access latency. Whatever :) 
>> 
>> I think it is relevant (example: having annotations for functional classification mapping as part of the descriptor - this centralizes the information for use elsewhere, meaning, in an OLAP cube, but possibly in other services that might not need to *know* about an OLAP cube).
>> 
>> I would argue that the logical model is a useful increment in information to have, independently of whether you're doing cubes, triangles or expression dance with the data :) 
>>  
>> Cheers, 
>> 
>> - Friedrich 
> 
> 
