[openspending-dev] Experiment: flat-file aggregator "API"

Thu Apr 30 09:30:47 UTC 2015

> On 30 Apr 2015, at 11:27, Friedrich Lindenberg <friedrich at pudo.org> wrote:
> 
> Good points all around, this is getting fun! Some thoughts inline. 
> 
> On Thu, Apr 30, 2015 at 8:46 AM, Paul Walsh <paulywalsh at gmail.com> wrote:
> > It had come across my inbox, but I hadn't really grokked it. Looking at your proposal now, I only really have one question: how is this really different from spendb? If you do a web interface, you do some sort of authz, you do some sort of mapping, ETL progress stuff, etc. -- this is exactly where you will end up. I can only beg of you, please have a look at the code base: https://github.com/mapthemoney/spendb …
> 
> I’ll try to take a deeper look next week. But the bottom line is, as far as I see, that the only substantial difference in the short term is the microservices/monolith thing. In the long term, the microservices approach is supposed to *facilitate* growth of an expanded ecosystem.
> 
> Yep, I understand where you're coming from, even though I really don't think we should let ourselves get away with phrases like "facilitate growth of expanded ecosystems". It's the open data equivalent of development's "empowering people in developing countries": I vaguely understand that you're trying to do something nice, but I have no idea how you plan to achieve it.

Point taken on the marketing speak :). 

> 
> More concretely, my take-away from doing data tech so far is that the re-use potential of (general purpose) libraries is much larger than that of API services. That's why I enjoy cubes, goodtables, tellme, loadkit, archivekit and apikit much more than a diagram bubble called "event hub" (which implies that you want me to hook into your internal message passing system somehow?).
> 
> SpenDB is 4400 SLOC at the moment, and I believe that it can be further simplified in some places. You can split it up into three services with 2000 SLOC apiece, but that really doesn't buy you that much more than HTTP latency.

Fair enough, I don’t disagree with your comments about general purpose libraries. Let’s just leave the monolith/microservices thing aside for now, because there are more interesting issues to collaborate on.

>  
> 
> I see what you are saying. I think it is better to think of the OSDP/BDP flat file storage of OS v2, as a whole, as a mesh of data with certain predictable attributes on each data point.
> 
> This alone *is* valuable, and BDP (from which OSDP is derived) will help deliver a whole range of pros in comparative analysis and high-level standardisation. Time and time again over the last several years I’ve been dealing with government budgets, standardisation and comparative analysis have been the most prominent aspects of interest.
> 
> I don't buy that "standardisation" in itself is a goal, I think it's a means to achieve other things. And as a method, it's got a somewhat spotty record: 12 years after XBRL was released, getting basic accounts data of all SEC companies would still be a total bitch, 5 years after IATI was released I am not familiar with a single Aid Information Management System that would consume IATI's XML format (rumor has it DG is working on this).
>  
> 
> "comparative analysis", however, is certainly a real use case. Let's pick two concrete examples and play it through: a) I want to compare per-capita secondary education expenditure across different German federal states, and b) I want to compare defence expenditure amongst all EU member states.
> 
> For (a), I will have a group of datasets which (by virtue of a federal law) will all contain a dimension named "Hauptfunktion" (second-level functional expenditure classification, for those not speaking ze language). So I can just aggregate spending in the given Hauptfunktion for each state, normalise per capita, and make purty bar charts. Done. Renaming my columns according to BDP would have done me absolutely no good (except that all the local politicians who know Hauptfunktionen would have been confused).

Agreed. OSDP will have a mapping object for things so that renaming is not required. I think there is a good chance that something like this will land in BDP too.
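
To make case (a) concrete, here is a rough sketch of per-capita aggregation that goes through a mapping object instead of renaming columns. Everything in it is an assumption for illustration: the mapping keys, the column names "Land" and "Betrag", the codes "HF-1"/"HF-2", and all figures are invented, and this is not the OSDP mapping format.

```python
from collections import defaultdict

# Hypothetical mapping object: logical concept -> local column name.
# These keys are illustrative only and are not taken from OSDP or BDP.
MAPPING = {
    "functional_level_2": "Hauptfunktion",
    "amount": "Betrag",
    "state": "Land",
}

# Toy rows with invented codes and amounts.
ROWS = [
    {"Land": "A", "Hauptfunktion": "HF-1", "Betrag": 300.0},
    {"Land": "A", "Hauptfunktion": "HF-1", "Betrag": 100.0},
    {"Land": "A", "Hauptfunktion": "HF-2", "Betrag": 50.0},
    {"Land": "B", "Hauptfunktion": "HF-1", "Betrag": 900.0},
]

# Invented population figures per state.
POPULATION = {"A": 100, "B": 300}


def per_capita(rows, mapping, population, member):
    """Sum spending for one classification member per state, then divide by population."""
    func_col = mapping["functional_level_2"]
    amount_col = mapping["amount"]
    state_col = mapping["state"]

    totals = defaultdict(float)
    for row in rows:
        if row[func_col] == member:
            totals[row[state_col]] += row[amount_col]
    return {state: total / population[state] for state, total in totals.items()}


result = per_capita(ROWS, MAPPING, POPULATION, "HF-1")
print(result)
```

The local column keeps its local name ("Hauptfunktion"); only the mapping knows which logical concept it stands for.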

> 
> For (b), I have a bigger problem. Looking at the UK and DE budgets, I can quickly figure out that I would like to compare Cofog1 with Hauptfunktion, Cofog2 with Oberfunktion, Cofog3 with Funktion. That's the bit which the standard would help me with.
> 
> But now I'm still left with the need to actually align these classification schemes. So I want to 1) decide what my spine is, 2) map specific local spend taxonomies to that spine, e.g. by throwing it into PyBossa and getting a bunch of pol-sci students to work through it and 3) generate aggregates in which the spine is used for aggregation instead of the local classification.
> 
> To my knowledge, the only member of our community that is pulling this off is Mark Brough (http://data.aidonbudget.org/SN/), and from his quiet curses in our office I can tell that it's not an easy thing to do.
> 
> So I think that saying "if you want to perform comparative analysis, give us your data aligned towards GFSM and COFOG spines" is pretty much shipping around the problem. I think it would be much more fun if OpenSpending actually provided the tools to do this. Imagine having a web service ( :D ) that uses annotations on the OpenSpending OLAP logical model to determine which dimensions in a given dataset are supposed to be aligned with which spine. It would then download all dimension members for the local dimension and feed them to a PyBossa app, wait for reconciliation towards the spine to complete and finally load the mappings back into our warehouse, where they become global dimensions (https://pythonhosted.org/cubes/model.html#dimension-visibility).
> 
> The cool thing about this is that it is iterative and community-driven, i.e. the alignment does not need to exist before the data is stored and loaded, but instead can be added dynamically. These things are political and I would love to see fights break out about how to classify a certain budget category - rather than implicitly hiding these mappings in the source data, produced by whoever made the dataset.

That all sounds awesome, and again, there is no reason that such a service could not feed data back into a Data Package (principle of progressive enhancement in the specification): 

to be clear, OSDP also doesn’t expect such metadata to be there *before* it is loaded, but would ideally provide a way to declare such annotations (e.g. GFSM to local classification), and data packages can be updated (enhanced?) over time. If we think about this in terms of OpenSpending v2, let’s say that initially, all data packages basically provide metadata on amount/time/location. Over time, a microservice (:)) that provides something as you describe could be used to help bring the entire datastore forward in terms of expanding use cases, etc. (just thinking out loud).
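
Thinking out loud in code, a minimal sketch of that progressive enhancement: the package is loaded with no alignment at all, and a spine mapping is attached later and used for aggregation. The descriptor layout, the "annotations" and "spine_mapping" keys, and the codes are all hypothetical; none of this is taken from the OSDP or BDP specs.

```python
from collections import defaultdict

UNMAPPED = "unmapped"


def aggregate_on_spine(rows, local_to_spine):
    """Aggregate amounts by spine code; members without a mapping stay visible."""
    totals = defaultdict(float)
    for row in rows:
        spine_code = local_to_spine.get(row["classification"], UNMAPPED)
        totals[spine_code] += row["amount"]
    return dict(totals)


# Hypothetical descriptor: loaded first without any alignment annotations.
descriptor = {"name": "example-budget", "annotations": {}}

# Toy rows with invented local codes and amounts.
rows = [
    {"classification": "local-1", "amount": 10.0},
    {"classification": "local-2", "amount": 5.0},
    {"classification": "local-3", "amount": 2.0},
]

# Before reconciliation: nothing is aligned, but the data is already usable.
before = aggregate_on_spine(rows, descriptor["annotations"].get("spine_mapping", {}))

# Later, a reconciliation step (e.g. a crowdsourcing task) returns a partial
# mapping, which is written back into the descriptor.
descriptor["annotations"]["spine_mapping"] = {
    "local-1": "spine-A",
    "local-2": "spine-A",
}

after = aggregate_on_spine(rows, descriptor["annotations"]["spine_mapping"])
print(before, after)
```

Keeping unmapped members in their own bucket, rather than dropping them, leaves the gaps in the alignment visible so they can be argued about and filled in later.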

> 
> Again, this solution is, as far as I can tell, totally independent of OSDP. 
>  
> Well, no, OSEP-04 is trying to address this issue (https://github.com/pwalsh/osep/blob/feature/osep-04/osep-04.md)…
> 
> quoting:
> 
>         • Packaging either normalized or denormalized data sources for use in OpenSpending.
>         • Packaging resources that are referenced by the spend data proper, but that do not actually contain spend data. This could mean, for example, rich data on the recipients of funds, or projects associated with a particular set of data.
> 
> The basic idea being that any OSDP can have resources that are not strictly budget line data.
> 
> Point taken.
> 
> > So, considering (2), I honestly ask for clarification: what sort of value-add does the OSDP provide for me as someone who wants to make really awesome OLAP cubes out of budgets?
> 
> They (OSDP and awesome OLAP cubes) are not mutually exclusive.
> 
> Never claimed that, I am just exploring if and how the intermediate step of producing OSDP is relevant toward the goal of an OLAP representation of the data. (To say it in OSv2 lingo: does OSEP7 depend on OSEP4?)

I think it is relevant (example: having annotations for functional classification mapping as part of the descriptor - this centralizes the information for use elsewhere, meaning, in an OLAP cube, but possibly in other services that might not need to *know* about an OLAP cube).

>  
> > (On that note, I would really enjoy having a conversation about extended metadata, cf. https://github.com/mapthemoney/spendb/blob/master/contrib/MODEL.md#dataset-level-metadata).
> 
> You do realise that most of this is in Budget Data Package, right? It is a great list though, and we should definitely discuss it specifically. If you don’t want to discuss in the context of OSEP-04, that’s fine, but it would be awesome for me to use such a discussion to make OSEP-04 as good as it can be. Do you want to start a new thread, or do it on IRC, or, have a hangout with a few interested parties?
> 
> Yep, I do know that a lot of this overlaps. My goal is basically to establish a dataset browser that does not primarily focus on dataset titles, and instead gives me a good overview of what part of a government's budgeting cycle is expressed in any dataset, and how recent and complete it is.
> 
> I'd love to hang out in IRC about this; currently in transit but should be back on the web proper by 4pm CET?

Not sure if Rufus, or Tryggvi, or anyone else is interested and available around that time, so go for it if yes, but I won’t be available. I’d love to chat about it though, maybe I’ll just try to ping you on IRC over the coming days.

Paul
> 
> Cheers, 
> 
> - Friedrich 
> 
>  
