[openspending-dev] Experiment: flat-file aggregator "API"

Paul Walsh paulywalsh at gmail.com
Thu Apr 30 06:46:08 UTC 2015


Hi,

Just some quick points inline:

> On 30 Apr 2015, at 01:07, Friedrich Lindenberg <friedrich at pudo.org> wrote:
> 
> Hey Paul,
> 
> thanks for these links, comments inline. 
> 
> On Wed, Apr 29, 2015 at 10:03 PM, Paul Walsh <paulywalsh at gmail.com> wrote:
> Cubepress looks like a really good start on aggregate data in flat files. Nice work. You know that this is part of the desired implementation for OS v2, and it at least looks like it could be used as part of the “spike solution” I proposed yesterday as a way to get OS v2 rolling (https://discuss.okfn.org/t/2015-near-term-technical-roadmap-for-openspending/264/2).
> 
> It had come across my inbox, but I hadn't really grokked it. Looking at your proposal now, I only really have one question: how is this really different from spendb? If you do a web interface, you do some sort of authz, you do some sort of mapping, ETL progress stuff, etc. -- this is exactly where you will end up. I can only beg of you, please have a look at the code base: https://github.com/mapthemoney/spendb
I’ll try to take a deeper look next week. But the bottom line, as far as I can see, is that the only substantial difference in the short term is microservices versus a monolith. In the long term, the microservices approach is supposed to *facilitate* the growth of an expanded ecosystem.

> 
> From my reading of this spec, there are two major differences in your proposal: 
> 
> 1) It generates static aggregates. This is obviously also something that I'm interested in, but I don't believe that it covers nearly enough use cases to be a medium- to long-term solution. At best, it's a band-aid for really small datasets. Still, I would obviously like to have a tool which computes the number of aggregates for a given set of queries, and, if that number is reasonable, generates them for upload to S3. That is something that would fit really neatly into SpenDB.

So, I think that resolving this takes a working session to get OSEP-07 (http://labs.openspending.org/osep/osep-07.html) to a reasonable draft stage, with more detail on the relation between flat files and some other “thing” that produces more complex/varied queries (which, in the microservices architecture, would likely be a distinct service).
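
To make the "count the aggregates first, then generate them" idea concrete, here is a minimal Python sketch. It is an illustration only, not how Cubepress or SpenDB works: the sample rows, the column names (`year`, `function`, `region`, `amount`) and the `max_cells` threshold are all assumptions. It counts how many aggregate cells every combination of dimensions would produce, and only materialises the static aggregates if that number is reasonable.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample rows; column names are illustrative only.
ROWS = [
    {"year": 2014, "function": "health", "region": "north", "amount": 100.0},
    {"year": 2014, "function": "education", "region": "north", "amount": 50.0},
    {"year": 2015, "function": "health", "region": "south", "amount": 120.0},
    {"year": 2015, "function": "health", "region": "north", "amount": 30.0},
]


def drilldowns(dimensions):
    """Every combination of dimensions, including () for the grand total."""
    result = [()]
    for size in range(1, len(dimensions) + 1):
        result.extend(combinations(dimensions, size))
    return result


def count_cells(rows, dimensions):
    """Number of aggregate cells across all drilldowns, without summing."""
    return sum(
        len({tuple(row[d] for d in dims) for row in rows})
        for dims in drilldowns(dimensions)
    )


def aggregate(rows, dims, measure="amount"):
    """Sum the measure for each distinct combination of values in dims."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)


def build_static_aggregates(rows, dimensions, max_cells=1000):
    """Return {drilldown: {cell: total}}, or None if there are too many cells.

    max_cells is an arbitrary, assumed cut-off for "reasonable".
    """
    if count_cells(rows, dimensions) > max_cells:
        return None
    return {dims: aggregate(rows, dims) for dims in drilldowns(dimensions)}


aggregates = build_static_aggregates(ROWS, ["year", "function"])
```

Each entry of `aggregates` could then be written out as one flat file; anything that falls outside the precomputed drilldowns is the case for the separate query service.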

>  
> 
> 2) It uses OpenSpending Data Packages (OSDP?). These are probably fun if you're trying to build a data catalogue which keeps data uploaded by governments. But when you want to do OLAP modelling on data provided by other people who have to more or less deal with the data they can get, I think the "data standard" cons outweigh the pros:
> 
> First off, as Stefan Urbanek (my personal Ralph Kimball) would probably say: the standard mainly describes the physical structure of the data, but not the logical model by which they are to be queried (cf. https://pythonhosted.org/cubes/model.html). The latter is required to actually do aggregates in a meaningful way; and in terms of modelling it is the hard part.

> Second, the more semantic aspects (i.e. the "recommended" and "special" fields in BDP) are what would make this useful for analysis, and BDP makes this all into a really weird convention compliance problem where you
> 
> a) read the BDP spec (yay volunteers reading and interpreting dozen-page specs!),
> b) map your budgets to GFSM and COFOG (how? what happens to the original classifications?),
> c) then indicate the semantics of these items through magic column naming. 

I see what you are saying. I think it is better to think of the OSDP/BDP flat-file storage in OS v2, taken as a whole, as a mesh of data with certain predictable attributes on each data point.

This alone *is* valuable, and BDP (from which OSDP is derived) will help deliver a whole range of pros in comparative analysis and high-level standardisation. Time and time again over the several years that I’ve been dealing with government budgets, standardisation and comparative analysis have been the most prominent areas of interest.

As for the hard part - there are a few hard parts. I think you are greatly underestimating the value of a shared specification with a minimal set of required attributes, and an extended set of recommended attributes for progressive enhancement of the metadata. Sure, some stuff is now *required* on BDP that is hard to enforce on “normal” users (like COFOG), and that is one of the major drivers for OSDP as an intermediate step (and also see the proposed changes for BDP here: https://github.com/openspending/budget-data-package/pull/33).

And specifically for how this relates to OLAP modelling: 

No one is claiming that the basis of flat-file data packages provides everything required for OLAP modelling. However, it does provide value as a common structure/standard for spend data, and it does allow for an OLAP microservice to be built that would go further with this data.

> 
> That just seems like a major regression on UX to me. We had this four years ago in OpenSpending ("to", "from", "amount" and "time") and it was a horrible mess. This is just a more fancy version of the same mess. Defining the semantics of a dataset's dimensions should be a part of the user interface, and not some weird specification text exegesis exercise and magic file formats.
> 
> In the process, you're also likely to constrain the sorts of data that can be imported, e.g. I can't load that contract awards data that I really want to play with (because it doesn't use these budget-style dimensions).

Well, no, OSEP-04 is trying to address this issue (https://github.com/pwalsh/osep/blob/feature/osep-04/osep-04.md)…

quoting:

	• Packaging either normalized or denormalized data sources for use in OpenSpending.
	• Packaging resources that are referenced by the spend data proper, but that do not actually contain spend data. This could mean, for example, rich data on the recipients of funds, or projects associated with a particular set of data.

The basic idea is that any OSDP can have resources that are not strictly budget line data.
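
As a rough illustration of that idea, here is a hypothetical descriptor written as a Python dict. It loosely follows the general Data Package layout (a `resources` list whose entries have a `name`, a `path` and a `schema`), but the package name, the field names, and the rule used to tell spend data from supporting data are assumptions made for this sketch, not something taken from OSEP-04.

```python
# Hypothetical descriptor: one resource with spend lines, and one with
# rich data on recipients that the spend lines refer to by id.
DESCRIPTOR = {
    "name": "example-city-spending",
    "resources": [
        {
            "name": "spend",
            "path": "data/spend.csv",
            "schema": {"fields": [
                {"name": "recipient_id", "type": "string"},
                {"name": "amount", "type": "number"},
            ]},
        },
        {
            "name": "recipients",
            "path": "data/recipients.csv",
            "schema": {"fields": [
                {"name": "id", "type": "string"},
                {"name": "legal_name", "type": "string"},
            ]},
        },
    ],
}


def has_field(resource, field_name):
    """True if the resource's schema declares a field with this name."""
    return any(f["name"] == field_name for f in resource["schema"]["fields"])


def split_resources(descriptor, measure="amount"):
    """Split resource names into (spend, supporting).

    Assumption for this sketch: a resource holds spend data if it has
    the measure field. A real spec may mark this differently.
    """
    spend, supporting = [], []
    for resource in descriptor["resources"]:
        target = spend if has_field(resource, measure) else supporting
        target.append(resource["name"])
    return spend, supporting


spend_names, supporting_names = split_resources(DESCRIPTOR)
```

The contract awards case would look similar: the package carries whatever tables the data needs, and only some of them have to look like budget lines.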

>  
> The OSEP 4 pull request Rufus refers to is one I’ve been working on here: https://github.com/openspending/osep/pull/13 (probably we have some work to do, and any comments would be welcome).
> 
> So, considering (2), I honestly ask for clarification: what sort of value-add does the OSDP provide for me as someone who wants to make really awesome OLAP cubes out of budgets?


They (OSDP and awesome OLAP cubes) are not mutually exclusive.
 
> 
> We've established column types, and "recommended/special fields" can assist the creation of a logical model if they exist - but they will not replace it.
> 
> You might argue dataset-level metadata, but I would say that in 2015 it would be expected for that stuff to be editable through a web interface, rather than some file-based solution.

Sure, everyone wants it to be editable via a web interface. That is not precluded by having file descriptors for standardisation.

> (On that note, I would really enjoy having a conversation about extended metadata, cf. https://github.com/mapthemoney/spendb/blob/master/contrib/MODEL.md#dataset-level-metadata). 

You do realise that most of this is in Budget Data Package, right? It is a great list though, and we should definitely discuss it specifically. If you don’t want to discuss in the context of OSEP-04, that’s fine, but it would be awesome for me to use such a discussion to make OSEP-04 as good as it can be. Do you want to start a new thread, or do it on IRC, or, have a hangout with a few interested parties?

> 
> Cheers, 
> 
> - Friedrich 
> 


