[openspending-dev] Experiment: flat-file aggregator "API"

Friedrich Lindenberg friedrich at pudo.org
Thu Apr 30 08:29:59 UTC 2015

Looks like they just blogged about the alignment work yesterday:

- Friedrich

On Thu, Apr 30, 2015 at 10:27 AM, Friedrich Lindenberg <friedrich at pudo.org> wrote:

> Good points all around, this is getting fun! Some thoughts inline.
> On Thu, Apr 30, 2015 at 8:46 AM, Paul Walsh <paulywalsh at gmail.com> wrote:
>> > It had come across my inbox, but I hadn't really grokked it. Looking at
>> > your proposal now, I only really have one question: how is this really
>> > different from spendb? If you do a web interface, you do some sort of
>> > authz, you do some sort of mapping, ETL progress stuff, etc. -- this is
>> > exactly where you will end up. I can only beg of you, please have a look at
>> > the code base: https://github.com/mapthemoney/spendb
>> I’ll try to take a deeper look next week. But the bottom line is, as far
>> as I see, that the only substantial difference in the short term is the
>> microservices/monolith thing. In the long term, the microservices approach
>> is supposed to *facilitate* growth of an expanded ecosystem.
> Yep, I understand where you're coming from, even though I really don't
> think we should let ourselves get away with phrases like "facilitate growth
> of expanded ecosystems". It's the open data equivalent of development's
> "empowering people in developing countries": I vaguely understand that
> you're trying to do something nice, but I have no idea how you plan to
> achieve it.
> More concretely, my take-away from doing data tech so far is that the
> re-use potential of (general purpose) libraries is much larger than that of
> API services. That's why I enjoy cubes, goodtables, tellme, loadkit,
> archivekit and apikit much more than a diagram bubble called "event hub"
> (which implies that you want me to hook into your internal message passing
> system somehow?).
> SpenDB is 4400 SLOC at the moment, and I believe that it can be further
> simplified in some places. You can split it up into three services with
> 2000 SLOC apiece, but that really doesn't buy you that much more than HTTP
> latency.
>> I see what you are saying. I think it is better to think of the OSDP/BDP
>> flat file storage of OS v2, as a whole, as a mesh of data with certain
>> predictable attributes on each data point.
>> This alone *is* valuable, and BDP (from which OSDP is derived) will help
>> deliver a whole range of pros in comparative analysis and high-level
>> standardisation. Time and time again over the last several years I’ve been
>> dealing with government budgets, standardisation and comparative analysis
>> have been the most prominent aspects of interest.
> I don't buy that "standardisation" in itself is a goal, I think it's a
> means to achieve other things. And as a method, it's got a somewhat spotty
> record: 12 years after XBRL was released, getting basic accounts data of
> all SEC companies would still be a total bitch, 5 years after IATI was
> released I am not familiar with a single Aid Information Management System
> that would consume IATI's XML format (rumor has it DG is working on this).
> "comparative analysis", however, is certainly a real use case. Let's pick
> two concrete examples and play it through: a) I want to compare per-capita
> secondary education expenditure across different German federal states, and
> b) I want to compare defence expenditure amongst all EU member states.
> For (a), I will have a group of datasets which (by virtue of a federal
> law) will all contain a dimension named "Hauptfunktion" (second-level
> functional expenditure classification, for those not speaking ze language).
> So I can just aggregate spending in the given Hauptfunktion for each state,
> normalise per capita, and make purty bar charts. Done. Renaming my columns
> according to BDP would have done me absolutely no good (except that all the
> local politicians who know Hauptfunktionen would have been confused).
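A minimal sketch of case (a), assuming the state budgets have already been loaded as simple (state, Hauptfunktion code, amount) rows. The codes, amounts and population figures below are invented purely for illustration, and `per_capita_spending` is a hypothetical helper, not part of any OpenSpending code:

```python
from collections import defaultdict

# Hypothetical budget lines: (state, Hauptfunktion code, amount).
# Codes and figures are made up for illustration only.
ROWS = [
    ("Berlin", "1", 1200.0),
    ("Berlin", "1", 800.0),
    ("Berlin", "2", 500.0),
    ("Bayern", "1", 6000.0),
    ("Bayern", "3", 900.0),
]

# Hypothetical population figures, also invented.
POPULATION = {"Berlin": 4, "Bayern": 10}


def per_capita_spending(rows, population, hauptfunktion):
    """Sum spending in one Hauptfunktion per state, then divide by population."""
    totals = defaultdict(float)
    for state, code, amount in rows:
        if code == hauptfunktion:
            totals[state] += amount
    return {state: total / population[state] for state, total in totals.items()}


RESULT = per_capita_spending(ROWS, POPULATION, "1")
```

Because every state uses the same dimension name and codes, no renaming or mapping step is needed before aggregating.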
> For (b), I have a bigger problem. Looking at the UK and DE budgets, I can
> quickly figure out that I would like to compare Cofog1 with Hauptfunktion,
> Cofog2 with Oberfunktion, Cofog3 with Funktion. That's the bit which the
> standard would help me with.
> But now I'm still left with the need to actually align these
> classification schemes. So I want to 1) decide what my spine is, 2) map
> specific local spend taxonomies to that spine, e.g. by throwing it into
> PyBossa and getting a bunch of pol-sci students to work through it and 3)
> generate aggregates in which the spine is used for aggregation instead of
> the local classification.
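Steps 1) to 3) could look roughly like the sketch below, assuming the reconciliation work has already produced a lookup table from local codes to spine codes. All codes here are placeholders rather than real COFOG or Hauptfunktion values, and `aggregate_by_spine` is a hypothetical helper:

```python
from collections import defaultdict

# Hypothetical result of step 2: (country, local code) -> spine code.
# All codes are placeholders, not real classification entries.
SPINE_MAPPING = {
    ("DE", "HF-A"): "SPINE-DEFENCE",
    ("UK", "CF-X"): "SPINE-DEFENCE",
    ("UK", "CF-Y"): "SPINE-EDUCATION",
}

# Hypothetical spending rows: (country, local code, amount).
ROWS = [
    ("DE", "HF-A", 30.0),
    ("DE", "HF-B", 5.0),  # no mapping yet
    ("UK", "CF-X", 40.0),
    ("UK", "CF-Y", 70.0),
]


def aggregate_by_spine(rows, mapping):
    """Step 3: aggregate on the spine code instead of the local code.

    Returns the totals plus the set of local codes that still lack a
    mapping, so they can be sent back for reconciliation.
    """
    totals = defaultdict(float)
    unmapped = set()
    for country, local_code, amount in rows:
        spine_code = mapping.get((country, local_code))
        if spine_code is None:
            unmapped.add((country, local_code))
            continue
        totals[(country, spine_code)] += amount
    return dict(totals), unmapped


TOTALS, UNMAPPED = aggregate_by_spine(ROWS, SPINE_MAPPING)
```

Returning the unmapped codes separately keeps the gaps in the alignment visible instead of silently dropping that spending.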
> To my knowledge, the only member of our community that is pulling this off
> is Mark Brough (http://data.aidonbudget.org/SN/), and from his quiet
> curses in our office I can tell that it's not an easy thing to do.
> So I think that saying "if you want to perform comparative analysis, give
> us your data aligned towards GFSM and COFOG spines" is pretty much shipping
> around the problem. I think it would be much more fun if OpenSpending
> actually provided the tools to do this. Imagine having a web service ( :D )
> that uses annotations on the OpenSpending OLAP logical model to determine
> which dimensions in a given dataset are supposed to be aligned with which
> spine. It would then download all dimension members for the local dimension
> and feed them to a PyBossa app, wait for reconciliation towards the spine
> to complete and finally load the mapping back into our warehouse, where
> they become global dimensions (
> https://pythonhosted.org/cubes/model.html#dimension-visibility).
> The cool thing about this is that it is iterative and community-driven,
> i.e. the alignment does not need to exist before the data is stored and
> loaded, but instead can be added dynamically. These things are political
> and I would love to see fights break out about how to classify a certain
> budget category - rather than implicitly hiding these mappings in the
> source data, produced by whoever made the dataset.
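As a rough sketch of the first half of that service, assume each dimension in the logical model carries a free-form metadata dict, and that a `spine` key inside it marks the dimension for alignment. That key, the model layout and the task format are all assumptions made for illustration; nothing here is an existing OpenSpending, cubes or PyBossa convention, and no PyBossa calls are shown:

```python
# Hypothetical logical model: the "spine" annotation is an assumed
# convention, not something defined by OpenSpending or cubes.
MODEL = {
    "dimensions": [
        {"name": "hauptfunktion", "info": {"spine": "cofog1"}},
        {"name": "titel", "info": {}},
    ]
}

# Hypothetical fact rows keyed by dimension name.
ROWS = [
    {"hauptfunktion": "HF-B", "titel": "t1"},
    {"hauptfunktion": "HF-A", "titel": "t2"},
    {"hauptfunktion": "HF-A", "titel": "t3"},
]


def dimensions_to_align(model):
    """Return {dimension name: spine name} for every annotated dimension."""
    return {
        dim["name"]: dim["info"]["spine"]
        for dim in model["dimensions"]
        if dim.get("info", {}).get("spine")
    }


def reconciliation_tasks(model, rows):
    """Build one task per distinct member of each annotated dimension."""
    tasks = []
    for dimension, spine in sorted(dimensions_to_align(model).items()):
        members = sorted({row[dimension] for row in rows})
        for member in members:
            tasks.append({"dimension": dimension, "member": member, "spine": spine})
    return tasks


TASKS = reconciliation_tasks(MODEL, ROWS)
```

Each task would then be handed to whatever crowdsourcing tool does the reconciliation, and the answers loaded back as a mapping once they are complete.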
> Again, this solution is, as far as I can tell, totally independent of
>> Well, no, OSEP-04 is trying to address this issue (
>> https://github.com/pwalsh/osep/blob/feature/osep-04/osep-04.md)…
>> quoting:
>>         • Packaging either normalized or denormalized data sources for
>> use in OpenSpending.
>>         • Packaging resources that are referenced by the spend data
>> proper, but that do not actually contain spend data. This could mean, for
>> example, rich data on the recipients of funds, or projects associated with
>> a particular set of data.
>> The basic idea being that any OSDP can have resources that are not
>> strictly budget line data.
> Point taken.
>> > So, considering (2), I honestly ask for clarification: what sort of
>> > value-add does the OSDP provide for me as someone who wants to make really
>> > awesome OLAP cubes out of budgets?
>> They (OSDP and awesome OLAP cubes) are not mutually exclusive.
> Never claimed that, I am just exploring if and how the intermediate step
> of producing OSDP is relevant toward the goal of an OLAP representation of
> the data. (To say it in OSv2 lingo: does OSEP7 depend on OSEP4?)
>> > (On that note, I would really enjoy having a conversation about
>> > extended metadata, cf.
>> > https://github.com/mapthemoney/spendb/blob/master/contrib/MODEL.md#dataset-level-metadata
>> > ).
>> You do realise that most of this is in Budget Data Package, right? It is
>> a great list though, and we should definitely discuss it specifically. If
>> you don’t want to discuss in the context of OSEP-04, that’s fine, but it
>> would be awesome for me to use such a discussion to make OSEP-04 as good as
>> it can be. Do you want to start a new thread, or do it on IRC, or, have a
>> hangout with a few interested parties?
> Yep, I do know that a lot of this overlaps. My goal is basically to
> establish a dataset browser that does not primarily focus on dataset
> titles, and instead gives me a good overview of what part of a government's
> budgeting cycle is expressed in any dataset, and how recent and complete it
> is.
> I'd love to hang out in IRC about this; currently in transit but should be
> back on the web proper by 4pm CET?
> Cheers,
> - Friedrich