[openspending-dev] Experiment: flat-file aggregator "API"

Thu Apr 30 08:27:43 UTC 2015

Good points all around, this is getting fun! Some thoughts inline.

On Thu, Apr 30, 2015 at 8:46 AM, Paul Walsh <paulywalsh at gmail.com> wrote:

> > It had come across my inbox, but I hadn't really grokked it. Looking at
> > your proposal now, I only really have one question: how is this really
> > different from spendb? If you do a web interface, you do some sort of
> > authz, you do some sort of mapping, ETL progress stuff, etc. -- this is
> > exactly where you will end up. I can only beg of you, please have a look at
> > the code base: https://github.com/mapthemoney/spendb …
>
> I’ll try to take a deeper look next week. But the bottom line is, as far
> as I see, that the only substantial difference in the short term is the
> microservices/monolith thing. In the long term, the microservices approach
> is supposed to *facilitate* growth of an expanded ecosystem.
>

Yep, I understand where you're coming from, even though I really don't
think we should let ourselves get away with phrases like "facilitate growth
of expanded ecosystems". It's the open data equivalent of development's
"empowering people in developing countries": I vaguely understand that
you're trying to do something nice, but I have no idea how you plan to
achieve it.

More concretely, my take-away from doing data tech so far is that the
re-use potential of (general purpose) libraries is much larger than that of
API services. That's why I enjoy cubes, goodtables, tellme, loadkit,
archivekit and apikit much more than a diagram bubble called "event hub"
(which implies that you want me to hook into your internal message passing
system somehow?).

SpenDB is 4400 SLOC at the moment, and I believe that it can be further
simplified in some places. You can split it up into three services with
2000 SLOC apiece, but that really doesn't buy you that much more than HTTP
latency.

> I see what you are saying. I think it is better to think of the OSDP/BDP
> flat file storage of OS v2, as a whole, as a mesh of data with certain
> predictable attributes on each data point.
>
> This alone *is* valuable, and BDP (from which OSDP is derived) will help
> deliver a whole range of pros in comparative analysis and high-level
> standardisation. Time and time again over the last several years I’ve been
> dealing with government budgets, standardisation and comparative analysis
> have been the most prominent aspects of interest.
>

I don't buy that "standardisation" in itself is a goal; I think it's a
means to achieve other things. And as a method, it's got a somewhat spotty
record: 12 years after XBRL was released, getting basic accounts data of
all SEC companies would still be a total bitch, and 5 years after IATI was
released I am not familiar with a single Aid Information Management System
that would consume IATI's XML format (rumor has it DG is working on this).

"comparative analysis", however, is certainly a real use case. Let's pick
two concrete examples and play it through: a) I want to compare per-capita
secondary education expenditure across different German federal states, and
b) I want to compare defence expenditure amongst all EU member states.

For (a), I will have a group of datasets which (by virtue of a federal law)
will all contain a dimension named "Hauptfunktion" (second-level functional
expenditure classification, for those not speaking ze language). So I can
just aggregate spending in the given Hauptfunktion for each state,
normalise per capita, and make purty bar charts. Done. Renaming my columns
according to BDP would have done me absolutely no good (except that all the
local politicians who know Hauptfunktionen would have been confused).
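
As a rough illustration of (a), here is a minimal sketch in plain Python. It
assumes each state's dataset has already been reduced to rows carrying a
Hauptfunktion code and an amount. The state names, codes, amounts and
population figures are all made up, and per_capita is a hypothetical helper,
not part of any existing library.

```python
from collections import defaultdict

# Hypothetical rows: (state, Hauptfunktion code, amount). Values are made up.
ROWS = [
    ("Berlin", "HF-A", 1200.0),
    ("Berlin", "HF-A", 800.0),
    ("Berlin", "HF-B", 500.0),
    ("Bayern", "HF-A", 6000.0),
    ("Bayern", "HF-C", 900.0),
]

# Hypothetical population figures, for illustration only.
POPULATION = {"Berlin": 4, "Bayern": 12}


def per_capita(rows, population, hauptfunktion):
    """Sum spending in one Hauptfunktion per state, then divide by population."""
    totals = defaultdict(float)
    for state, code, amount in rows:
        if code == hauptfunktion:
            totals[state] += amount
    return {state: total / population[state] for state, total in totals.items()}


result = per_capita(ROWS, POPULATION, "HF-A")
```

Nothing here depends on how the columns are named, as long as every dataset
uses the same local classification.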

For (b), I have a bigger problem. Looking at the UK and DE budgets, I can
quickly figure out that I would like to compare Cofog1 with Hauptfunktion,
Cofog2 with Oberfunktion, Cofog3 with Funktion. That's the bit which the
standard would help me with.

But now I'm still left with the need to actually align these classification
schemes. So I want to 1) decide what my spine is, 2) map specific local
spend taxonomies to that spine, e.g. by throwing it into PyBossa and
getting a bunch of pol-sci students to work through it and 3) generate
aggregates in which the spine is used for aggregation instead of the local
classification.
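
Step 3 could look roughly like the sketch below, assuming the mapping from
step 2 arrives as a simple lookup table. The dataset names, local codes, spine
codes and amounts are invented for illustration, and aggregate_by_spine is a
hypothetical helper.

```python
from collections import defaultdict

# Hypothetical result of step 2: (dataset, local code) -> spine code.
SPINE_MAPPING = {
    ("de", "HF-0"): "defence",
    ("uk", "COFOG-02"): "defence",
    ("de", "HF-1"): "education",
}

# Hypothetical entries: (dataset, local classification code, amount).
ENTRIES = [
    ("de", "HF-0", 30.0),
    ("de", "HF-1", 100.0),
    ("uk", "COFOG-02", 45.0),
    ("uk", "COFOG-09", 80.0),  # no mapping yet
]


def aggregate_by_spine(entries, mapping):
    """Aggregate amounts per (dataset, spine code); collect unmapped codes."""
    totals = defaultdict(float)
    unmapped = []
    for dataset, code, amount in entries:
        spine_code = mapping.get((dataset, code))
        if spine_code is None:
            # Keep track of what still needs to be reconciled.
            unmapped.append((dataset, code))
            continue
        totals[(dataset, spine_code)] += amount
    return dict(totals), unmapped


totals, unmapped = aggregate_by_spine(ENTRIES, SPINE_MAPPING)
```

Returning the unmapped codes separately means the aggregates can be produced
before the mapping is complete.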

To my knowledge, the only member of our community that is pulling this off
is Mark Brough (http://data.aidonbudget.org/SN/), and from his quiet curses
in our office I can tell that it's not an easy thing to do.

So I think that saying "if you want to perform comparative analysis, give
us your data aligned towards GFSM and COFOG spines" is pretty much shipping
around the problem. I think it would be much more fun if OpenSpending
actually provided the tools to do this. Imagine having a web service ( :D )
that uses annotations on the OpenSpending OLAP logical model to determine
which dimensions in a given dataset are supposed to be aligned with which
spine. It would then download all dimension members for the local dimension
and feed them to a PyBossa app, wait for reconciliation towards the spine
to complete and finally load the mappings back into our warehouse, where
they become global dimensions (
https://pythonhosted.org/cubes/model.html#dimension-visibility).
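
To make the annotation idea a bit more tangible, here is a small sketch of the
first half of that workflow: reading annotations off a model and turning local
dimension members into reconciliation tasks. The "aligns_with" key, the model
layout and reconciliation_tasks are assumptions made up for this example, not
an existing cubes or PyBossa feature. The PyBossa hand-off and the load back
into the warehouse are left out.

```python
# Hypothetical logical model: a dimension may carry an "aligns_with"
# annotation naming the spine it should be reconciled against.
MODEL = {
    "dimensions": [
        {"name": "hauptfunktion", "info": {"aligns_with": "cofog1"}},
        {"name": "oberfunktion", "info": {"aligns_with": "cofog2"}},
        {"name": "titel", "info": {}},  # no annotation, so it is skipped
    ]
}


def reconciliation_tasks(model, members_by_dimension):
    """Build one task per member of every annotated dimension."""
    tasks = []
    for dimension in model["dimensions"]:
        spine = dimension.get("info", {}).get("aligns_with")
        if spine is None:
            continue
        for member in members_by_dimension.get(dimension["name"], []):
            tasks.append(
                {"dimension": dimension["name"], "member": member, "spine": spine}
            )
    return tasks


# Hypothetical dimension members, as they might be downloaded from a dataset.
members = {"hauptfunktion": ["0", "1"], "titel": ["x"]}
tasks = reconciliation_tasks(MODEL, members)
```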

The cool thing about this is that it is iterative and community-driven,
i.e. the alignment does not need to exist before the data is stored and
loaded, but instead can be added dynamically. These things are political
and I would love to see fights break out about how to classify a certain
budget category - rather than implicitly hiding these mappings in the
source data, produced by whoever made the dataset.

Again, this solution is, as far as I can tell, totally independent of OSDP.

> Well, no, OSEP-04 is trying to address this issue (
> https://github.com/pwalsh/osep/blob/feature/osep-04/osep-04.md)…
>
> quoting:
>
>         • Packaging either normalized or denormalized data sources for use
> in OpenSpending.
>         • Packaging resources that are referenced by the spend data
> proper, but that do not actually contain spend data. This could mean, for
> example, rich data on the recipients of funds, or projects associated with
> a particular set of data.
>
> The basic idea being that any OSDP can have resources that are not
> strictly budget line data.
>

Point taken.

> > So, considering (2), I honestly ask for clarification: what sort of
> > value-add does the OSDP provide for me as someone who wants to make really
> > awesome OLAP cubes out of budgets?
>
> They (OSDP and awesome OLAP cubes) are not mutually exclusive.
>

Never claimed that, I am just exploring if and how the intermediate step of
producing OSDP is relevant toward the goal of an OLAP representation of the
data. (To say it in OSv2 lingo: does OSEP7 depend on OSEP4?)

> > (On that note, I would really enjoy having a conversation about extended
> > metadata, cf.
> > https://github.com/mapthemoney/spendb/blob/master/contrib/MODEL.md#dataset-level-metadata
> > ).
>
> You do realise that most of this is in Budget Data Package, right? It is a
> great list though, and we should definitely discuss it specifically. If you
> don’t want to discuss in the context of OSEP-04, that’s fine, but it would
> be awesome for me to use such a discussion to make OSEP-04 as good as it
> can be. Do you want to start a new thread, or do it on IRC, or, have a
> hangout with a few interested parties?
>

Yep, I do know that a lot of this overlaps. My goal is basically to
establish a dataset browser that does not primarily focus on dataset
titles, and instead gives me a good overview of what part of a government's
budgeting cycle is expressed in any dataset, and how recent and complete it
is.

I'd love to hang out in IRC about this; currently in transit but should be
back on the web proper by 4pm CET?

Cheers,

- Friedrich