[openspending-dev] Experiment: flat-file aggregator "API"

Wed Apr 29 22:07:55 UTC 2015

Hey Paul,

thanks for these links, comments inline.

On Wed, Apr 29, 2015 at 10:03 PM, Paul Walsh <paulywalsh at gmail.com> wrote:
>
> Cubepress looks like a really good start on aggregate data in flat files.
> Nice work. You know that this is part of the desired implementation for OS
> v2, and it at least looks like it could be used as part of the “spike
> solution” I proposed yesterday as a way to get OS v2 rolling (
> https://discuss.okfn.org/t/2015-near-term-technical-roadmap-for-openspending/264/2
> ).
>

It had come across my inbox, but I hadn't really grokked it. Looking at
your proposal now, I only have one question: how is this really
different from SpenDB? If you do a web interface, you do some sort of
authz, you do some sort of mapping, ETL progress stuff, etc. -- this is
exactly where you will end up. I can only beg of you, please have a look at
the code base: https://github.com/mapthemoney/spendb ...

From my reading of this spec, there are two major differences in your
proposal:

1) It generates static aggregates. This is obviously also something that
I'm interested in, but I don't believe that it covers nearly enough use
cases to be a medium- to long-term solution. At best, it's a band-aid for
really small datasets. Still, I would obviously like to have a tool which
computes the number of aggregates for a given set of queries, and, if that
number is reasonable, generates them for upload to S3. That is something
that would fit really neatly into SpenDB.
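
A rough sketch of that counting step, in case it helps the discussion. The
function names, the sample rows and the cell limit are all made up for
illustration; this is not SpenDB or Cubepress code. The assumption is that
each query is a drilldown over a tuple of dimensions, and that it needs one
aggregate cell per distinct combination of values found in the data:

```python
def count_aggregate_cells(rows, drilldowns):
    """Count the aggregate cells each drilldown would produce.

    rows: list of dicts, one per fact row (hypothetical layout).
    drilldowns: list of tuples of dimension names, one tuple per query.
    """
    counts = {}
    for dims in drilldowns:
        # One cell per distinct combination of dimension values.
        cells = {tuple(row[d] for d in dims) for row in rows}
        counts[dims] = len(cells)
    return counts


def is_reasonable(counts, limit=10000):
    """Decide whether the total number of cells is small enough to pre-generate.

    The default limit is an arbitrary placeholder, not a recommendation.
    """
    return sum(counts.values()) <= limit


rows = [
    {"year": 2014, "cofog": "01", "region": "north", "amount": 10.0},
    {"year": 2014, "cofog": "02", "region": "north", "amount": 5.0},
    {"year": 2015, "cofog": "01", "region": "south", "amount": 7.5},
    {"year": 2015, "cofog": "01", "region": "north", "amount": 2.5},
]
drilldowns = [("year",), ("year", "cofog"), ("cofog", "region")]
counts = count_aggregate_cells(rows, drilldowns)
```

Only if is_reasonable(counts) holds would the tool go on to compute the
sums and write one file per drilldown for upload.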

2) It uses OpenSpending Data Packages (OSDP?). These are probably fun if
you're trying to build a data catalogue which keeps data uploaded by
governments. But when you want to do OLAP modelling on data provided by
other people who have to more or less deal with the data they can get, I
think the "data standard" cons outweigh the pros:

First off, as Stefan Urbanek (my personal Ralph Kimball) would probably
say: the standard mainly describes the physical structure of the data, but
not the logical model by which they are to be queried (cf.
https://pythonhosted.org/cubes/model.html). The latter is required to
actually do aggregates in a meaningful way; and in terms of modelling it is
the hard part.
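
To illustrate the distinction with a minimal sketch: the dict layout below
is invented for this example and is not the cubes model format or anything
from the data package specs. The physical description only says which
columns exist and what types they have; the logical model says which column
is a measure, which columns form a dimension, and how to aggregate:

```python
from collections import defaultdict

# Physical structure: column names and types, nothing more.
physical_columns = {
    "yr": "integer",
    "func_code": "string",
    "func_label": "string",
    "amt": "number",
}

# Logical model (hypothetical layout): what the columns mean for querying.
logical_model = {
    "measures": {"amount": {"column": "amt", "aggregate": "sum"}},
    "dimensions": {
        "time": {"key": "yr"},
        "function": {"key": "func_code", "label": "func_label"},
    },
}


def aggregate(rows, model, dimension, measure="amount"):
    """Sum a measure grouped by the key column of one logical dimension."""
    key_col = model["dimensions"][dimension]["key"]
    value_col = model["measures"][measure]["column"]
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_col]] += row[value_col]
    return dict(totals)


rows = [
    {"yr": 2014, "func_code": "01", "func_label": "General services", "amt": 10.0},
    {"yr": 2014, "func_code": "07", "func_label": "Health", "amt": 5.0},
    {"yr": 2015, "func_code": "07", "func_label": "Health", "amt": 7.5},
]
by_function = aggregate(rows, logical_model, "function")
by_time = aggregate(rows, logical_model, "time")
```

Nothing in physical_columns alone tells you that "amt" is the thing to sum
or that "func_code" and "func_label" belong together; that knowledge has to
come from the logical model.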

Second, the more semantic aspects (i.e. the "recommended" and "special"
fields in the Budget Data Package, BDP) are what would make this useful for
analysis, and BDP makes this all into a really weird convention compliance
problem where you

a) read the BDP spec (yay volunteers reading and interpreting dozen-page
specs!),
b) map your budgets to GFSM and COFOG (how? what happens to the original
classifications?),
c) then indicate the semantics of these items through magic column naming.

That just seems like a major regression on UX to me. We had this four years
ago in OpenSpending ("to", "from", "amount" and "time") and it was a
horrible mess. This is just a fancier version of the same mess. Defining
the semantics of a dataset's dimensions should be a part of the user
interface, and not some weird specification text exegesis exercise and
magic file formats.

In the process, you're also likely to constrain the sorts of data that can
be imported, e.g. I can't load that contract awards data that I really want
to play with (because it doesn't use these budget-style dimensions).

> The OSEP 4 pull request Rufus refers to is one I’ve been working on here:
> https://github.com/openspending/osep/pull/13 (probably we have some work
> to do, and any comments would be welcome).
>

So, considering (2), I honestly ask for clarification: what sort of
value-add does the OSDP provide for me as someone who wants to make really
awesome OLAP cubes out of budgets?

We've established column types, and "recommended/special fields" can assist
the creation of a logical model if they exist - but they will not replace it.

You might argue dataset-level metadata, but I would say that in 2015 it
would be expected for that stuff to be editable through a web interface,
rather than some file-based solution. (On that note, I would really enjoy
having a conversation about extended metadata, cf.
https://github.com/mapthemoney/spendb/blob/master/contrib/MODEL.md#dataset-level-metadata
).

Cheers,

- Friedrich