[openspending-dev] Micro-services: OpenSpending's future architecture

Tryggvi Björgvinsson tryggvi.bjorgvinsson at okfn.org
Sun Dec 28 00:53:17 UTC 2014

Hey Stefan,

Great to have your thoughts on this as well. Awesome questions you ask.

Some more replies in-line.

On Mon 22 Dec 2014 21:32, Stefan Urbanek wrote:
> <snip>

> I’ve looked at the architecture at [2]. We need to elaborate the
> “storage” and “analytics” boxes, as this is the core of the
> OpenSpending, the rest is just interface between the platform and
> hardware or the platform and humans, if I can oversimplify it.

Agreed. I like how you approach this from the master data perspective.
In the current setup, the master data is the data in the OLAP cube (we
could think of the source links to the external sites as master data,
but they are not reliable enough).

In my proposal, the storage box is the master data source and the
management of it. So perhaps it would be better to call it master data
management instead of storage as that would be clearer. I don't know if
everyone agrees with me, so that's something to discuss.

> The whole data architecture perspective is missing there. What we need
> to do or define?
> 1. Definition of the master data store entities: mappings,
> enumerations, data catalogue (used by the ETL and DQ)
> 2. Data quality metrics and processes to measure data quality at every
> stage of the data processing
> 3. Logical distinction of data based on their maturity and quality
> 4. System structures (to support proper DQ and ETL provenance)
> DQ and MD goes hand-in-hand. You can’t have good MD if you don’t do
> DQ, you can’t do proper DQ if you don’t have any reference to base it
> on (you can do only the trivial ones).

Do you think the budget data package standard would work as reference
master data? I think we won't be able to map everything onto budget data
packages so we might have two different master data management services
or we might have to launch a conversion project or something.

Could you help us go through this? Maybe we can tackle these questions
in a developer meeting.

> To assure good data architecture components mentioned above, we should
> follow few rules:
> 1. Listen to the “customers” – what are their needs and what are their
> capabilities, build around them

As you can see from the discussion between me and Friedrich, this is
something we need to discuss and define better before we can start.

We should perhaps take this discussion to the general OpenSpending
mailing list because that's where the current users ("customers") are.

> 2. Grow from the bottom, but have a general scaffold  [3] – don’t
> over-design, but don’t do wild growth [4]


> 3. Prepare the DQ infrastructure first – it is something like
> unit-testing, but continuous, for your data based on metadata

That's a missing component and yes, it's definitely something we should
do. Paul Walsh is preparing a validation service around data packages
(something he's been discussing on the OKFN Labs mailing list) which
could be useful to us in this regard.
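To make the "continuous unit-testing for data" idea concrete, here is a
rough sketch of what a metadata-driven check could look like. The schema
format and field names below are my own invention for illustration, not
the actual data package spec or Paul's validator:

```python
# Sketch of unit-test-style data quality checks driven by metadata.
# The schema dict below is hypothetical, not the real Data Package spec.
schema = {
    "fields": [
        {"name": "time", "type": "date", "required": True},
        {"name": "amount", "type": "number", "required": True},
    ]
}

def check_row(row, schema):
    """Return a list of quality errors for one row of data."""
    errors = []
    for field in schema["fields"]:
        name = field["name"]
        value = row.get(name)
        if field.get("required") and value in (None, ""):
            errors.append("missing required field: %s" % name)
        elif field["type"] == "number":
            try:
                float(value)
            except (TypeError, ValueError):
                errors.append("not a number: %s=%r" % (name, value))
    return errors

rows = [
    {"time": "2014-01-01", "amount": "100.0"},
    {"time": "2014-01-02", "amount": "n/a"},  # fails the number check
]
report = [check_row(r, schema) for r in rows]
```

The point being that such checks could run continuously as new data
arrives, with the schema itself stored as metadata rather than hard-coded.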

> 4. Use metadata wherever you can – for your ETLs and DQ there are
> going to be a lots of repeatable patterns, also you can have non-tech
> user oriented interface to them, so you will not need to do a lots of
> low-level hacking to influence the ETLs
> The whole process will involve lots of modelling, modelling and also
> modelling. Quite a lot of remodelling in between. Reporting
> requirements might change, sources might change, master data might
> change, the platform has to be ready for that → the way to go is the
> way of metadata.


> There is one more concern that needs to be addressed very early in the
> design stage. OS has great potential of becoming the referential
> source for spending/financial data. This kind of data has to be
> trusted. To increase trust in the data attention should be paid to the
> DQ and MDM.  It would be hard to add it later, the best way is to design
> around it from the very beginning.
> Start thinking about how we can address that.

That's a big question but yes, agreed. Something we definitely need to
think about.

Let me take a stab at your questions below but others might have
different opinions.

> From data governance perspective:
> * What from the above already exists, even just conceptually, in the
> current OS?

As I said, Paul Walsh is creating a validator that might be useful, and
there is the osvalidate code which does some of that as well.

There is scaffolding for modelling a dataset that allows for a lot of
flexibility, but it maps onto an OLAP cube.

There is an implementation of a provenance system in place that can be
used for manual quality assurance. It's not being used at the moment but
might be worth revisiting.

We collect a limited amount of metadata, I would say, so there's
definitely room for improvement there.

> * How the quality is managed now? What dimensions are being observed?

There is no data quality management in place. It's kind of like the wild
west, where every dataset is free-standing with no validation except
"can it be modelled and imported without errors into an OLAP cube?"

Most of the use cases we have now are budget nerds who bring with them a
dataset which they model and then use in some application they build
themselves (or is built for them) so the quality assurance takes place
outside the system.

> * What would be good starting point data quality indicators?

That's a big question.

At a higher level, I think most people would be interested in the
following (although this is based on my perception of our target
audience):

* Does this come from an official source or is it collected/created by
a third party?
* Has the data been modified in any way?
* Is something excluded from the data?
* Is it understandable?
* Does it contain all information for useful analysis (what would that
information be)?
* How "fresh" is the data?
* Is it comparable to other data (older budgets or budgets from others)?

This is really hard to answer automatically, but we could get a long way
by defining some indicators. Are you thinking of something else here?
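A couple of the indicators above could be computed automatically today.
A rough sketch follows; the metadata field names (`last_updated`,
`source_url`) are hypothetical, and the official-source check is
deliberately crude:

```python
from datetime import date
from urllib.parse import urlparse

# Hedged sketch: two automatable quality indicators. The metadata
# field names here are made up for illustration.
def freshness_days(dataset_meta, today=None):
    """How many days since the dataset was last updated."""
    today = today or date.today()
    updated = date.fromisoformat(dataset_meta["last_updated"])
    return (today - updated).days

def is_official_source(dataset_meta, official_domains):
    """Crude 'official source?' indicator based on the source URL's domain."""
    domain = urlparse(dataset_meta["source_url"]).netloc
    return any(domain == d or domain.endswith("." + d)
               for d in official_domains)

meta = {"last_updated": "2014-12-01",
        "source_url": "http://data.gov.uk/budget.csv"}
freshness_days(meta, today=date(2014, 12, 28))  # 27 days
is_official_source(meta, ["gov.uk"])            # True
```

The harder indicators (understandability, completeness, comparability)
would still need human judgement, but even these two would be a step up
from no indicators at all.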

> From conceptual modelling/master data perspective:
> * What are the core master data entities? What is their difference if
> compared with the Budget Data Package (BDP) [5]?

I'll let others give their opinions on this as I'm pretty biased around
the Budget Data Package.

> * What are the existing mappings and enumerations and how are they
> used? What is good/wrong with them? What is good/bad with the
> mapping/consolidation process?

Currently OpenSpending has only two required attributes: time and amount.
It recommends (but does not require) two dimensions: from (the government
entity) and to (the recipient).

Everything else is up to the user. This leads to OpenSpending being a
big repository of single-use data (not likely to be re-used). This is an
understandable design, as there is no standard way of publishing data
and it is impossible to know what the input data will be. The mapping
process is difficult because it requires the mapper to have a good
knowledge of OLAP, but once a mapping is done, it can be applied
automatically to all future datasets with the same headers in the CSV.
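To illustrate the "map once, reuse for the same headers" point, here is
a toy sketch; the saved-mapping format is made up, not what OpenSpending
actually stores:

```python
import csv
import io

# Hypothetical store of mappings keyed by CSV header tuple; the values
# map each source column to its role in the OLAP model.
saved_mappings = {
    ("Year", "Spend", "Dept"): {"Year": "time",
                                "Spend": "amount",
                                "Dept": "from"},
}

def map_csv(text):
    """Apply a previously saved mapping to a CSV with known headers."""
    reader = csv.DictReader(io.StringIO(text))
    headers = tuple(reader.fieldnames)
    mapping = saved_mappings.get(headers)
    if mapping is None:
        raise ValueError("no saved mapping for headers %r" % (headers,))
    return [{mapping[k]: v for k, v in row.items()} for row in reader]

map_csv("Year,Spend,Dept\n2014,100,Health\n")
# -> [{"time": "2014", "amount": "100", "from": "Health"}]
```

The expensive human step (understanding the columns well enough to map
them) happens once per header layout, and every later dataset with the
same layout gets mapped for free.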

> * What are the challenges of transformations between local and shared
> master data?

Because everyone can do whatever they want when loading data, we are
very unlikely to be able to convert without serious effort, even if
something like the budget data package is flexible. Language could also
become a big barrier.

> We can move forward after this assessment.
> At the end, short quote related to the master data from our discussion
> to remember:
> <Stiivi> just for clarification: master data does not mean single way
> of classifying things, it means rather “known, agreed-upon, explicitly
> documented” way(s) of structuring and classifying, so we might have
> more than one ways
> <pudo_> yeah, definitely :) I think trying otherwise could start wars
> I have a two weeks of hack-holidays now, will try to come up with some
> doodle-requests for comments. I am also open to a hangout (I’m in
> Pacific time during those two weeks, then will be back in the Eastern
> time).

Would you be willing to join a developer meeting where we can focus on
these questions and discussions?

We have developer meetings on the first Thursday of every month, but for
January I think we should aim for January 8 (instead of January 1) which
falls outside your two week hack-holidays. We could also arrange a
"Stefan hangout" outside the dev meetings if that works better for you.

Thanks a lot for your help on this. You're bringing awesome stuff to the
table and a lot of things we as a community need to discuss.
