[openspending-dev] Micro-services: OpenSpending's future architecture

Stefan Urbanek stefan.urbanek at gmail.com
Mon Dec 22 21:32:43 UTC 2014

Hi Tryggvi, Friedrich, Rufus,

I’m happy to see this initiative of breaking things into smaller pieces being proposed. OpenSpending is an early-stage data warehouse whose current requirements are beyond its capabilities. Friedrich and I had a nice, brief overview chat about it on IRC; see the transcript at [1]. Here is my first follow-up after that discussion.

I’ve looked at the architecture at [2]. We need to elaborate the “storage” and “analytics” boxes, as they are the core of OpenSpending; the rest is just an interface between the platform and hardware or between the platform and humans, if I may oversimplify. The whole data architecture perspective is missing there. What do we need to do or define?

1. Definition of the master data store entities: mappings, enumerations, data catalogue (used by the ETL and DQ)
2. Data quality metrics and processes to measure data quality at every stage of the data processing
3. Logical distinction of data based on their maturity and quality
4. System structures (to support proper DQ and ETL provenance)

Data quality (DQ) and master data (MD) go hand in hand. You can’t have good MD if you don’t do DQ, and you can’t do proper DQ if you don’t have a reference to base it on (you can only do the trivial checks).

To get the data architecture components mentioned above right, we should follow a few rules:

1. Listen to the “customers” – what are their needs and what are their capabilities, build around them
2. Grow from the bottom, but have a general scaffold [3] – don’t over-design, but don’t allow wild growth [4]
3. Prepare the DQ infrastructure first – it is something like unit testing for your data, but continuous and driven by metadata
4. Use metadata wherever you can – your ETLs and DQ are going to have lots of repeatable patterns; you can also offer a non-technical, user-oriented interface to them, so you will not need to do lots of low-level hacking to influence the ETLs
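
To make rules 3 and 4 a bit more concrete, here is a minimal sketch of what metadata-driven DQ checks could look like. The rules are plain data, and one generic runner applies them to any batch of records, so the same measurement can be repeated at every stage of processing. The field names, rule types and the enumeration are made-up assumptions for illustration, not existing OpenSpending structures.

```python
# Hypothetical metadata: field names, rule types and the enumeration below
# are illustrative assumptions, not existing OpenSpending structures.
FUNCTION_CODES = {"01", "02", "03"}  # stand-in for a master-data enumeration

RULES = [
    {"field": "amount", "check": "not_null"},        # trivial check
    {"field": "amount", "check": "numeric"},         # trivial check
    {"field": "function_code", "check": "in_enumeration",
     "values": FUNCTION_CODES},                      # needs reference data
]


def passes(rule, record):
    """Return True if a single record satisfies a single rule."""
    value = record.get(rule["field"])
    check = rule["check"]
    if check == "not_null":
        return value not in (None, "")
    if check == "numeric":
        try:
            float(value)
            return True
        except (TypeError, ValueError):
            return False
    if check == "in_enumeration":
        return value in rule["values"]
    raise ValueError(f"unknown check type: {check}")


def measure_quality(records, rules):
    """Return the share of passing records per (field, check) pair."""
    report = {}
    for rule in rules:
        passed = sum(1 for record in records if passes(rule, record))
        key = (rule["field"], rule["check"])
        # An empty batch is reported as fully passing (nothing failed).
        report[key] = passed / len(records) if records else 1.0
    return report


records = [
    {"amount": "100.50", "function_code": "01"},
    {"amount": "", "function_code": "02"},
    {"amount": "n/a", "function_code": "99"},
]
report = measure_quality(records, RULES)
```

Adding a new check then means adding a line of metadata rather than touching ETL code, and the last rule shows why DQ depends on master data: it cannot be evaluated without a reference enumeration.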

The whole process will involve lots of modelling, modelling and also modelling. Quite a lot of remodelling in between. Reporting requirements might change, sources might change, master data might change, the platform has to be ready for that → the way to go is the way of metadata.

There is one more concern that needs to be addressed very early in the design stage. OS has great potential to become the reference source for spending/financial data. This kind of data has to be trusted. To increase trust in the data, attention should be paid to DQ and MDM (master data management). It would be hard to add this later; the best way is to design around it from the very beginning.

Start thinking about how we can address that.

From a data governance perspective:

* What from the above already exists, even just conceptually, in the current OS?
* How is quality managed now? What dimensions are being observed?
* What would be good data quality indicators to start with?

From a conceptual modelling/master data perspective:

* What are the core master data entities? How do they differ from those in the Budget Data Package (BDP) [5]?
* What are the existing mappings and enumerations and how are they used? What is good/bad about them? What is good/bad about the mapping/consolidation process?
* What are the challenges of transformations between local and shared master data?
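
As one way to frame the last question, here is a minimal sketch of mapping local classification codes onto a shared one. It assumes a simple lookup table keyed by dataset and local code. All dataset names and codes are made up for illustration. The part that matters is that unmapped codes are reported instead of being silently dropped, since those are where the consolidation process needs attention.

```python
# Hypothetical mapping table: local (per-dataset) classification codes mapped
# to a shared classification. All names and codes are made up for illustration.
LOCAL_TO_SHARED = {
    ("city-a", "EDU-1"): "09",
    ("city-a", "HLT"): "07",
    ("city-b", "4100"): "09",
}


def to_shared(dataset, local_code, mapping):
    """Return the shared code for a local code, or None if it is unmapped."""
    return mapping.get((dataset, local_code))


def consolidate(dataset, rows, mapping):
    """Split rows into mapped rows (with a shared code) and unmapped codes."""
    mapped, unmapped = [], set()
    for row in rows:
        shared = to_shared(dataset, row["code"], mapping)
        if shared is None:
            unmapped.add(row["code"])
        else:
            mapped.append({**row, "shared_code": shared})
    return mapped, sorted(unmapped)


rows = [
    {"code": "EDU-1", "amount": 1200.0},
    {"code": "HLT", "amount": 800.0},
    {"code": "ROADS", "amount": 300.0},
]
mapped, unmapped = consolidate("city-a", rows, LOCAL_TO_SHARED)
```

The share of unmapped codes per dataset could itself serve as a data quality indicator.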

We can move forward after this assessment.

To close, here is a short quote about master data from our discussion that is worth remembering:

<Stiivi> just for clarification: master data does not mean single way of classifying things, it means rather “known, agreed-upon, explicitly documented” way(s) of structuring and classifying, so we might have more than one ways
<pudo_> yeah, definitely :) I think trying otherwise could start wars

I have two weeks of hack-holidays now and will try to come up with some doodle-requests for comments. I am also open to a hangout (I’m on Pacific time during those two weeks, then I will be back on Eastern time).



[1] https://botbot.me/freenode/okfn/2014-12-22/?msg=28143608&page=1
[2] https://github.com/openspending/osep/blob/gh-pages/01-approach-and-architecture-of-openspending.md
[3] http://img.wikinut.com/img/s9z.350dko.fkeyx/jpeg/0/Garden-Peas.jpeg
[4] http://www.theprepperjournal.com/wp-content/uploads/2013/08/WeedsInYourGarden-708x404.jpg
[5] https://github.com/openspending/budget-data-package/blob/master/specification.md

I brew data
Home: www.stiivi.com
Twitter: @Stiivi

> On 22 Dec 2014, at 09:16, Friedrich Lindenberg <friedrich.lindenberg at okfn.org> wrote:
> Hey Tryggvi, 
> thanks for writing up these thoughts, I think this is an incredibly valuable discussion for us to have around OpenSpending. In many ways, I agree with you: I also believe that OpenSpending would be a better piece of software if it was more modular.
> That change would help us to define great APIs, make the code base clearer and perhaps it would also lead to more contributions - although I can't help but consider that a hypothesis, rather than a natural conclusion.
> I think the main reason OS hasn't gathered massive numbers of contributors is that the intersection of people who a) know about BI to some extent, b) care about public finance from a civic point of view, c) care about open source and d) aren't into starting their own thing is just very, very small.
> What you are describing is a very appealing vision - a notion of small pieces loosely joined. It represents the best of FOSS design.
> Unfortunately, I'm not sure that FOSS design is what's going to help OpenSpending have real-world impact. I believe your choice of primary target audience (developers and data wranglers) is determined by OKF's financial constraints and not by looking at the kinds of problems which OpenSpending could help to solve.
> I also think that this particular part of the FOSS ethic needs to be reformed badly. And I think that OpenSpending could be a great case study in doing so.
> In the open source community, the idea that centralisation is bad has turned into a sort of anachronistic dogma. While the commercial world has discovered that centralised offerings can provide great value to users (and advertisers), that realisation is semi-forbidden in open source land. If everything must follow the UNIX philosophy, then the thing that's really left for the open source community to innovate in is systems stuff, ie. actual UNIX. 
> There are two exceptions to this: Wikimedia, mostly because the attempts to decentralise Wikipedia have been so horribly bad (anyone remember levitation?), and the large scale dissemination of pornographic movies (aka BitTorrent). The latter is being eaten up by centralised services like Netflix, Spotify, RedTube.
> What's underlying this is that the open source community still hasn't found a way to provide web-based, end user-facing services. If anything will make open source largely irrelevant to the web at large, it's this.
> [[
> A random example: Mozilla is trying to compete with Apple and Google on building a smartphone system. It turns out, though, that a smartphone system isn't really a piece of software that runs on a handset. It's a large set of orchestrated services (location, profile, social, ...) that your handset connects you to. When I attended their summit last year, they had internal screaming matches about whether Mozilla should provide these (and thus become a large-scale data hoarder, just like its competitors).
> Similarly, things like Diaspora just die because they represent bad service design. Redecentralize [1] is a list of things that I am deeply sympathetic with on an ideological level - but I don't think I (or most of my friends) actually use a single one of these tools. 
> [1] https://github.com/redecentralize/alternative-internet
> ]]
> So what should FOSS do? I believe that we need to start being serious about providing open source, openly licensed, centralised services. These services may be provided by open source platforms, but the platform in itself is just not enough. 
> Technologists - especially us at OpenSpending - have this notion that we can get away by just providing a platform. Others will then use it to provide end-user services with our platform's data. This has actually worked at least once, with OpenStreetMap.
> But I just can't see very much evidence that it actually applies to OpenSpending. The people who provide analytical services in this field - let's name SpendNetwork and OpenGov.com - don't actually need to access our large repository of data (or our APIs). Their customers are cities, and these cities bring their own data (and APIs are easy to code).
> This makes OpenSpending unlike OpenStreetMap, and it makes developers an unrealistic and unwilling target audience for the project. I think the budgetary constraints on OpenSpending have led to a shift in thinking. The discussion you're now having is not what problems need to be solved, but: which ones are cheap to solve. Putting the code for a bunch of APIs on GitHub and storing lots of CSVs on S3 is incredibly cheap, I'm just not sure whose problem it solves.
> OpenSpending could be a strong open source service, if it did two things: a) actually start thinking even more about who its end-users are and start to provide them value, and b) convince a set of funders to financially support the site until something fundamentally better is available.
> OpenSpending, if it is addressed (directly and through its satellites) at citizens, journalists and policy analysts, is a public service. It needs to find a funding mode that reflects this: grant funding, perhaps even public funding.
> OpenSpending, if it is addressed at a group of "other developers" who magically need its services and data yet don't face the same kind of constraints OKF has and instead provide great public services, is a fiction. 
> So, in summary: yes, let's make OS a modular application, because it's the right thing to do. But let's not adopt the idea that a modular set of tools is a replacement for a user-facing web service in 2014. Let's find a model for OS to have an impact that doesn't involve the open source narrative prop of "other developers" who don't have our problems. 
> I apologise for the length of my response. 
> Cheers, 
>     Friedrich 
> On Mon, Dec 22, 2014 at 3:43 PM, Tryggvi Björgvinsson <tryggvi.bjorgvinsson at okfn.org> wrote:
> Hi all,
> Warning, long and quite theoretical, but still important to discuss. For
> the short version please see the updated OpenSpending Enhancement
> Proposal #1:
> https://github.com/openspending/osep/blob/gh-pages/01-approach-and-architecture-of-openspending.md
> Conway's Law says: "A system reflects the organizational structure that
> built it".
> In its essence this is about how communications between team members
> affects the architecture of the software they're building. Conway's Law
> is very important for managers who are organizing the team. It tells
> them to organize the team so that its communication lines reflect
> the software. If the system architecture changes, the team also needs to
> change.
> The problem with these recommendations is that they don't reflect open
> source development. In open source development projects, there is no
> manager responsible for putting the team together or re-organizing the
> team members. In open source software development the team is organic,
> self-selected. People join and help out (in the capacity they can) when
> and because they're interested.
> So from an open source software development point of view we have to
> turn this around. *The system should reflect the organizational
> structure that can build it*. It's my hypothesis or conjecture or
> whatever. What I'm trying to say is that it's the other way around in
> open source software development: the system needs to be designed to
> accommodate for the community that uses and will therefore build it.
> OpenSpending's community is complex. We all approach OpenSpending from a
> different perspective and different use cases. We all have specialised
> needs that have to do with how we interact with OpenSpending and what we
> expect to get from it. The current system however reflects an
> organizational structure of a close group of team members, who
> communicate internally (which is kind of how OpenSpending came into
> being). But that's not what OpenSpending should be. We're an open source
> software project. As such we create software that should be able to
> service many stakeholders, who communicate together publicly (scratch
> the itch and all that).
> So to cut this introduction short and save it from being too academic
> and boring: I want to propose a different system architecture for
> OpenSpending, one that invites a bigger community to participate and
> more aptly reflects the organizational structure we want.
> This new architecture is a micro-services solution. Lots of smaller
> components that can talk together via defined protocols. A system that
> can be extended to scratch itches and be simple enough to allow people
> to jump into a small project without having to dive through a mountain
> of (directly) unrelated code. So this takes David Parnas' information
> hiding to the system level (we're not inventing the wheel here).
> This triggers more software engineering theory goodness in me (which
> may or may not interest you as much as me). We've kind of covered
> Parnas' Law: "Only what is hidden can be changed without risk". Parnas'
> Law is a two-edged sword. For us this would allow us to make changes without much
> risk of causing a butterfly effect within the OpenSpending code base but
> it also makes it very important for us to have a good think about what
> we expose (so we don't change it very often because that's risky). So
> the interface is very important to get right but the flow behind it is
> something we can iterate fast on.
> Another law that touches this architectural change is Lanergan's Law:
> "The larger and more decentralized an organization, the more likely it
> is that it has reuse potential". It's still worth reiterating even though
> that's what we developers usually do. The micro-services should be as
> general as possible so that they can get re-used. Who knows if they can
> get used outside the OpenSpending community and we'll have an even
> bigger group of people helping out with maintenance and development.
> Again this is kind of a reverse of the law. Let's design with reuse
> potential so we can have a large and more decentralized organization.
> And lastly in this probably-only-exciting-to-Tryggvi software
> engineering theory, a word of advice from DeRemer's Law: "What applies
> to small systems does not apply to large ones". Breaking things down in
> this way may not end up as being a more manageable system. It will be a
> bumpy road because experiences may not necessarily be shareable between
> micro-systems or for the overall system, but this architecture may
> instead invite a bigger development community, i.e. a bigger team that
> can share the burden. Hopefully some with experience in large system
> designs and others with experience in smaller systems, so different
> experiences and interests are going to be needed.
> Alright enough with theory! What are we going to do?
> To sum the architectural change up, we want to: *Centralize data and
> De-centralize presentation*.
> If you want to follow along you can take a look at the images in the
> OpenSpending Enhancement Proposal #01:
> https://github.com/openspending/osep/blob/gh-pages/01-approach-and-architecture-of-openspending.md
> This makes it quite difficult to talk about the OpenSpending platform
> because there will be no "central platform" per se, only a central
> repository of data, plus some subdomains to expose services. It's
> probably therefore better to think about the software itself in terms of
> the OpenSpending repos on github.
> The overall architecture is that we would split OpenSpending into three
> "layers" (marked in the images as "mostly OpenSpending stuff", "some
> OpenSpending and some 'others' stuff", and "mostly 'others' stuff"):
> * Input and storage of data
> * Information retrieval and analysis
> * Presentation and external sites
> We propose that we put most of our focus on receiving
> and storing raw data. That's the underlying building block. Without the
> data we have nothing. So rock-solid input of data, standardized formats
> to make it all useful outside the context a single user wants to use it
> in. So the focus here would be in a Budget Data Package importer
> (standardized data). And storing the Budget Data Packages in something
> like a flat file storage (s3). Hook all of that into some permission
> system we devise and validation etc. This does not mean that we can just
> ignore everything that's not a Budget Data Package, so we'll need another
> importer which for example would map onto a Budget Data Package, at
> least to begin with, but imo we should focus on the BDP importer.
> Then we would also put a lot of power into the analysis of the data and
> making it accessible (but in such a way that it supports various and
> distributed presentation modes). However here we would expect others to
> also do things which wouldn't be in the OpenSpending github organisation
> repos. So the OLAP cube OpenSpending now imports, models and maps
> everything into, would happen in this area, but of course in a different
> way than what it currently does. We would now base everything off of
> standardized data and automatically import into the OLAP cube and
> perhaps build standardized aggregation queries and cache them properly.
> There could of course be others who want to use something else like
> Hadoop or something to analyse the data in a different way and they
> could. The raw data we serve (previous layer), is centralized but
> anybody can use it in the way they like.
> The services we would focus on would be an OLAP cube with standard
> aggregations, search (a different implementation probably than what we
> currently have) and SQL-like arbitrary queries to provide more
> professional access to the data where you could join datasets and things
> like that. We would front this with an API so the analysis bit isn't
> directly accessible, i.e we won't give anybody direct access to backend
> systems (but do it via an API micro-service), just like we wouldn't give
> anybody direct access to the data storage.
> The presentation layer is where we would put the least focus, except
> only on supporting services/software solutions. We would leave this
> layer mostly up to "others" (which would still probably be part of the
> OpenSpending community). By that we mean that we wouldn't have many
> repos for presentation things (and move those we now have elsewhere)
> except for a few very general or specific ones. General ones would be
> templates that people can use to build their own budget visualisation
> sites (like Where Does My Money Go?) or plugins like our WordPress
> plugins or the CKAN plugins (on the principle of trying to reduce the
> information hiding/exposure risk). In this layer we would also have the
> OpenSpending.org website but a simpler version of the current one.
> Basically just as a frontend for or link to some micro-services,
> introduction to the project etc.
> In between the reading of data (either raw data or analysis results via
> the reading api) we would provide some budget visualisations, mostly via
> OpenSpendingJS, but it would be open for others to implement their own.
> I put these into the information retrieval and analysis layer because
> they wouldn't be able to stand on their own and would be used by the
> presentation layer and require special knowledge of budget
> visualizations (e.g. adjusting for inflation when comparing across years
> etc.) So in a way it's a reading thing and something I expect us to
> provide most of the core part, but yeah it could also be in the
> presentation layer. It doesn't really matter that much where we put it
> as long as we all understand what our role as a developer community is
> in providing these services.
> I think this email has already become too long so I'm going to stop for
> now and give you some room to think and contemplate but there are a lot
> of decisions we need to think about going forward, if this is something
> we agree on:
> * Integration layer between components (HTTP/Message queues/Carrier
> pigeons)
> * How to build common components that can be re-used by all (or what we
> can re-use from others)
> * Preferred development language of components (preferred, not one to
> rule them all)
> * How and where to start? (what components should we start work on)
> * Migration of older datasets (existing one in OpenSpending, can we
> focus on BDP at all?)
> * Design of each component (probably separately by those who want to
> work on it)
> * Code and communication conventions for the dev community (common
> guidelines - we're a group!)
> * Lots of other things
> So, isn't it best to say that I'm interested in a discussion by asking
> the question: What do you think?
> /Tryggvi
> _______________________________________________
> openspending-dev mailing list
> openspending-dev at lists.okfn.org
> https://lists.okfn.org/mailman/listinfo/openspending-dev
> Unsubscribe: https://lists.okfn.org/mailman/options/openspending-dev
