[wdmmg-dev] Proposal: Architectural Changes - Part I. (long)

Stefan Urbanek stefan.urbanek at gmail.com
Fri Jul 8 00:36:24 UTC 2011


Hi,

This is a longer email, part I. of OpenSpending architectural changes - proposal. It contains:

1. Changes Overview –
2. Data Sources and Storage Changes
3. Architecture Proposal

It is drawn from best practices and our discussions during workshops. Does not contain implementation details yet, it is more description of "desired state" that we will try to achieve with small incremental changes.

Google Doc version (open for editing) of this email with diagrams included is here:

https://docs.google.com/document/d/1n6I-3ULNtftaMpym00sadvbW3IahJxi0X3vQNjO7P4Q/edit?hl=en_GB

Feel free to leave inline comments, questions, references, worries in the document or inlined in the email reply (I will add them to the google doc).

Part II. will describe the ETL process and master data management in more details.

Regards,

Stefan

Here it comes:

Overview of Architectural changes
---------------------------------

* legacy application package wdmmg and wdmmg-ext will be split into more logical and functionally coupled
  packges:

  * OpenSpending Web - web front-end application, with interface for pluggable visualisations
  * OpenSpending ETL - extraction, transformation and loading package
  * OpenSpending OLAP - online analytical processing package
* main OpenSpending packages should focus on "core business" - that is providing analytical insight into
  spending data, either through web based interface, search engine or API. 
* analytical dataset is read-only for web application

Changes required:

* remove non-web code from the wdmmg package. *Reasons*: cleaner packages, easier to maintain
* separate raw/analytical data layer: well defined API to get analytical and raw data from data backend –
  *Reasons*: more transparent architecture, easier to maintain and track-back bugs (software or data); more
  architectural flexibility – easier to make changes; easier to introduce new analytical functionality
* web application should not depend on data structures and physical data storage, all queries should go
  through raw/analytical data API – *Reasons*: future changes require less work, are less prone to errors;
  scalability is matter of backend not of application
* remove dataset specific code and all dataset specific extraction, transformation and loading code should be
  delegated to third-parties. *Reasons*: keeps code clear; separation forces to use encapsulated API oriented
  architecture, which is more transparent and easier to maintain; encourages reusability of ETL code

Data sources and storage changes
--------------------------------

**New functionalities:**

* master management will be introduced - process for creation, maintenance, update of lists, enumerations,
  dimensions, classifications, mappings, etc.
* ETL process monitoring will be introduced – all ETL jobs should be auditable through single place

**Changes:**

* split current application data store into multiple stores depending on data stage: source mirror, staging
  area/operational data store, datamart area, application specific data (described later). *Reasons*:
  transparent data processing process; modular architecture wich is easier to maintain; provides basis for
  better data quality – data quality issues are easier to track; open for scalability at any given level
  without need to rewrite the rest of the OpenSpending system(s)
* each dataset in OpenSpending should have a corresponding CKAN package – *Reasons*: transparency of data
  sources; availability to third parties; crowd-sourced source data auditability; last but not least
  (non-openspending related) - encourages use of CKAN and serves as use-case example of CKAN usage
* most of master data sources (classifications, lists, enumeration) should be available as CKAN packages as
  well, mainly list of entities, classfications – *Reasons*: same as reasons for dataset source being stored
  in CKAN; introduces better reusability of classifications at source level – potential data providers can be
  pointed to open and public existing classifications to make their data comply with OpenSpending
  requirements.

Restrictions:

* no direct data change in database should be done manually, every change should be introduced in ETL
  process, even small bug fixes – *Reasons*: transparency; better data quality; data reconstructability
* whole data mart should be regenerated by running required ETL scripts – *Reasons*: in case of data loss
  (database attack, hacking, manual changes), whole datamart can be regenerated to the same state as it was
  before just using scripts

Architecture Proposal 1
=======================

Top-level modules from user's level down to data level:

* OpenSpending Web Application
* Online Analytical Processing
* ETL

External modules and dependencies:
* CKAN – not part of OpenSpending, but required as data source
* third-party ETL packages – created by *"data wranglers"*

OpenSpending Web Application
----------------------------

* provides web interface for visualising, browsing and searching OpenSpending datasets
* has notion of "reports" – description of how datasets are being visualised and analysed, which dimensions,
  measures and post-computations are used
* might contain application data store for storing "transactional application data", such as comments,
  taggings, customized reports
* web application accesses analytical data in read-only mode

Online Analytical Processing (OLAP) Module
------------------------------------------

* provides interface for querying analytical data: from facts to aggregates
* has well-known read-only interface which serves as the only way for accessing open-spending datasets from
  the end-user web aplication

Interface should provide functionalites:
* retrieve detailed information about spending (fact/detail)
* retrieve metadata describing a dataset entries
* retrieve list of dimension entries (such as list of entities, list of classifications, ...)
* aggregate measures through dimensions

API responses should be in the most appropriate format:
* Python objects when called from python
* JSON when called using HTTP API
* CSV through HTTP where appropriate

There might be possible future analytical capabilities of the OLAP backend:

* pivot tables – cross-table structures with dimensions in rows and columns, possibly paginated by another
  dimension; should be provided in the most appropriate form that can be immediately used with result
  consumer without any major transformation (for example: JSON reply should contain list of row and column
  titles and then table cells as list of rows where row is list of column cells, already sorted according to
  pivoted dimensions)
* histograms
* pre-computation of measure-based classification dimensions (for example: low/medium/high expense)
* association mining (apriori/shopping basket) – example use from spending data: "what combination of
  subjects of contracts is most common?"
* etc.

Note: search engine might or might not be part of OLAP module, recommended is indexing out-of OLAP module
with references to OLAP objects (multidimensional aggregates, detailed facts)

*Data Storage*

Implementation of data storage for OLAP should be the most appropriate with regard to: available software,
knowledge of developers, maintainability, auditability and ease of use for analytical processing. 

The OLAP data store should be abstracted and hidden from application – no direct access should be allowed,
analytical queries should be done only through analytical API. Abstraction makes architecture cleaner, easier
to maintain and more transparent, therefore more auditable. Also is open for future changes, such as decision
to change analytical store for scalability or feature richness reasons.

There are two actual possibilities of data store backen:

* relational database (ROLAP)
* mongo-db

Preference is for ROLAP for following reasons:

* high availability of knowledge and best practices for ROLAP, well tuned schemas and processes
* very easy to implement
* SQL is very good for ad-hoc queries – allows wide variety of fast ad-hoc reports on top of structures
  created by ROLAP backend

Stefan Urbanek

senior business intelligence consultant
http://knowerce.com







More information about the openspending-dev mailing list