[openspending-dev] Overview: OS as Data Warehouse

Stefan Urbanek stefan.urbanek at gmail.com
Mon Dec 29 07:05:15 UTC 2014


Hi,

After the recent email and IRC discussions I’ve come up with this overview that might serve as a starting point for the discussions about the future architecture of Open Spending. I’m opening just a few topics here that will need to be discussed in more detail, for example as separate threads.

The topics covered in this email:

* data warehouse logical layout
* data quality overview

Remaining topics:

* multi-dimensional model definition (non-tech)
* master data definition (non-tech)
* data quality indicators definition (non-tech)
* ETL process and management overview (technical)
* metadata definition (non-tech + technical)
* naming conventions (technical)

Why am I starting with the logical layout and data quality? They define the bounds, at a higher level of abstraction, for us to move within. I’m not going into technical details yet as I think we should solve the logical and conceptual aspects of the system first. Worry not, there are a few low-level diagrams in the pipeline :-)

So here we go…

GENERAL REQUIREMENTS

Requirements for the data warehouse (in general):

* Presented data should be understandable to the consumers: analysts, journalists or developers doing further processing
* Data can be trusted – users should trust that the data they are looking at are the data they think they are looking at. There should be means to prove that very easily and transparently.
* The labelling of the data should be consistent throughout the whole data warehouse
* Data quality should be continuously monitored and presented to the end user on demand
* The data warehouse should be easily adaptable to changes such as new types of sources, new classifications, re-classifications, demands of different kinds of reporting, etc.

(recommended read [1])


LOGICAL AREAS

I’m proposing to use an adjusted traditional data warehouse layout: staging area, dimensional data mart area, metadata and master data. Due to the nature of the project, its source data and data use, there is very little need for an ODS-like area. It might be introduced later if necessary.

See image (source: http://cl.ly/image/1r3W021p2m42/Image%202014-12-28%20at%2010.45.49%20pm.png)



The areas are based on a logical division of the data. They don’t necessarily have to be physically separated in different data stores, nor do they have to be stored together. However, developers should approach them separately due to their different maturity levels, quality levels or their nature in general.

Every data entity (data table, metadata table, mapping, …) should be assigned to only one logical area.

*Note: The Open Spending application is just one of many possible applications (despite being the only one at the moment) on top of the Open Spending data warehouse. Its structures should be kept separate from the data warehouse’s structures. The main reason is encapsulation of the data warehouse, which results in better maintainability*.
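To make the one-entity-one-area rule a bit more concrete, here is a minimal Python sketch. The table names and the prefix-style naming are purely illustrative assumptions – naming conventions are one of the remaining topics:

```python
from enum import Enum

class Area(Enum):
    """Logical areas of the data warehouse."""
    STAGING = "staging"
    DATAMART = "datamart"
    MASTER_DATA = "master_data"
    METADATA = "metadata"

# Every data entity is registered in exactly one logical area.
# Entity names below are made up for the example.
ENTITY_AREAS = {
    "src_budget_2014_csv": Area.STAGING,
    "fact_spending": Area.DATAMART,
    "dim_cofog": Area.DATAMART,
    "master_country": Area.MASTER_DATA,
    "map_source_to_cofog": Area.MASTER_DATA,
    "meta_job_runs": Area.METADATA,
}

def area_of(entity: str) -> Area:
    """Return the single logical area an entity belongs to."""
    return ENTITY_AREAS[entity]
```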

## Staging Area

The staging area contains all raw source data used to generate the analytical dimensional data mart. The staging area might contain several copies or versions of the same source dataset.

The data in the staging area are validated just to satisfy the input data validation requirements, such as proper data types. Content should not be validated at this point – the process of content checking is left to another stage of processing, as it has a different nature.

Note: Currently data loads are not real-time; they are mostly on-demand, user-initiated processes. Focusing on the real-time aspect of the ETL would be a distraction and very likely a waste of resources at this stage of the project.
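As an illustration of the kind of checks that belong to the staging area, here is a rough Python sketch that validates only field presence and data types and leaves content checks to later stages. The field names and parsers are assumptions for the example:

```python
import csv
from datetime import datetime

# Expected input types per field; field names are purely illustrative.
FIELD_PARSERS = {
    "date": lambda v: datetime.strptime(v, "%Y-%m-%d"),
    "amount": float,
    "cofog_code": str,
}

def validate_staging_row(row: dict) -> list:
    """Return a list of input-format problems for one source row.

    Only presence and data types are checked here; content checks
    (completeness, mappability) happen later in the pipeline.
    """
    problems = []
    for field, parse in FIELD_PARSERS.items():
        if field not in row or row[field] == "":
            problems.append("missing field: %s" % field)
            continue
        try:
            parse(row[field])
        except ValueError:
            problems.append("bad type for %s: %r" % (field, row[field]))
    return problems

def load_to_staging(path: str):
    """Read a source CSV and split rows into loadable and rejected ones."""
    loadable, rejected = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            issues = validate_staging_row(row)
            if issues:
                rejected.append((row, issues))
            else:
                loadable.append(row)
    return loadable, rejected
```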

## Datamart Area

The Datamart area contains data ready to be analysed. The data contained in this area are multi-dimensional, and there are distinguishable containers (tables) in two roles: facts and dimensions.

Since the underlying storage system will be a relational database, the most appropriate schema is star or snowflake. Whether the former or the latter is open for discussion. Due to the mostly unique nature of every classification in this project, I would suggest going with a star schema.

How the facts and dimensions are modelled is described in further sections.
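For illustration, a minimal star schema sketch using SQLAlchemy table definitions – one flat (denormalised) table per dimension and a fact table pointing to them via foreign keys. The actual dimensions, columns and names here are assumptions for the example, not a schema proposal:

```python
from sqlalchemy import (Column, Date, ForeignKey, Integer, MetaData,
                        Numeric, String, Table)

metadata = MetaData()

# Dimension tables: one flat table per dimension, as in a star schema.
dim_cofog = Table(
    "dim_cofog", metadata,
    Column("id", Integer, primary_key=True),
    Column("code", String(10)),
    Column("name", String(200)),
    Column("level", Integer),
)

dim_entity = Table(
    "dim_entity", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String(200)),
    Column("country_code", String(2)),
)

# Fact table: measures plus foreign keys to the dimensions.
fact_spending = Table(
    "fact_spending", metadata,
    Column("id", Integer, primary_key=True),
    Column("cofog_id", Integer, ForeignKey("dim_cofog.id")),
    Column("entity_id", Integer, ForeignKey("dim_entity.id")),
    Column("date", Date),
    Column("amount", Numeric(18, 2)),
)
```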

## Master Data

The master data area mostly contains data that hold and form keys. Master data is used in the ETL and DQ processes. In ETL it is used for conforming the labels of source systems to the master dimensions. In DQ it is used as a data source for the formulas of data quality indicators. The main types of data stored in this area are:

* reference datasets such as country listing, organisation lists, entry categorisations (COFOG)
* mappings between sources and master data

The reference datasets can also be used as master dimensions, or rather a dimension can be promoted to a master dimension, therefore serving the role of master data. In other words, some data in the master data area might be just links to data in the datamart area.

Note: this is the *master data* area of the Open Spending data warehouse (an internal area), not Open Spending serving as *master data* for other projects (a property of the project with respect to its external environment).
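A small sketch of what the mapping part of the master data might look like and how it would be used during conforming in ETL. The sources, labels and keys are invented for the example; the special *unmapped* member ties in with the data quality discussion below:

```python
# Mapping from source labels to master dimension keys (here: COFOG codes).
# Source names, labels and codes are purely illustrative.
SOURCE_TO_MASTER_COFOG = {
    ("source_a", "Health care"): "07",
    ("source_a", "Education"): "09",
    ("source_b", "zdravotnictvo"): "07",
}

# Special member used when a source label cannot be conformed.
UNMAPPED = "unmapped"

def conform_label(source: str, label: str) -> str:
    """Conform a raw source label to a master dimension key.

    Unknown labels are mapped to a special 'unmapped' member rather
    than blocking the load, so the issue stays visible in reporting.
    """
    return SOURCE_TO_MASTER_COFOG.get((source, label), UNMAPPED)
```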

## Metadata

The metadata area contains information about the data warehouse such as:

* data catalogue – list of data warehouse dimensions, facts and other tables. Can also be used to identify stray tables and for housekeeping purposes.
* multi-dimensional model – logical cube descriptions, their respective facts and dimensions
* job/task list – list of all manual and recurring tasks processing the data (ETL, DQ, housekeeping, …)
* job runs – registration of every run of a job with appropriate result status and message
* job run errors – log of records that caused ETLs to fail or issue a warning
* validation rules
* data quality indicators – list of data quality indicator definitions – human description, formula or reference to a function performing the set-based or row-based check
* data quality results – results of runs of the data quality indicator computations

*Note: Mappings are not part of metadata, they are data contained in the master data area as described in the section about Master Data.*
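For illustration only, a sketch of two of the metadata records mentioned above as Python dataclasses – the exact fields are assumptions, not a proposal for the metadata schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class JobRun:
    """One registered run of an ETL, DQ or housekeeping job."""
    job_name: str
    started: datetime
    finished: Optional[datetime] = None
    status: str = "running"   # e.g. "ok", "failed", "warning"
    message: str = ""

@dataclass
class QualityIndicator:
    """Definition of an automated data quality indicator."""
    name: str
    description: str          # human-readable description
    check: str                # formula or name of the check function
    level: str                # "input", "content" or "business"
```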


DATA QUALITY

Data quality should be monitored throughout the whole process to achieve more trustworthy and usable data. There are at least three levels of data quality in the Open Spending project:

1. source input (format) quality
2. content quality
3. “business quality”

There are several dimensions of data quality to be monitored at each of the levels. Quality issues might not necessarily result in blocking errors; in fact, in many cases blocking might be counterproductive.

The dimensions monitored at the source input level concern format compliance (field contents, data types, uniqueness of keys that are supposed to be unique). Source quality issues are among the blocking ones and require the data provider to fix the input before it can be loaded into the system.

The content quality dimensions are mostly completeness (do we have all months/records?), the possibility of referential integrity (can we map this?), etc. Master data can be used in this process. Errors might not be blocking for reporting, but might trigger a warning. For example, an invalid mapping might just map to a special *invalid* member of a dimension, so we can proceed with reporting with explicitly communicated knowledge of the lower data quality.

“Business quality” (or “business rules”) covers, for example, data accuracy based on previously observed facts or assumptions stored in master data. These issues are mostly non-blocking, just indicating a potential issue that should be fixed at some point in the processing pipeline.

*Note: we are talking about data quality indicators that can be measured in an automated way. Non-automated data quality feedback is open for discussion*.
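As an example of an indicator that can be measured in an automated way, here is a sketch of a simple completeness check (“do we have all months?”). The record structure is an assumption for the example:

```python
from datetime import date

def months_between(start: date, end: date):
    """Yield the first day of every month from start to end inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1

def completeness(records, start: date, end: date) -> float:
    """Share of expected months actually present in the loaded records.

    `records` is any iterable of objects with a `date` attribute; the
    result can be stored as a data quality indicator result in metadata.
    """
    expected = set(months_between(start, end))
    if not expected:
        return 1.0
    present = {date(r.date.year, r.date.month, 1) for r in records}
    return len(expected & present) / len(expected)
```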

— END —

The tricky part of this project, compared to traditional data warehouse projects, is the nature, variety and structural unpredictability of the data sources. Therefore we should be more concerned about the adaptability aspect than in a typical data warehouse environment. This makes us focus more on the metadata, master data and DQ areas than on the actual data.

It would be great if we could start digging deeper into each of the topics… specify the areas, master data, … more closely. The technical details will slowly start to crystallise.

What do you think?

Cheers,

Stefan Urbanek

[1] Ralph Kimball & Margy Ross. “The Data Warehouse Toolkit.” 
—
I brew data
Home: www.stiivi.com
Twitter: @Stiivi
