[openspending-dev] OpenSpending - Thoughts on Approach and Architecture

Fri May 3 16:22:45 UTC 2013

On Thu 4 Apr 2013 20:28, Rufus Pollock wrote:
> 1. OS provides a single central repository of open data on government
> (and corporate) finances
> 2. OS provides good access (APIs, dumps) but quite basic presentation
> of that data (browser, some viz)
> 3. Most of the presentation of that data happens on non-OS sites but
> using OS data (via the API, via dump etc)
>
> Some of 3 may be done by members of the "OpenSpending" community and
> we care a great deal about 3 (that stuff is the point of having 1+2)
> BUT OS, at least as a technical project, is focused on 1+2.

Yes, openspending for me is about helping users to analyse financial
data and publish the results. We don't have to be on the publishing end
(just make it easier).

> This means OpenSpending technically is about:
>
> - DB: Maintaining that central repository (note this need *not* be a
> classic relational DB - it could be files on s3 or ...)
> - ETL: Providing means to get data into that repository (ETL)
> - API + Dumps: Providing means to get data out of that repository
> - Viz: providing off the shelf visualizations
> - Analytics: providing ways to do analysis on that data

I think we have to split this into two parts:

1) the openspending software (or software projects), and
2) the openspending.org site

Even though they are tightly coupled I think this distinction is
important: one is about technical development, the other about service.
The software powers the service. If we don't make this distinction we
might make suboptimal decisions about the software just to provide some
specific service (and the software might become a huge bloated monster
that tries to do *all the things!*).

The goal of the software is to provide a rock-solid, fast analytical
machine for financial data. The goal of the site is to provide a central
repository for data (including discussions on how to understand that data).

In my opinion the API and ANALYTICS are part of the openspending
software, the DB and ETL are part of the service, and VIZ should be part
of a separate openspending visualisation library (openspendingjs). We
might even go further and split these into three differently focused
projects; imo that split is a no-brainer.

I put it into three differently focused projects to conform with the
"traditional" development structure: | Database | -- | Business logic |
-- | Presentation |

Here's how I envision it (I'm leaving things out but focusing on the
general picture):

|     Database     | -- |  Business logic  | -- |   Presentation   |
+------------------+    +------------------+    +------------------+
  The central db of       openspending            Something
  openspending.org        software                    +
                                                  openspendingjs

For me a clear separation between these parts is essential, although
they all rely on one another. More focused efforts create a better
community (aka ecosystem). Nifty visualisations can be merged upstream
into openspendingjs even though they aren't "basic presentations", and
we wouldn't have to consider them from the service point of view. Our
service might not even use the new visualisations.

> Note that on Viz and Analytics we would imagine only providing limited
> functionality of the demonstrator or essential kind - there are lots
> of visualizations and analyses that can be done and many ways to do it
> and OS as a technical project will only do a little.

Maybe I don't understand what you mean by limited functionality of
Analytics but I would say this is a core part of the openspending software.

> Aside: analogies with OpenStreetMap. I continue to find analogies with
> OSM incredibly useful. Few people see OSM data or maps via
> openstreetmap.org. Instead they see or use that data in sites or
> products elsewhere (e.g. FourSquare). OSM's core is the central DB,
> the data adding tools and the API/Dumps. Viz even in the form of
> essential things like mapnik and tile production now largely happens
> in other projects that are a part of the community but not OSM "core".

Yes and OpenStreetMap can be split into really small and focused
projects. It becomes an ecosystem of projects instead of one huge
project that does everything.

> ## Implications
>
> There’s more to think through here. These are just some immediate thoughts
>
> 0. The DB is not necessarily a (relational) DB
>   - We need something that we can reliably store into not something
> that does all our analytics too. This could be flat files in s3

Yes, I agree. Imo this is because the DB belongs to the openspending.org
service, while the software itself shouldn't depend on where the data is
(centrally) stored. The software should just provide the analytical
processing API (the OLAP cube). The service side then needs to ask
whether flat files in s3 are fast enough, etc.
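
To make the storage independence a bit more concrete, here is a rough
sketch (not how openspending does it today). It assumes a "source" is just
a callable that yields entries as dicts; the names csv_source, list_source
and aggregate, and the "amount" column, are all made up for illustration.
The aggregation only ever sees the iterator, so it doesn't care whether
the entries came from a flat file or from somewhere else.

```python
import csv
import io
from collections import defaultdict
from typing import Callable, Dict, Iterator, List, Tuple

Entry = Dict[str, str]
Source = Callable[[], Iterator[Entry]]


def csv_source(text: str) -> Source:
    """Hypothetical backend: entries parsed from CSV text (stand-in for a flat file)."""
    def read() -> Iterator[Entry]:
        return iter(csv.DictReader(io.StringIO(text)))
    return read


def list_source(rows: List[Entry]) -> Source:
    """Hypothetical backend: entries already held in memory (stand-in for a DB query)."""
    def read() -> Iterator[Entry]:
        return iter(rows)
    return read


def aggregate(source: Source, drilldown: List[str],
              measure: str = "amount") -> Dict[Tuple[str, ...], float]:
    """Sum `measure` for each combination of the `drilldown` dimensions."""
    totals: Dict[Tuple[str, ...], float] = defaultdict(float)
    for entry in source():
        key = tuple(entry[dim] for dim in drilldown)
        totals[key] += float(entry[measure])
    return dict(totals)
```

Calling aggregate(csv_source(text), ["year"]) and
aggregate(list_source(rows), ["year"]) would give the same totals for the
same entries, which is the whole point: the backend becomes a decision for
the service, not for the software.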

> 1. Optimize ETL
>   - Getting data in is essential
>   - This is about people as much as tools
>   - Maximize structure and reliability

I agree with this as well. We need to decentralise the input (or
document it reeaally well). I envision plugins or standalone
applications for ETL (something like a LibreOffice extension or CKAN
plugin to upload data, or applications like graphical tools or simple
Perl scripts).
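
As an illustration of how small such a standalone tool could be, here is
a sketch in Python of a pre-upload check. The required columns (date,
amount, recipient) and the function name validate are assumptions made up
for this example, not an actual openspending format.

```python
import csv
import io
from typing import Dict, List, Tuple

# Hypothetical minimal set of columns; a real dataset model would differ.
REQUIRED = ("date", "amount", "recipient")


def validate(text: str) -> Tuple[List[Dict[str, str]], List[str]]:
    """Return (rows that look loadable, human-readable error messages)."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [col for col in REQUIRED if col not in (reader.fieldnames or [])]
    if missing:
        return [], ["missing columns: " + ", ".join(missing)]

    good: List[Dict[str, str]] = []
    errors: List[str] = []
    # The header is line 1, so the first data row is line 2.
    for line_no, row in enumerate(reader, start=2):
        try:
            float(row["amount"])
        except (TypeError, ValueError):
            errors.append(f"line {line_no}: bad amount {row['amount']!r}")
            continue
        good.append(row)
    return good, errors
```

The same check could sit behind a LibreOffice extension, a CKAN plugin or
a command line script; only the part that fetches the text would change.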

This spins into the ecosystem thing and works well with your OSM analogy.

> 2. We should not care about OS.org traffic or SEO for normal users.
> What we care about is API usage.
>   - We should start measuring API usage asap ...

Yes. Measuring this is vital to the project but the project also needs
to be discoverable -- that doesn't have to be via SEO tricks, it could
also be just via word of mouth :-)
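
A first step towards measuring could be as simple as counting calls per
endpoint. The following is only a sketch under assumptions: the decorator
counted, the counter api_calls and the handler aggregate_handler are
hypothetical names, and a real setup would more likely read the web server
logs or hook into the web framework.

```python
from collections import Counter
from functools import wraps

# In-memory tally of calls per endpoint name (hypothetical; lost on restart).
api_calls: Counter = Counter()


def counted(endpoint: str):
    """Decorator that increments the tally each time the handler is called."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            api_calls[endpoint] += 1
            return handler(*args, **kwargs)
        return wrapper
    return decorator


@counted("aggregate")
def aggregate_handler(dataset: str) -> dict:
    """Placeholder handler; a real one would run the aggregation."""
    return {"dataset": dataset}
```

After a few requests, api_calls.most_common() lists the endpoints by
number of calls, which is enough to see which parts of the API people
actually use.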

> 3. Enabling people to build satellite sites or embed viz is our priority
>   - We have made huge strides in this direction ... but we can do more
>   - E.g. why focus on satellite sites in wordpress
>   - Make it easier to get data slices

Yes. We shouldn't really recommend one single approach to building these
sites. We should make it really easy to just visualise and publish the
results however people like.

What do you mean by easier to get data slices?

/Tryggvi