[openspending-dev] OpenSpending - Thoughts on Approach and Architecture

Rufus Pollock rufus.pollock at okfn.org
Fri May 3 17:24:47 UTC 2013


On 3 May 2013 17:22, Tryggvi Björgvinsson <tryggvi.bjorgvinsson at okfn.org> wrote:
> On Thu 4 Apr 2013 20:28, Rufus Pollock wrote:
>> 1. OS provides a single central repository of open data on government
>> (and corporate) finances
>> 2. OS provides good access (APIs, dumps) but quite basic presentation
>> of that data (browser, some viz)
>> 3. Most of the presentation of that data happens on non-OS sites but
>> using OS data (via the API, via dump etc)
>>
>> Some of 3 may be done by members of the "OpenSpending" community and
>> we care a great deal about 3 (that stuff is the point of having 1+2)
>> BUT OS, at least as a technical project, is focused on 1+2.
>
> Yes, openspending for me is about helping users to analyse financial
> data and publish the results. We don't have to be on the publishing end
> (just make it easier).

I think I'm a bit confused by the term "publishing". My point is that
OS here is *not* *directly* about helping users analyse and publish
their results - just as OpenStreetMap is not directly about helping
users analyse and publish geodata. Rather it is about creating a
consolidated *open* database of information that others can easily
contribute to and use (to do analysis and presentation).

>> This means OpenSpending technically is about:
>>
>> - DB: Maintaining that central repository (note this need *not* be a
>> classic relational DB - it could be files on s3 or ...)
>> - ETL: Providing means to get data into that repository (ETL)
>> - API + Dumps: Providing means to get data out of that repository
>> - Viz: providing off the shelf visualizations
>> - Analytics: providing ways to do analysis on that data
>
> I think we have to split this into two parts:
>
> 1) the openspending software (or software projects), and
> 2) the openspending.org site
>
> Even though they are really coupled I think this distinction is
> important, one is about technical development the other about service.
> The software powers the service. If we don't make this distinction we
> might make suboptimal decisions about the software only to provide some
> specific service (and the software might become a huge bloated monster
> that tries to do *all the things!*).
>
> The goal of the software is to provide a rock-solid, fast analytical
> machine for financial data. The goal of the site is to provide a central
> repository for data (including discussions on how to understand that data).

Hmmm. I think I was trying to say something a bit different:
specifically that the purpose of the "OS" project (and hence
associated software) should *not* be to produce a fast analytical
machine for financial data. Others will do that far better and the
range and type of analytical requirements is too large for us to
effectively support. Instead, our goal should be to provide a
"rock-solid" database (in the broadest sense - not necessarily a
RDBMS) and related tooling (to get data in and some examples of how to
get data out and displayed and analyzed - but the latter tooling will
likely be fairly limited).

> In my opinion the API and ANALYTICS are a part of the openspending
> software (we might even go further and split these into three
> differently focused projects and imo that's a no-brainer split), the DB
> and ETL is a part of the service and VIZ should be a part of a separate
> openspending visualisation library (openspendingjs).

I'm not sure I understand the distinction completely but the key thing
for me is what we do and focus on.

[...]

> Here's how I envision it (I'm leaving things out but focusing on the
> general picture):
>
> |        Database        | -- |   Business logic   | -- |     Presentation     |
> +------------------------+    +--------------------+    +----------------------+
>   The central db of             openspending              Something +
>   openspending.org              software                  openspendingjs

A key point for me is that much of the presentation gets done by
others (just like OpenStreetMap). I also think it is important to see
OpenSpending as a project with some software to support it rather than as
primarily a piece of software. I say this because it means this isn't
just a classic software product.

> For me a clear separation between these parts is essential, although
> they all rely on one another. More focused efforts create a better
> community (aka ecosystem). Nifty visualisations can be merged upstream
> to the openspendingjs even though they aren't "basic presentations" and
> we wouldn't have to consider it from the service point of view. Our
> service might not even use the new visualisations.

Big +1 on more componentization, though I would like to avoid any huge
refactorings at present :-)

>> Note that on Viz and Analytics we would imagine only providing limited
>> functionality of the demonstrator or essential kind - there are lots
>> of visualizations and analyses that can be done and many ways to do it
>> and OS as a technical project will only do a little.
>
> Maybe I don't understand what you mean by limited functionality of
> Analytics but I would say this is a core part of the openspending software.

No, that's the exact point of my proposal: to step back from that
more. Perhaps we have a dedicated tool for this, but I don't believe we
can do that in core given resources (and why would we be trying to build
our own analytics tool that scales well to millions or billions of
rows ...)

>> Aside: analogies with OpenStreetMap. I continue to find analogies with
>> OSM incredibly useful. Few people see OSM data or maps via
>> openstreetmap.org. Instead they see or use that data in sites or
>> products elsewhere (e.g. FourSquare). OSM's core is the central DB,
>> the data adding tools and the API/Dumps. Viz even in the form of
>> essential things like mapnik and tile production now largely happens
>> in other projects that are a part of the community but not OSM "core".
>
> Yes and OpenStreetMap can be split into really small and focused
> projects. It becomes an ecosystem of projects instead of one huge
> project that does everything.

Agreed on more components. But there is also the point of focusing on
some parts.

>> ## Implications
>>
>> There’s more to think through here. These are just some immediate thoughts.
>>
>> 0. The DB is not necessarily a (relational) DB
>>   - We need something that we can reliably store into not something
>> that does all our analytics too. This could be flat files in s3
>
> Yes, I agree. This is imo because the DB is openspending.org's service
> while the software itself shouldn't be dependent on where the data is
> stored (centrally). The software should just provide the analytical
> processing api (the OLAP cube). The service portion needs to ask whether
> flat files in s3 are fast enough etc.

No, the DB would not necessarily provide *any* analytical processing.
That would be separate or not done at all.
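
To make the "store only" idea concrete, here is a rough sketch of a
repository that keeps each dataset as one flat file and offers nothing
beyond put, get and list. Everything in it is an assumption for
illustration: the class name FlatFileStore, the one-CSV-per-dataset
layout, the dataset name "uk-25k" and the sample row. A local temporary
directory stands in for s3.

```python
import tempfile
from pathlib import Path


class FlatFileStore:
    """Hypothetical store: one CSV file per dataset, no querying or aggregation."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, name, data):
        """Write the raw bytes of a dataset and return the path used."""
        path = self.root / f"{name}.csv"
        path.write_bytes(data)
        return path

    def get(self, name):
        """Return the raw bytes of a dataset exactly as they were stored."""
        return (self.root / f"{name}.csv").read_bytes()

    def names(self):
        """List the names of all stored datasets in sorted order."""
        return sorted(p.stem for p in self.root.glob("*.csv"))


# A temporary local directory stands in for s3 in this sketch.
store = FlatFileStore(tempfile.mkdtemp())
store.put("uk-25k", b"date,department,amount\n2013-01-15,Health,1200.50\n")
```

Any analysis would then live in separate tools that read these files,
which is the separation I am arguing for.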

[...]

>> 2. We should not care about OS.org traffic or SEO for normal users.
>> What we care about is API usage.
>>   - We should start measuring API usage asap ...
>
> Yes. Measuring this is vital to the project but the project also needs
> to be discoverable -- that doesn't have to be via SEO tricks, it could
> also be just via word of mouth :-)
>
>> 3. Enabling people to build satellite sites or embed viz is our priority
>>   - We have made huge strides in this direction ... but we can do more
>>   - E.g. why focus on satellite sites in wordpress
>>   - Make it easier to get data slices
>
> Yes. We shouldn't really recommend one single approach to building these
> sites. We should make it really easy to just visualise and publish the
> results however people like.
>
> What do you mean by easier to get data slices?

I mean getting a given month for a given dataset. Or even month by
department for e.g. UK 25k.
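
As an illustration only, this is roughly what such a slice amounts to,
written against a small in-memory CSV rather than the real API. The
column names (date, department, amount) and the sample rows are made up
for the sketch and are not the actual UK 25k schema.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; column names are assumptions for illustration.
SAMPLE = """date,department,amount
2013-01-15,Health,1200.50
2013-01-20,Education,800.00
2013-02-03,Health,300.25
2013-01-28,Health,99.50
"""


def month_slice(rows, month):
    """Return the rows whose ISO date falls in the given 'YYYY-MM' month."""
    return [r for r in rows if r["date"].startswith(month)]


def totals_by_department(rows):
    """Sum the amount column per department for the given rows."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["department"]] += float(r["amount"])
    return dict(totals)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
january = month_slice(rows, "2013-01")      # a given month for a given dataset
by_dept = totals_by_department(january)     # that month broken down by department
```

The point is that a user should be able to ask for "january" or
"by_dept" directly instead of downloading the whole dataset first.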

Rufus



