[wdmmg-dev] Import format

Nick Stenning nick at whiteink.com
Fri Jun 17 16:34:22 UTC 2011


Dear all,

These are a few thoughts about the OpenSpending import format. They
come with the significant disclaimer that I don't know the
OpenSpending backend in detail; they are primarily motivated by
experience wrangling the UK Research Councils (NERC/EPSRC) grants data
into an OpenSpending-compatible format.

## Where we are now

Imports comprise two CSV files:

  1. a data file in which each row represents a spending transaction:
in WDMMG-speak, an `Entry`.
  2. a metadata mapping file, which describes the fields in the data
file. This is translated automatically to a JSON-format mapping file.

For examples of these files see a [small slice of the NERC data
file][1], and the corresponding [metadata mapping][2].

[1]: http://j.mp/nerc-os-data
[2]: http://j.mp/nerc-os-mapping

## Issues with this format

The requirement that the data fits into a single CSV file necessarily
imposes some constraints on the kind of data that OpenSpending can
receive. The most obvious and immediate impact is that of
"denormalisation": if my dataset contains 5000 rows paid to "Joe
Bloggs, University of Bognor Regis", and 5000 rows paid to "Jane Doe,
Slough University", then each of these strings will necessarily appear
5000 times in the data file, even though each occurrence conveys only
a single piece of information: which payee the row refers to.

In this simple example, there's no big problem -- the only direct
consequence is an increase in data file size. That said, even this
complaint could be fixed by the ability to have two data CSVs:

### payees.csv
id,name
1,"Joe Bloggs, University of Bognor Regis"
2,"Jane Doe, Slough University"

### entries.csv
to,value
1,3000
1,4200
2,6000
[...]

In order to deal with this data, we would need to know that
`entries.to` corresponds to `payees.id`, a fact that would
presumably find its way into a suitable metadata file -- I'll come
back to this later.
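
As a rough sketch of what an importer could then do with that
knowledge -- purely illustrative, using only the file and field names
from the example above and nothing from the actual OpenSpending code
-- resolving the reference is just a dictionary lookup:

    import csv

    # Build a lookup table from the reference file: id -> display name.
    with open("payees.csv", newline="") as f:
        payees = {row["id"]: row["name"] for row in csv.DictReader(f)}

    # Walk the main data file, swapping each foreign key for the value
    # it stands for.
    with open("entries.csv", newline="") as f:
        for row in csv.DictReader(f):
            print({"to": payees[row["to"]], "value": float(row["value"])})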

This is probably an overengineered solution to this simple case -- a
far more sensible response would just be to ignore the problem and
accept slightly larger data files.

Unfortunately, there are other cases when denormalisation causes more
significant issues. Extending the example for grants data, I now
consider a scenario in which each grant is paid to a `person`, who in
turn works in a `department` of a specified `institution`. Given data
that associates each `entry` with all three, I might well want to
aggregate by any one of {person, department, institution}. But now I
have a problem, because both "University of Bognor Regis" and "Slough
University" have departments of "Media Studies". Naively, then, my
data file would look like this (ignoring the "to" field for clarity):

### entries.csv
value,person,department,institution
3000,Joe Bloggs,Media Studies,University of Bognor Regis
4200,Joe Bloggs,Media Studies,University of Bognor Regis
6000,Jane Doe,Media Studies,Slough University

The problem is clear: if I aggregate by `department` I'm going to get
bogus results, because only the pair (`department`, `institution`) is
guaranteed to be unique.
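
To make the failure concrete, here's a tiny sketch using the three
rows above (hypothetical values, standard-library Python only):

    from collections import defaultdict

    # The denormalised rows from entries.csv above.
    entries = [
        (3000, "Joe Bloggs", "Media Studies", "University of Bognor Regis"),
        (4200, "Joe Bloggs", "Media Studies", "University of Bognor Regis"),
        (6000, "Jane Doe",   "Media Studies", "Slough University"),
    ]

    # Grouping on the department name alone conflates two different departments.
    by_department = defaultdict(int)
    for value, person, department, institution in entries:
        by_department[department] += value
    print(dict(by_department))
    # {'Media Studies': 13200}  <- bogus: two departments merged into one

    # Only the (institution, department) pair is a safe grouping key.
    by_pair = defaultdict(int)
    for value, person, department, institution in entries:
        by_pair[(institution, department)] += value
    print(dict(by_pair))
    # {('University of Bognor Regis', 'Media Studies'): 7200,
    #  ('Slough University', 'Media Studies'): 6000}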

How can we solve this? If you look at the [NERC data][1] from earlier,
you'll see my (hackish) solution with the current format: force the
`department` to be unique by including its unique id from my database.
This is ugly, and there's no reason users of OpenSpending should have
to see "Media Studies (456)".

How should we solve this? The solution is trivial with multiple data files:

### institutions.csv
id,name
1,University of Bognor Regis
2,Slough University

### departments.csv
id,institution_id,name
1,1,Media Studies
2,2,Media Studies

### entries.csv
value,person,department
3000,1,1
4200,1,1
6000,2,2

This pattern will be familiar to anyone who's ever worked with
relational databases. The ability to use unique ids in the main data
file (entries.csv) means I can separate semantics ("To which
department am I referring?") from appearance ("How do I want this
department to appear on the website?").
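
For illustration, a minimal sketch of that separation -- assuming the
three files above, plus an implied people.csv that I've left out, so
the `person` column stays as an opaque id -- might look like this:

    import csv
    from collections import defaultdict

    def load(path):
        """Load a CSV into a dict keyed by its `id` column."""
        with open(path, newline="") as f:
            return {row["id"]: row for row in csv.DictReader(f)}

    institutions = load("institutions.csv")
    departments = load("departments.csv")

    # Aggregate by department: the ids carry the semantics, the joined
    # names are used only when we need something to display.
    totals = defaultdict(float)
    with open("entries.csv", newline="") as f:
        for row in csv.DictReader(f):
            dept = departments[row["department"]]
            inst = institutions[dept["institution_id"]]
            totals[(inst["name"], dept["name"])] += float(row["value"])

    print(dict(totals))
    # {('University of Bognor Regis', 'Media Studies'): 7200.0,
    #  ('Slough University', 'Media Studies'): 6000.0}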

Lastly, consider how much worse this problem gets if `people`,
`departments` and `institutions` have other properties, such as
`website`, `geographical_region`, `latlng`, `private_public_status`.
With a multiple-file format, these are non-issues.

## Metadata format and simpler cases

I should stress that I *absolutely* agree that the format should be as
simple as possible for the 80% case. Many (probably most) datasets
won't have this problem -- especially the scruffier ones. But there's
nothing to stop us from dealing gracefully with *both* the single-CSV
case and the zip-of-CSVs case.

Google's work in this area has led to DSPL, the [Dataset
Publishing Language][3], an existing specification that deals
with these issues and provides a metadata format for expressing the
relationships between the different CSVs.

[3]: http://code.google.com/apis/publicdata/

I'd greatly appreciate any and all thoughts on this. This email is now
quite long enough: signing off,

Nick



