[wdmmg-discuss] [wdmmg-dev] Import format

Friedrich Lindenberg friedrich.lindenberg at okfn.org
Mon Jun 20 09:52:04 UTC 2011


Hi Nick,

thanks for this feedback which I think is spot on in a number of places.

On Fri, Jun 17, 2011 at 6:34 PM, Nick Stenning <nick at whiteink.com> wrote:
> The requirement that the data fits into a single CSV file necessarily
> imposes some constraints on the kind of data that OpenSpending can
> receive. The most obvious and immediate impact is that of
> "denormalisation": if my dataset contains 5000 rows paid to "Joe
> Bloggs, University of Bognor Regis", and 5000 rows paid to "Jane Doe,
> Slough University", then each of these strings will necessarily appear
> 5000 times in the data file, despite the fact that each one is only
> conveying 1 bit of information per row.

You're right: denormalization is required for most data loads at the
moment. In fact, I'd say that large parts of OpenSpending's design are
rooted in the assertion that denormalization is not in fact as evil as
RDBMS orthodoxy would have us think (that's a lie, of course, but a
really interesting one).

> In this simple example, there's no big problem -- the only direct
> consequence is an increase in data file size. That said, even this
> complaint could be fixed by the ability to have two data CSVs:

[ snip example ]

> This is probably an overengineered solution to this simple case -- a
> far more sensible response would just be to ignore the problem and
> accept slightly larger data files.

I agree that having support for auxiliary sheets (or codesheets) in
OpenSpending would be beneficial in a number of cases. At the moment,
I regularly use Google Refine and its cross() function to join up
data prior to import.

The reason I would approach this with care is not so much that it is
over-engineered, but that it puts additional knowledge about the
dataset into OpenSpending. I think our strategy should be that
OpenSpending knows as little as possible about the internal joins of
a dataset - describing these relations and how to act on them is
likely to produce either dataset-specific code in the loader or an
instance of Greenspun's Tenth Rule in the mapping file.

So my vote on this would be to denormalize outside of OpenSpending
for now, and to consider codesheets once we have seen a sufficient
number of examples and can come up with a simple, general
implementation both for specifying "ON" conditions and for actually
handling multiple resources.
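
To make that concrete, here is a rough sketch of such a pre-import
join in plain Python (essentially what cross() does for me in
Refine). It is illustrative only: the file and column names follow
the multi-file example quoted further down, everything else is made
up.

### denormalize.py
import csv

# Build lookup tables from the two codesheets.
with open("institutions.csv") as f:
    institutions = {row["id"]: row["name"] for row in csv.DictReader(f)}
with open("departments.csv") as f:
    departments = {row["id"]: row for row in csv.DictReader(f)}

# Flatten everything into the single file OpenSpending currently
# expects (the person column is ignored here, for brevity).
with open("entries.csv") as src, \
     open("entries_flat.csv", "w", newline="") as out:
    writer = csv.DictWriter(out,
        fieldnames=["value", "department", "institution"])
    writer.writeheader()
    for row in csv.DictReader(src):
        dept = departments[row["department"]]
        writer.writerow({
            "value": row["value"],
            "department": dept["name"],
            "institution": institutions[dept["institution_id"]],
        })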

> Unfortunately, there are other cases when denormalisation causes more
> significant issues. Extending the example for grants data, I now
> consider a scenario in which each grant is paid to a `person`, who in
> turn works in a `department` of a specified `institution`. Given data
> that associates each `entry` with all three, I might well want to
> aggregate by any one of {person, department, institution}. But now I
> have a problem, because both "University of Bognor Regis" and "Slough
> University" have departments of "Media Studies". Naively, then, my
> data file would look like this (ignoring the "to" field for clarity):
>
> The problem is clear: if I aggregate by `department` I'm going to get
> bogus results, because only the pair (`department`, `institution`) is
> guaranteed to be unique.

I would treat this (entity identification) as an issue independent of
the denormalization question; the two are only related when
codesheets happen to contain primary identifiers, I think.

My current approach to this has been to distinguish three kinds of
entity/classifier attributes:

1) Internally identifying attributes (e.g. _id, name), which are used
for URL generation and some internal matching within OpenSpending.

2) Title and description, which are mandatory but not identifying.

3) Dataset-specific, identifying fields.

Number 3) deserves some explanation: in the NERC example, you could
add NERC_Institution_Name and NERC_Department_Name with the values
given in the dataset, which together suffice to make each entity
unique. You would also, independently, set the Title to be the same
as NERC_Department_Name.
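
To illustrate (this is a sketch, not our storage format - apart from
the two NERC fields, the attribute names and values are invented),
the resulting entity record would combine all three kinds of
attributes:

entity = {
    # 1) internally identifying, used for URLs and internal matching
    "name": "media-studies-bognor-regis",
    # 2) mandatory, but not identifying
    "title": "Media Studies",  # copied from NERC_Department_Name
    "description": "Media Studies, University of Bognor Regis",
    # 3) dataset-specific, identifying fields
    "NERC_Institution_Name": "University of Bognor Regis",
    "NERC_Department_Name": "Media Studies",
}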

OpenSpending already has code to update entities (and classifiers)
not by their "name" but by an arbitrary combination of entity
attributes (in this example, the two NERC fields). This way we can
uniquely identify the entity in the context of the dataset, while
other datasets can add their own data (and even change the title)
and everything still stays in sync.

These composite identifiers for entities seem doable to me; what do you think?
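
For the record, here is a toy sketch of what I mean by updating on an
arbitrary attribute combination. This is not the actual OpenSpending
code: the entity store is simplified to a plain Python list.

def upsert_entity(entities, keys, record):
    # Match on the composite identifier, i.e. the given
    # combination of attribute values ...
    composite = dict((k, record[k]) for k in keys)
    for entity in entities:
        if all(entity.get(k) == v for k, v in composite.items()):
            # ... and update in place: other datasets can add
            # their own data, or even change the title.
            entity.update(record)
            return entity
    entities.append(dict(record))
    return entities[-1]

entities = []
upsert_entity(entities,
              ["NERC_Institution_Name", "NERC_Department_Name"],
              {"NERC_Institution_Name": "University of Bognor Regis",
               "NERC_Department_Name": "Media Studies",
               "title": "Media Studies"})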

> How can we solve this? If you look at the [NERC data][1] from earlier,
> you'll see my (hackish) solution with the current format: force the
> `department` to be unique by including its unique id from my database.
> This is ugly, and there's no reason users of OpenSpending should have
> to see "Media Studies (456)".
>
> How should we solve this? The solution is trivial with multiple data files:
>
> ### institutions.csv
> id,name
> 1,University of Bognor Regis
> 2,Slough University
>
> ### departments.csv
> id,institution_id,name
> 1,1,Media Studies
> 2,2,Media Studies
>
> ### entries.csv
> value,person,department
> 3000,1,1
> 4200,1,1
> 6000,2,2
>
> This pattern will be familiar to anyone who's ever worked with
> relational databases. The ability to use unique ids in the main data
> file (entries.csv) means I can separate semantics ("To which
> department am I referring?") from appearance ("How do I want this
> department to appear on the website.").

I fail to see how this is different from specifying primary keys; can you elaborate?

> Google's exploits in this area have led to DSPL, the [Dataset
> Publishing Language][3], an extant specification that deals
> with these issues and provides a metadata format for expressing
> the relationships between the different CSVs.
>
> [3]: http://code.google.com/apis/publicdata/

I agree that DSPL has a nice way to handle multi-file input and to
specify joins, but I have two concerns with it:

1) It is not spending-specific, and we would need to add the spending
semantics separately (e.g. as an XML namespace). These would include
basically all the core metadata of OpenSpending: spender, recipient,
which field is the amount, ... In short: while DSPL is a fantastic
solution for creating mapping definitions, it will not help us to
describe the dataset dimensions, as far as I can see. And since I
think our focus should be on dimension descriptions rather than on
data mapping, I'm not sure introducing its complexity is justified.

2) I think it is relatively complex, both to implement a consumer for
and to build an editing UI around. And since the spec is not open and
can be changed by Google whenever they shift focus for Public Data
Explorer, all the hate that schema.org has received applies here as
well. This is why I'd vote for DSPL export, but not for using DSPL as
a working format.

- Friedrich



