[data-protocols] Simple Data Format: straw man

Rufus Pollock rufus.pollock at okfn.org
Wed May 16 13:52:12 BST 2012


On 15 May 2012 12:24, Nick Stenning <nick at whiteink.com> wrote:
>> On Tuesday, May 15, 2012 11:18:56 AM Chris Taggart <countculture at gmail.com> wrote:
>>
>> Only allowing CSV for the data ... because it's heavily nested (... there's no practical
>> benefit to normalising it into individual tables).
>
> Not sure I follow this. Either you normalise it into individual tables
> (not terribly easy but a must for key concepts) or it's denormalised
> in either JSON or CSV format. Is there really a problem with
> converting between
>
>    { "category": { "id": 4, "name": "foobar" } }
>
> and
>
>    category.id,category.name
>    4,foobar
>
> ?
>
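Nick's JSON-to-CSV example above can be sketched in a few lines of Python. The `flatten` helper and the dotted-key convention are illustrative only, not part of any spec:

```python
import csv
import io

def flatten(obj, prefix=""):
    """Flatten a nested dict into dotted keys, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat = {}
    for key, value in obj.items():
        dotted = prefix + "." + key if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat

record = {"category": {"id": 4, "name": "foobar"}}
flat = flatten(record)

# Write the flattened record as a one-row CSV with dotted column names.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(flat))
writer.writeheader()
writer.writerow(flat)
print(out.getvalue())
```

Going the other way (splitting dotted column names back into nested objects) is equally mechanical, which is Nick's point.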
> Anyway, on this note and in this vein, I've also put together the
> start of a DSPL-inspired data description format, although mine is not
> so much "inspired" by DSPL as lifted wholesale, with the main
> "features" being:
>
> 1) A JSON schema. Parsing anything else on the client side is a
> nightmare, and I'd like to extend the OpenSpending model editor to be
> able to create these dataset-description schemas.

Note that standard dataset metadata should be as per the Data Packages
spec <http://www.dataprotocols.org/en/latest/data-packages.html>
(though that needs refining).
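For reference, a minimal datapackage.json might look something like the following. The field names here are illustrative only; the draft spec at the URL above is the authority and, as noted, is still being refined:

```json
{
  "name": "uk-spending",
  "title": "UK Departmental Spending",
  "files": [
    {"path": "spend.csv", "format": "csv"}
  ]
}
```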

> 2) I don't care about reusability of *concepts* across datasets.
> Semantic web be damned, if dataset owners have to spend two days
> working out which namespace to use to describe their data, they won't
> do it.

Ack.

> At the moment it's called DSPL JSON: see
> https://github.com/nickstenning/dspljson. Comments and criticisms of
> course welcomed on this too!

Looks interesting. SDF is definitely DSPL-inspired rather than a DSPL
port at the moment. As you discuss below, perhaps we should be taking
more from DSPL.

> And now a few comments on SDF:
>
>> The format’s focus is on simplicity and web usage – that is, usage
>> online with access and transmission over HTTP.
>
> I think you should probably stress this differently. I assume you mean
> usage by the client-side (i.e. Javascript parseability) rather than
> transport over HTTP: there are no problems with transmitting gzipped
> XLS over HTTP, but it's not a very useful data format for what we're
> talking about.

Yes and no. For example, there's no way to get at individual parts of
that gzipped XLS file, whereas I could theoretically do a range query
on a CSV file, and I can stream it over the web, which is very
attractive, for example, for the DataProxy service:
<https://github.com/okfn/dataproxy>
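To make the range-query point concrete, here is a minimal sketch of asking a server for just the first kilobyte of a CSV file. The URL is hypothetical:

```python
import urllib.request

# Build a request for only the first 1024 bytes of a remote CSV file
# (hypothetical URL). A server that supports ranges replies with
# 206 Partial Content and just those bytes; the same trick is impossible
# with a gzipped XLS, which must be fully downloaded and decompressed
# before any row is usable.
url = "http://example.org/data.csv"
req = urllib.request.Request(url, headers={"Range": "bytes=0-1023"})

# urllib.request.urlopen(req) would then fetch that slice, letting a
# service like DataProxy parse rows incrementally rather than buffering
# the whole file.
```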

> Other things that occur to me on reading the SDF web page:
>
> 1) You stress that it's inspired by DSPL, but it doesn't actually
> appear to share a great deal with DSPL's data model. Where are the
> distinctions between tables and slices? Can I create a hierarchy of
> dimensions using something like DSPL's "topics"? Is there any support
> for column mapping like DSPL's slice tablerefs? I'm not suggesting you
> should add all these features, I'm just not sure what you've actually
> taken from DSPL other than being a format for describing a dataset.

Some of the ref stuff is there but much of the rest isn't. I'm in two
minds about adding it atm: a) I feel you can do quite a bit of the
richer type definition etc. that DSPL does via JSON-LD and the
simple-type stuff there at the moment, and b) it adds quite a bit of
complexity to the structure without that much benefit. I'd like to
keep SDF as simple as possible. At the same time I want to acknowledge
the inspirational debt to DSPL here.

> 2) The choice you've made that causes me most concern is to have a
> schema file per data file. That makes actually consuming one of these
> datasets substantially more difficult for a client-side application.
> How do I know which schema files exist to start with? Will I need to
> create my own "index.json"? If so, you should specify the format of
> that file too.

There is an idea of having a manifest as part of datapackage.json, but
I was imagining you could just list the directory contents and pick
out the CSV files. At the same time we could consolidate the schema
stuff into one big file if that were useful (the current separation
was to make it easier to just drop some data from, e.g., a different
SDF package into your own).
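The "just list the directory" approach can be sketched like this. The `<name>.csv` / `<name>.json` sibling-file pairing is my assumption about the layout, not something the spec mandates:

```python
from pathlib import Path

def discover(package_dir):
    """Map each CSV file in an SDF package directory to its sibling
    schema file, assuming a spend.csv / spend.json naming convention
    (an assumption, not a spec requirement)."""
    pairs = {}
    for csv_path in sorted(Path(package_dir).glob("*.csv")):
        schema_path = csv_path.with_suffix(".json")
        pairs[csv_path.name] = schema_path.name if schema_path.exists() else None
    return pairs
```

A client-side consumer can't list a remote directory over plain HTTP, though, which is why a manifest inside datapackage.json would still help.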

> 3) As suggested earlier, I'm deeply dubious of the linked data aspects
> of DSPL. You've gone for compatibility with JSON-LD, which steals the
> "@type" attribute from you, reducing the piece of information that
> *really* matters (@simpletype, simpletype, or simple_type, depending
> where in the documentation you look) to a second-class citizen. Will
> anyone really implement support for the @type field?

It allows you to say some richer stuff, e.g. that this is a year, or
that this is a dc:title, and also to use richer elements. That said,
people can simply not use it.

Re simple type: we could change it to 'type' and keep @type for the
JSON-LD stuff (my only worry was confusion between the two ...)
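Concretely, a field description along these lines would keep the two concerns separate, with plain 'type' carrying the information every consumer needs and @type as an optional JSON-LD enrichment. The field names and URIs here are illustrative, not normative:

```json
{
  "fields": [
    {
      "id": "title",
      "type": "string",
      "@type": "http://purl.org/dc/terms/title"
    },
    {
      "id": "year",
      "type": "integer",
      "@type": "http://www.w3.org/2001/XMLSchema#gYear"
    }
  ]
}
```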

> I think this is a really important problem to solve, hence my toying
> with dspljson. I think that DSPL really gets a lot of this right, with
> two exceptions: XML, and Linked Data (which is an optional part of
> DSPL anyway).

So I'm not huge on linked data, but I think the JSON-LD approach
allows a) a nice upgrade path and b) a way to re-use some of the rich
type stuff very easily.

> Anyway, SDF is a good talking point, and I hope some of the comments
> above are of interest.

Most definitely.

Rufus


