[data-protocols] Simple Data Format: straw man

Friedrich Lindenberg friedrich.lindenberg at okfn.org
Wed May 16 15:29:21 BST 2012


Hi all,

this is becoming quite complex, but I don't think we have our use
cases straight at all yet. So there are three different problems we
could try to solve:

1) Provenance metadata (as per @Francis). Different discussion.
2) Column-level metadata, typed CSV (SDF, as far as I understand)
3) A logical model of a dataset which can then be represented in CSV

Out of these, I think the only thing that really adds the necessary
value while being largely unresolved is #3. Unfortunately, it requires
some degree of #1 and #2. What I mean by logical model is: something
that

- describes which entities are represented (please don't kill me
over the noun; I want to cover both dimensions and facts),
- which attributes they have,
- what roles those attributes play (measures, primary keys) and
- which links combine the different entities.

This is fairly abstracted from the underlying representation. That's why
it would enable the representation of data from multiple, normalized
sources. As a benefit, this then enables analytics and provides enough
metadata to drive things like visualization builders, record linkers,
de-duplication across several attributes, etc. etc.
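To make that concrete, here is a sketch of what such a logical model might look like in JSON. The field names and structure are purely illustrative guesses of mine, not part of any existing spec:

```json
{
  "entities": {
    "region": {
      "kind": "dimension",
      "attributes": {
        "code": {"type": "string", "role": "primary-key"},
        "name": {"type": "string"}
      }
    },
    "spending": {
      "kind": "fact",
      "attributes": {
        "amount": {"type": "number", "role": "measure"},
        "region": {"type": "string", "role": "link", "target": "region.code"}
      }
    }
  }
}
```

Each entity could then be stored as one CSV file, with the roles telling consumers which columns are measures, keys, or links.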

Throughout all of this, I believe the format has to be opinionated and
pragmatic. We have to make some decisions without developing a theory
of knowledge. That means answering the type question once, not
providing 15 answers like the current SDF proposal does.

If we can in any way agree not to use linked data, I could avert my
suicidal tendencies associated with interacting with that community.
This goes as far as not re-using existing RDF ontologies, because that
brings them to your home and they will come at night and demand you
triplify your own family.

As for the storage format, I think CSV doesn't open all the doors you
may want on your caravan, but it *works*. Once you become
format-agnostic (or even just slightly polyamorous, i.e. JSON as
well), you have to build a "set of format factories" into your
implementation again, which makes it a non-solution because you cannot
rely on consistent support. I have dBase files and I'm not afraid to
use them!

In general, I think we should work up the requirements a bit more, but
I think for the kinds of cases that we're mostly looking at, DSPL/JSON
may be a very good starting point that we can use and develop. This
would keep it lean, not requiring people to buy into a whole data
package idea or triplification theory.

Cheers,

 - Friedrich

On Wed, May 16, 2012 at 2:52 PM, Rufus Pollock <rufus.pollock at okfn.org> wrote:
> On 15 May 2012 12:24, Nick Stenning <nick at whiteink.com> wrote:
>>> On Tuesday, May 15, 2012 11:18:56 AM Chris Taggart <countculture at gmail.com> wrote:
>>>
>>> Only allowing CSV for the data ... because it's heavily nested (... there's no practical
>>> benefit to normalising it into individual tables).
>>
>> Not sure I follow this. Either you normalise it into individual tables
>> (not terribly easy but a must for key concepts) or it's denormalised
>> in either JSON or CSV format. Is there really a problem with
>> converting between
>>
>>    { "category": { "id": 4, "name": "foobar" } }
>>
>> and
>>
>>    category.id,category.name
>>    4,foobar
>>
>> ?
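A minimal sketch of that round trip in Python, assuming dot-joined column names, no arrays, and uniformly shaped records (my illustration, not part of the quoted proposal):

```python
import csv
import io

def flatten(record, prefix=""):
    """Flatten a nested dict into dot-joined column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

def to_csv(records):
    """Write a list of (equally shaped) nested records as CSV text."""
    rows = [flatten(r) for r in records]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# flatten({"category": {"id": 4, "name": "foobar"}})
# → {"category.id": 4, "category.name": "foobar"}
```

The inverse (splitting headers on "." and rebuilding nested dicts) is equally mechanical, which is Nick's point.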
>>
>> Anyway, on this note and in this vein, I've also put together the
>> start of a DSPL-inspired data description format, although mine is not
>> so much "inspired" by DSPL as lifted wholesale, with the main
>> "features" being:
>>
>> 1) A JSON schema. Parsing anything else on the client side is a
>> nightmare, and I'd like to extend the OpenSpending model editor to be
>> able to create these dataset-description schemas.
>
> Note that standard dataset metadata should be as per the Data packages
> spec <http://www.dataprotocols.org/en/latest/data-packages.html>
> (though that needs refining)
>
>> 2) I don't care about reusability of *concepts* across datasets.
>> Semantic web be damned, if dataset owners have to spend two days
>> working out which namespace to use to describe their data, they won't
>> do it.
>
> Ack.
>
>> At the moment it's called DSPL JSON: see
>> https://github.com/nickstenning/dspljson. Comments and criticisms of
>> course welcomed on this too!
>
> Looks interesting. SDF is definitely DSPL inspired rather than a DSPL
> port at the moment. As you discuss below perhaps we should be taking
> more from DSPL.
>
>> And now a few comments on SDF:
>>
>>> The format’s focus is on simplicity and web usage – that is, usage
>>> online with access and transmission over HTTP.
>>
>> I think you should probably stress this differently. I assume you mean
>> usage by the client-side (i.e. Javascript parseability) rather than
>> transport over HTTP: there are no problems with transmitting gzipped
>> XLS over HTTP, but it's not a very useful data format for what we're
>> talking about.
>
> Yes and no. For example, there's no way to get at individual parts of
> that gzipped XLS file, whereas I could theoretically do a range query
> on a CSV file and stream it over the web, which is very attractive,
> for example, for the DataProxy service:
> <https://github.com/okfn/dataproxy>.
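The streaming point can be illustrated with a few lines of Python: a CSV consumer can emit complete rows as text chunks arrive, without ever holding the whole file. This sketch (names mine) uses an in-memory chunk list standing in for a chunked or ranged HTTP response body, and is simplified to assume no quoted newlines inside fields:

```python
import csv
import io

def iter_csv_rows(chunks):
    """Yield parsed CSV rows incrementally as text chunks arrive.

    Simplified: assumes fields contain no embedded newlines."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Keep any trailing partial line in the buffer for the next chunk.
        *complete, buffer = buffer.split("\n")
        for line in complete:
            if line.strip():
                yield next(csv.reader(io.StringIO(line)))
    if buffer.strip():
        yield next(csv.reader(io.StringIO(buffer)))

# Simulated network chunks arriving mid-line:
chunks = ["id,amou", "nt\n1,10\n2,", "20\n"]
rows = list(iter_csv_rows(chunks))
# → [['id', 'amount'], ['1', '10'], ['2', '20']]
```

Nothing comparable is possible with a gzipped XLS file, where you need the whole archive before you can parse anything.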
>
>> Other things that occur to me on reading the SDF web page:
>>
>> 1) You stress that it's inspired by DSPL, but it doesn't actually
>> appear to share a great deal with DSPL's data model. Where are the
>> distinctions between tables and slices? Can I create a hierarchy of
>> dimensions using something like DSPL's "topics"? Is there any support
>> for column mapping like DSPL's slice tablerefs? I'm not suggesting you
>> should add all these features, I'm just not sure what you've actually
>> taken from DSPL other than being a format for describing a dataset.
>
> Some of the ref stuff is there but much of the rest isn't. I'm in two
> minds about adding it atm: a) I feel you can get quite a bit of the
> richer type definition that DSPL does via JSON-LD and the simple type
> stuff there at the moment; b) it adds quite a bit of complexity to the
> structure without that much benefit. I'd like to keep SDF as simple as
> possible. At the same time I want to acknowledge the inspirational
> debt to DSPL here.
>
>> 2) The choice you've made that causes me most concern is to have a
>> schema file per data file. That makes actually consuming one of these
>> datasets substantially more difficult for a client-side application.
>> How do I know which schema files exist to start with? Will I need to
>> create my own "index.json"? If so, you should specify the format of
>> that file too.
>
> There is an idea of having a manifest as part of datapackage.json, but
> I was imagining you could just list the directory contents and pick
> out the CSV files. At the same time we could consolidate the schema
> stuff into one big file if that were useful (the current separation
> was to make it easier to just drop in some data from, e.g., a
> different SDF package into your own).
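For illustration, such a manifest entry in datapackage.json might look like the sketch below. The field names are my guess at a plausible shape, not the actual Data Packages spec:

```json
{
  "name": "example-package",
  "resources": [
    {"path": "spending.csv", "schema": "spending.schema.json"},
    {"path": "regions.csv",  "schema": "regions.schema.json"}
  ]
}
```

With a resource list like this, a client-side consumer never needs to guess which schema files exist or list a directory.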
>
>> 3) As suggested earlier, I'm deeply dubious of the linked data aspects
>> of DSPL. You've gone for compatibility with JSON-LD, which steals the
>> "@type" attribute from you, reducing the piece of information that
>> *really* matters (@simpletype, simpletype, or simple_type, depending
>> where in the documentation you look) to a second-class citizen. Will
>> anyone really implement support for the @type field?
>
> It allows you to say some richer stuff, e.g. that this is a year or a
> dc:title, as well as richer elements. That said, people can simply
> not use it.
>
> Re the simple type, we could change it to 'type' and have @type for
> the JSON-LD stuff (my only worry was confusion between the two ...)
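That split might look like this for a single column definition (illustrative only; neither the key names nor the layout are settled in the spec), with `type` carrying the plain simple type and `@type` the optional JSON-LD annotation:

```json
{
  "id": "title",
  "type": "string",
  "@type": "dc:title"
}
```

A consumer that knows nothing about JSON-LD can ignore `@type` entirely and still get a usable typed column.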
>
>> I think this is a really important problem to solve, hence my toying
>> with dspljson. I think that DSPL really gets a lot of this right, with
>> two exceptions: XML, and Linked Data (which is an optional part of
>> DSPL anyway).
>
> So I'm not huge on linked data but I think JSON-LD approach allows a)
> a nice upgrade approach b) a way to re-use some of the rich type stuff
> very easily.
>
>> Anyway, SDF is a good talking point, and I hope some of the comments
>> above are of interest.
>
> Most definitely.
>
> Rufus
>
> _______________________________________________
> data-protocols mailing list
> data-protocols at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/data-protocols


