[data-protocols] Simple Data Format: straw man

Friedrich Lindenberg friedrich.lindenberg at okfn.org
Wed May 16 21:05:35 BST 2012


Hey Rufus,

didn't mean to sound harsh, just freaked out by things here.

On Wed, May 16, 2012 at 9:05 PM, Rufus Pollock <rufus.pollock at okfn.org> wrote:
> On 16 May 2012 15:29, Friedrich Lindenberg
> <friedrich.lindenberg at okfn.org> wrote:
>> Hi all,
>>
>> this is becoming quite complex, but I don't think we have our use
>> cases straight at all yet. So there are three different problems we
>> could try and solve:
>>
>> 1) Provenance metadata (as per @Francis). Different discussion.
>> 2) Column-level metadata, typed CSV (SDF, as far as I understand)
>> 3) A logical model of a dataset which can then be represented in CSV
>
> SDF could be extended somewhat to support (3) -- see response to Nick.
> I'd held back because I'm not sure about complexity that full DSPL
> brings.

I don't think this is about add-on understanding; it's about the very
basic way you start describing the data: you either go from content to
representation or from representation to content. I think a useful
format would do the former and describe the logical structure of the
data cleanly. That just seems like the sounder approach, and it allows
much better work with the data.

(cf. http://en.wikipedia.org/wiki/Logical_data_model)
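To make the content-first direction concrete, here's a minimal sketch in Python. All the names (entities, attributes, measures, links) are invented for illustration, not a spec proposal; the point is that the physical CSV layout falls out of the logical model, rather than the other way around:

```python
# Content-first: describe entities and their attributes, then derive
# the physical representation (CSV columns) from the logical model.
# All key names here are illustrative assumptions, not a proposed spec.

logical_model = {
    "entities": {
        "spending": {
            "attributes": ["date", "amount", "cofog_code"],
            "measures": ["amount"],
        },
        "cofog": {
            "attributes": ["code", "label"],
            "key": "code",
        },
    },
    "links": [
        # spending.cofog_code points at cofog.code
        {"from": ("spending", "cofog_code"), "to": ("cofog", "code")},
    ],
}

def csv_header(model, entity):
    """Derive the CSV header line for one entity from the logical model."""
    return ",".join(model["entities"][entity]["attributes"])

# e.g. csv_header(logical_model, "spending") -> "date,amount,cofog_code"
```

Going representation-first, you'd instead start from the columns of one CSV file and bolt the meaning on afterwards.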

>> Out of these, I think the only thing that really adds the necessary
>> value while being largely unresolved is #3. Unfortunately, it requires
>> some degree of #1 and #2. What I mean by logical model is: something
>> that
>
> What currently resolves (2) for you?

I didn't say it was resolved, I just don't think it adds that much
value. In a way, most of what it does is solve the very specific
problem of presenting data in a table in a more custom way than what
you could guess. That's nice, but really, really limited.

Very soon, you'll want to add information about hierarchies,
relations, etc., and then at some point your format will look like a
coat worn inside out: the most general concept will be in the deepest
part of the model.

>> - describes which entities that are represented (please don't kill me
>> over the noun, I want to cover both dimensions and facts),
>> - which attributes they have,
>> - what roles those attributes play (measures, primary keys) and
>> - which links combine the different entities.
>
> [...]
>
>> In all of this, I believe this has to be opinionated and pragmatic. We
>> have to make some decisions without developing a theory of knowledge.
>> That means you need to answer the type question, not provide 15
>> answers like the current SDF proposal.
>
> I think it's 2 (without the JSON-LD support it would be 1). I'm happy
> to drop JSON-LD support but I do think it is a nice option.

Sorry about the polemic here. I think type: and format: are very useful.
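To illustrate what type: and format: buy you when reading a typed CSV, here's a minimal sketch; the schema keys and type names are assumptions for illustration, not the actual SDF proposal:

```python
# Sketch of SDF-style column metadata applied to a CSV: each field
# carries a "type" and an optional "format", and the reader casts
# string cells accordingly. Schema shape is assumed, not spec'd.
import csv
import io
from datetime import datetime

schema = {
    "fields": [
        {"name": "date", "type": "date", "format": "%Y-%m-%d"},
        {"name": "amount", "type": "number"},
    ]
}

# One caster per declared type; unknown columns fall back to string.
CASTS = {
    "date": lambda v, fmt: datetime.strptime(v, fmt or "%Y-%m-%d").date(),
    "number": lambda v, fmt: float(v),
    "string": lambda v, fmt: v,
}

def read_typed_csv(text, schema):
    """Parse CSV text, casting each column per its schema entry."""
    fields = {f["name"]: f for f in schema["fields"]}
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            name: CASTS[fields.get(name, {"type": "string"})["type"]](
                value, fields.get(name, {}).get("format"))
            for name, value in row.items()
        }

rows = list(read_typed_csv("date,amount\n2012-05-16,42.5\n", schema))
```

That's the whole trick: you get real dates and numbers instead of guessing from strings.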

>> If we can in any way agree not to use linked data, I could avert my
>> suicidal tendencies associated with interacting with that community.
>> This goes as far as not re-using existing RDF ontologies, because that
>> brings them to your home and they will come at night and demand you
>> triplify your own family.
>
> You're a clear -1 on JSON-LD support :-)
>
>> As for the storage format, I think CSV doesn't open all the doors you
>> may want on your caravan, but it *works*. Once you become agnostic (or
>> even just slightly polyamorous, i.e. JSON), you have to go into a "set
>> of format factories" thing in your implementation again, which just
>> makes it a non-solution because you cannot rely upon support. I have
>> dBase files and I'm not afraid to use them!
>
> OK we have a clear -1 on any JSON support for the data transport. And
> I definitely agree with having fewer (preferably) one way to do
> things.
>
>> In general, I think we should work up the requirements a bit more, but
>> I think for the kinds of cases that we're mostly looking at, DSPL/JSON
>> may be a very good starting point that we can use and develop. This
>> would keep it lean, not requiring people to buy into a whole data
>> package idea or triplification theory.
>
> I'm not sure there's a "whole data package idea" -- it's pretty basic.
>
> Using data package metadata was just the equivalent of the basic
> metadata that DSPL ships but meant we could reuse some of the Data
> Package idea (SDF stuff would be data packages but not all data
> packages would be SDF).

Right. My concern is that it may be orthogonal to this concept: a data
model may refer to tables in several data packages, e.g. a core facts
table, the main CSV from the cofog package, and the iso-3166 package's
2-char codesheet. Of course you can work around this by inlining
things, but that just removes flexibility for the sake of sticking to
a concept. This is not to say we shouldn't recommend the same core set
of metadata fields.
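A rough sketch of what I mean — one model pulling tables from several packages. The "package" and "resource" keys are hypothetical, just to show the shape:

```python
# A single data model referencing tables that live in different data
# packages (per the cofog / iso-3166 example above). Key names are
# hypothetical, not part of any proposed spec.

model = {
    "tables": {
        "facts": {"package": "my-spending", "resource": "facts.csv"},
        "cofog": {"package": "cofog", "resource": "cofog.csv"},
        "countries": {"package": "iso-3166", "resource": "codes-2char.csv"},
    }
}

def packages_referenced(model):
    """Distinct data packages a single model pulls its tables from."""
    return sorted({t["package"] for t in model["tables"].values()})
```

If the model can only point inside its own package, you're forced to inline copies of cofog and iso-3166 into every package that uses them.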

Cheers,

 - Friedrich


