[data-protocols] Simple Data Format: straw man

Wed May 16 11:02:14 BST 2012

On 15 May 2012 18:47, Francis Irving <francis at flourish.org> wrote:
> *Off the top of my head* reactions...
>
> a) Absolutely *love* that it is backwards compatible with CSV. That's the
> current format people use.

That or XLS (sadly ...)

> b) Separate schema seems overkill. Seriously, you need to say
> something is an integer or a date explicitly? Nobody cares. It's
> almost an insult to how much more you need to know to really
> understand a dataset (source, quality, methodology of collection,
> up-to-dateness etc.). It's almost always going to be inferable from
> content, or could be gotten by hints in column headings.

The separate schema is optional and, IME, you frequently can't infer
things from the CSV alone, unfortunately. E.g. what about when 2 cols
are lon/lat, or a single col is lon/lat [1]? And dates can be
absolutely hellish. I'd note we are not requiring originators of data
to generate the schema -- it could be auto-generated and then corrected
by someone else later (or by the originator).

[1]: E.g. see http://datahub.io/dataset/crime-data-sf/resource/d71c4c74-b81c-4d31-a339-b214b2f95d01
- it's obvious to a human that the location field is a bracketed
lon/lat (or is it lat/lon?), but it is non-trivial for a computer to
guess that, though the field name is a helpful hint ...
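
To make the inference problem concrete, here is a rough Python sketch.
The sample values and function names are made up for illustration (they
are not taken from the dataset linked above), and the two date formats
tried are just an assumed shortlist. It shows that a date string or a
bracketed coordinate pair can have more than one valid reading, which is
the gap an explicit schema would close.

```python
from datetime import datetime


def candidate_date_formats(value, formats=("%d/%m/%Y", "%m/%d/%Y")):
    """Return every format in `formats` that parses `value` successfully."""
    matches = []
    for fmt in formats:
        try:
            datetime.strptime(value, fmt)
            matches.append(fmt)
        except ValueError:
            # This format does not fit the value; try the next one.
            pass
    return matches


def parse_bracketed_pair(value):
    """Split a string like "(37.77, -122.42)" into two floats (order unknown)."""
    first, second = value.strip().strip("()").split(",")
    return float(first), float(second)


def plausible_orders(pair):
    """List which readings of a coordinate pair fall within valid ranges."""
    first, second = pair
    orders = []
    if -90 <= first <= 90 and -180 <= second <= 180:
        orders.append("lat/lon")
    if -180 <= first <= 180 and -90 <= second <= 90:
        orders.append("lon/lat")
    return orders
```

A value such as "03/04/2012" matches both date formats, and a pair such
as (37.77, -12.4) is valid in either order, so content alone can't
settle it; only some values (day > 12, or abs(value) > 90) happen to
disambiguate.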

> c) I think I'd be happier with something where I put URLs in the
> column headings and/or id field as a minimal-but-useful amount of data
> linking.

Would that be in the source CSV or in the schema file? If in the CSV,
would we have a double-row header or ...?
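
For the sake of discussion, this is one way a double-row header might be
read, sketched in Python. Everything here is an assumption: the layout
(names in row 1, one URL per column in row 2), the example.org URLs, the
sample data and the function name are all hypothetical, not part of any
proposal so far.

```python
import csv
import io

# Hypothetical file: row 1 holds the usual column names,
# row 2 holds one URL per column, and data starts at row 3.
SAMPLE = """id,date,amount
http://example.org/def/id,http://example.org/def/date,http://example.org/def/amount
1,2012-05-16,10.5
"""


def read_double_header(text):
    """Return (column name -> URL mapping, list of data rows as dicts)."""
    rows = csv.reader(io.StringIO(text))
    names = next(rows)   # first header row: human-readable names
    urls = next(rows)    # second header row: URLs, assumed one per column
    column_urls = dict(zip(names, urls))
    records = [dict(zip(names, row)) for row in rows]
    return column_urls, records
```

The catch is that a reader expecting a single header row would treat the
URL row as the first data record, which cuts against the plain-CSV
compatibility in (a).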

> But the above isn't a criticism, just what it made me think.

:-)

Rufus