[data-protocols] Simple Data Format: straw man

Wed May 16 12:24:38 BST 2012

Thinking about use cases here...

I can imagine wanting to move simple datasets like this between
datahub.io and Fusion Tables or similar.

Right now the workflow for that consists of:
1) Export a CSV file to your laptop.
2) Upload it to Fusion Tables.
3) Tell Fusion Tables what columns etc. to map.

What would the new workflow be? 

Downloading multiple CSV files and schema files and transferring them
to Fusion Tables (let's assume here we get Google to cooperate and
support the format) seems clunkier and harder than that. Sending them
across directly has other problems (I suspect it needs Web Intents).

Or are you thinking of another kind of use case?

Francis

P.S. Love the search data, and search API stuff on that datahub.io
page!

On Wed, May 16, 2012 at 11:02:14AM +0100, Rufus Pollock wrote:
> On 15 May 2012 18:47, Francis Irving <francis at flourish.org> wrote:
> > *Off the top of my head* reactions...
> >
> > a) Absolutely *love* that it is backwards compatible with CSV. That's the
> > current format people use.
> 
> That or XLS (sadly ...)
> 
> > b) Separate schema seems overkill. Seriously, you need to say
> > something is an integer or a date explicitly? Nobody cares. It's
> > almost an insult to how much more you need to know to really
> > understand a dataset (source, quality, methodology of collection,
> > up-to-dateness etc.). It's almost always going to be inferable from
> > content, or could be gotten by hints in column headings.
> 
> The separate schema is optional and IME frequently you can't infer
> stuff from the CSV alone unfortunately (e.g. what about when 2 cols
> are lon/lat or a single col is lon/lat [1] and dates can be absolutely
> hellish). I'd note we are not requiring originators of data to
> generate the schema -- it could be auto-generated and then corrected
> by someone else later (or by the originator).
> 
> [1]: E.g. see http://datahub.io/dataset/crime-data-sf/resource/d71c4c74-b81c-4d31-a339-b214b2f95d01
> - it's obvious to a human that the location field is a bracketed
> lon/lat (or is it a lat/lon!) but it is non-trivial for a computer to
> guess that, though the field name is a helpful tip ...
> 
> > c) I think I'd be happier with something where I put URLs in the
> > column headings and/or id field as a minimal-but-useful amount of data
> > linking.
> 
> That would be in the source CSV or in the schema file? If in CSV case
> do we have a double row header or ...?
> 
> > But the above isn't a criticism, just what it made me think.
> 
> :-)
> 
> Rufus
>
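
To make the inference question in (b) concrete, here is a minimal sketch of
auto-generating a schema from a CSV by trying parsers in turn. Everything in
it is an assumption for illustration: the type names ("integer", "number",
"date", "string"), the function names, and the sample rows are made up and
are not taken from the straw man or from the datahub.io dataset. It only
tries ISO dates, which is exactly where real data tends to get hellish.

```python
import csv
import io
from datetime import datetime

# Candidate types, tried in order. The names are hypothetical, not from any spec.
PARSERS = (
    ("integer", int),
    ("number", float),
    ("date", lambda v: datetime.strptime(v, "%Y-%m-%d")),  # ISO dates only
)


def infer_type(values):
    """Guess a column type from its string values (naive, illustrative only)."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "string"
    for name, parse in PARSERS:
        try:
            for v in non_empty:
                parse(v)
        except ValueError:
            continue  # at least one value did not parse; try the next type
        return name
    return "string"  # fallback when nothing else fits


def infer_schema(csv_text):
    """Return {column name: guessed type} for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {
        name: infer_type([(row[name] or "") for row in rows])
        for name in (reader.fieldnames or [])
    }


# Made-up sample with a bracketed coordinate pair in one column.
SAMPLE = '''id,date,count,location
1,2012-05-15,3,"(37.77, -122.42)"
2,2012-05-16,5,"(37.80, -122.27)"
'''

if __name__ == "__main__":
    print(infer_schema(SAMPLE))
```

On this sample the id, date and count columns are guessed without help, but
the bracketed location column falls through to "string": nothing in the
values says whether it is lat/lon or lon/lat, so that part would still need a
human (or a hint in the column heading) to correct it afterwards.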