[data-protocols] RFC: JSON Table Schema

Rufus Pollock rufus.pollock at okfn.org
Thu Nov 29 12:53:46 GMT 2012


On 28 November 2012 19:42, Xavier Badosa <xbadosa at gmail.com> wrote:

> Hi Rufus,
>
> Your simple schema for tabular data is interesting: it's similar but more
> powerful than the schema used by the US Census Bureau API:
>
> http://www.census.gov/developers/
>

Just to be absolutely clear: this is a proposal for a Tabular *Schema*, not a
format for transmitting the tabular data itself (the latter is also important
and, while related, is distinct).

Re the census: how do they specify types or other information about the
tabular data they send?


> It's important to notice that many times what is considered "tabular data"
> (in your sense: some fields that are shared by a set of individuals) could
> be better represented in a cube model. Take for example the Census API:
>
> [
>   ["P0010001","NAME","state"],
>   ["710231","Alaska","02"],
>   ["4779736","Alabama","01"],
>   ["2915918","Arkansas","05"],
>   ["6392017","Arizona","04"],
>   ["37253956","California","06"],
>   ...
> ]
>
> Rows in this example have an ID and this ID represents the possible values
> of a "variable" or "dimension" ("state" in the example). Instead of saying
> that this is some tabular data of individuals (that happen to be states) with
> field "population" ("P0010001"), it seems more accurate to see it as a
> table ("table" in the statistical sense, not in the DB sense) or cube of
> population by state. This is a very frequent situation in statistics.
>

Yes, this is essentially trivial: it's CSV (or tabular data) as JSON :-)
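
Just to illustrate that point (a sketch, not part of either proposal): the
quoted payload round-trips to plain CSV, or to the usual row-dict view, with
nothing but the standard library:

```python
import csv
import io

# The quoted Census payload: the first row is the header, the rest are
# data rows -- literally CSV re-encoded as a JSON array of arrays.
payload = [
    ["P0010001", "NAME", "state"],
    ["710231", "Alaska", "02"],
    ["4779736", "Alabama", "01"],
]

# Back to CSV text...
buf = io.StringIO()
csv.writer(buf).writerows(payload)
csv_text = buf.getvalue()

# ...or to a list of dicts keyed by the header row.
records = [dict(zip(payload[0], row)) for row in payload[1:]]
```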


> To solve this special case (tabular data that is actually cubical,
> multidimensional) I have proposed JSON-stat
>
> http://json-stat.org/doc/
>

I've read this previously in detail and think it is really good (we should
really make sure it is added to
http://www.dataprotocols.org/en/latest/data-formats.html).

json-stat of course is both the data itself plus a schema. The relevant
part here therefore would be the schema language. The main differences
AFAICT between json-stat and the proposal here at the moment are:

* dimensions key rather than fields key
* json-stat lists the field ids in an array called ids (rather than making
dimensions an array)
* json-stat has quite a lot on enumerating values
* ability to point to a schema for a given dimension using the uri keyword
(BTW: I think it would be nice to be able to do this for the whole schema
...)
* json-stat requires a cube style setup (possibly a very good thing!) in
that every dataset AFAICT has one and only one metric ("fact").
* (?) json-stat requires a time and geo dimension
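
To make the structural difference concrete, here is a rough sketch in
Python, with abridged and partly hypothetical key names (not the normative
text of either proposal), of the two shapes side by side:

```python
# JSON Table Schema style: a flat "fields" array describing the columns.
table_schema = {
    "fields": [
        {"id": "state", "type": "string"},
        {"id": "P0010001", "type": "integer"},
    ]
}

# json-stat style (heavily simplified): dimension ids listed in an array,
# with per-dimension metadata -- including enumerated category values --
# keyed by id.
cube_schema = {
    "dimension": {
        "id": ["state"],
        "state": {
            "label": "State",
            "category": {"index": {"02": 0, "01": 1}},
        },
    }
}

# A naive mapping from the cube's dimensions back to flat fields:
fields = [{"id": d, "type": "string"} for d in cube_schema["dimension"]["id"]]
```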

One of my feelings at the moment is that partitioning a bit could be useful
- that is, separating the schema from the data from the general metadata.
One reason for suggesting this spec was, for example, that you could use it
with an existing CSV file (which may not even be under your control). I'd
really like ways that we could progressively enhance existing data(sets)
without always needing to entirely transform the underlying data into some
other format.
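
A minimal sketch of what I mean (the schema shape and type names here are
illustrative, not spec text): the schema lives entirely outside the CSV, so
it can annotate and cast a file we never rewrite:

```python
import csv
import io

# Hypothetical schema kept *separate* from the data: it can describe a
# CSV file we don't control, without touching the file itself.
schema = {
    "fields": [
        {"id": "state", "type": "string"},
        {"id": "population", "type": "integer"},
    ]
}

# Stands in for an existing CSV file on disk.
csv_text = "state,population\nAlaska,710231\nAlabama,4779736\n"

# Use the schema only at read time to cast each column.
casts = {"string": str, "integer": int}
types = {f["id"]: casts[f["type"]] for f in schema["fields"]}

rows = [
    {key: types[key](value) for key, value in row.items()}
    for row in csv.DictReader(io.StringIO(csv_text))
]
```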


> Besides, the statistical community uses the SDMX standard for expressing
> statistics and is currently working on a JSON façade (SDMX-JSON). I'm a
> member of the SDMX-JSON group. JSON-stat is used in that group as a
> starting point.
>
> Probably we could benefit from some of your ideas.
>

Likewise :-)

Rufus