[data-protocols] RFC: JSON Table Schema

Xavier Badosa xbadosa at gmail.com
Thu Nov 29 20:02:44 GMT 2012


>
> Just to be absolutely clear this is a proposal for Tabular *Schema* not a
> format for transmitting


Thanks for the clarification. The reason I tend not to distinguish the two
aspects is that in JSON-stat both *can* (but need not) be retrieved
together using the same syntax (JSON), which I think has many benefits. My
idea is that any modern API should support flexible partial responses (the
possibility of including or excluding any selection of nodes of the
document tree). The schema part without the actual data would then amount
to retrieving everything but the "value" property of JSON-stat's full
response.

Of course, the opposite should also be possible (only data). That's why
JSON-stat keeps data and metadata strictly separate: the "value" property
(an array) contains data and only data (in the statistical sense). This is
not the case in CSV-based solutions, where there is an asymmetry in the
treatment of rows and columns.
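To make this concrete, here is a minimal Python sketch of the two partial responses. The dataset itself is invented, and the field names follow the JSON-stat docs in spirit; treat it as an illustration, not a reference implementation:

```python
import json

# A minimal JSON-stat-style dataset (simplified, for illustration only).
full = {
    "population": {
        "label": "Population by state",
        "value": [710231, 4779736],  # data, and only data
        "dimension": {
            "id": ["state"],
            "size": [2],
            "state": {
                "category": {
                    "index": {"AK": 0, "AL": 1},
                    "label": {"AK": "Alaska", "AL": "Alabama"},
                }
            },
        },
    }
}

def schema_only(doc):
    """Partial response: everything but the 'value' arrays (the schema/metadata)."""
    return {name: {k: v for k, v in ds.items() if k != "value"}
            for name, ds in doc.items()}

def data_only(doc):
    """Partial response: only the 'value' arrays (the data)."""
    return {name: ds["value"] for name, ds in doc.items()}

print(json.dumps(schema_only(full), indent=2))
print(data_only(full))
```

Because "value" is the only node that carries data, both partial responses fall out of a single tree-pruning rule.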

> Re the census how do they specify type or other information about the
> tabular data they send?


Apparently this information is not retrievable through the API (you may be
familiar with the 2010 Census Summary File 1 and the 2010 American
Community Survey). You are simply supposed to know the IDs of the
different metrics or concepts (like "B25070_003E" for "Gross Rent as a
Percentage of Household Income") and the classification variable names and
their codes (like "state:06,36" for CA and NY). They absolutely need
something like your tabular schema!
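For illustration, a query against such an API might be assembled like this. The base URL and the "get"/"for" parameter names are assumptions inferred from the IDs mentioned above, not taken from the Census documentation:

```python
from urllib.parse import urlencode

# Hypothetical request: endpoint and parameter names are assumed,
# only the variable IDs and state codes come from the discussion above.
BASE = "http://api.census.gov/data/acs"  # assumed, not a verified endpoint
params = {
    "get": "B25070_003E,NAME",  # Gross Rent as a Percentage of Household Income
    "for": "state:06,36",       # CA and NY
}
url = BASE + "?" + urlencode(params, safe=":,")
print(url)
```

The point is exactly the one made above: nothing in the request or response tells you what "B25070_003E" means; you have to know it in advance.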

> ability to point to a schema for a given dimension using uri keyword
> (BTW: I think it would be nice to be able to do this for the whole schema
> ...)


As I said earlier, I planned to solve this through partial responses
(including all the nodes except the "value" array). Dimensions can be
referenced via URIs because they are usually shared between different
datasets (for example, the list of US states, or of economic activities
according to the NACE, SIC or NAICS systems).
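A rough sketch of what resolving such a shared-dimension reference could look like. Everything here (the URL, the registry standing in for an HTTP fetch, the exact "uri" spelling) is invented for illustration; the keyword itself is Rufus's proposal:

```python
# A shared dimension defined once...
us_states = {"category": {"index": {"CA": 0, "NY": 1},
                          "label": {"CA": "California", "NY": "New York"}}}

# ...and published at a URI.  The registry dict stands in for
# dereferencing that URI over HTTP (hypothetical URL).
registry = {
    "http://example.org/dim/us-states": us_states,
}

def resolve(dim):
    """Inline dimension definitions pass through; URI references are looked up."""
    return registry[dim["uri"]] if "uri" in dim else dim

# Two datasets can now share one code list instead of repeating it.
print(resolve({"uri": "http://example.org/dim/us-states"}))
```

The benefit is the one stated above: a code list like the US states is maintained once and referenced from every dataset that uses it.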

> * json-stat requires a cube style setup (possibly a very good thing!) in
> that every dataset AFAICT has one and only one metric ("fact").


No: a dataset can have more than one metric in JSON-stat, because metrics
are treated as dimensions (this is one of the main differences from DSPL).
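A sketch of how that works: the metric is just another dimension, so one flat "value" array can carry several metrics. The dimension ids and the data values here are invented; the "role" keyword follows the JSON-stat docs, but treat this as an illustration:

```python
# Sketch: two metrics (population and area) in one dataset, with the
# metric modelled as an ordinary dimension ("concept" is an invented id).
dataset = {
    "dimension": {
        "id": ["state", "concept"],
        "size": [2, 2],
        "role": {"geo": ["state"], "metric": ["concept"]},
        "state": {"category": {"index": {"AK": 0, "AL": 1}}},
        "concept": {"category": {"index": {"pop": 0, "area": 1}}},
    },
    # Flat array in row-major order over (state, concept); figures are
    # approximate and only illustrative.
    "value": [710231, 1723337, 4779736, 135767],
}

def cell(ds, **coords):
    """Look up one observation in the flat value array by dimension codes."""
    dim = ds["dimension"]
    idx = 0
    for d, size in zip(dim["id"], dim["size"]):
        idx = idx * size + dim[d]["category"]["index"][coords[d]]
    return ds["value"][idx]

print(cell(dataset, state="AL", concept="pop"))   # → 4779736
```

Adding a third metric is just one more category in the "concept" dimension; nothing about the layout changes.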

> * (?) json-stat requires a time and geo dimension


As the spec stands now, yes: every statistical observation is assumed to
happen at one, and only one, time, and in one, and only one, place. This
aspect has been criticized in the SDMX-JSON group, on the grounds that
some phenomena (like flows) can have more than one geographical dimension.
JSON-stat might need to be updated at some point to reflect this.

X.

PS: Thank you for adding JSON-stat to the Related Work section.

On Thu, Nov 29, 2012 at 1:53 PM, Rufus Pollock <rufus.pollock at okfn.org>wrote:

>
> On 28 November 2012 19:42, Xavier Badosa <xbadosa at gmail.com> wrote:
>
>> Hi Rufus,
>>
>> Your simple schema for tabular data is interesting: it's similar but more
>> powerful than the schema used by the US Census Bureau API:
>>
>> http://www.census.gov/developers/
>>
>
> Just to be absolutely clear this is a proposal for Tabular *Schema* not a
> format for transmitting tabular data itself (the latter is also important
> and while related is distinct).
>
> Re the census how do they specify type or other information about the
> tabular data they send?
>
>
>> It's important to notice that many times what is considered "tabular
>> data" (in your sense: some fields that are shared by a set of individuals)
>> could be better represented in a cube model. Take for example the Census
>> API:
>>
>> [
>>   ["P0010001","NAME","state"],
>>   ["710231","Alaska","02"],
>>   ["4779736","Alabama","01"],
>>   ["2915918","Arkansas","05"],
>>   ["6392017","Arizona","04"],
>>   ["37253956","California","06"],
>>   ...
>> ]
>>
>> Rows in this example have an ID and this ID represents the possible
>> values of a "variable" or "dimension" ("state" in the example). Instead of
>> saying that this is some tabular data of individuals (that happen to be
>> states) with field "population" ("P0010001"), it seems more accurate to see
>> it as a table ("table" in the statistical sense, not in the DB sense) or
>> cube of population by state. This is a very frequent situation in
>> statistics.
>>
>
> Yes, this is essentially trivial: it's CSV (or tabular) as JSON :-)
>
>
>> To solve this special case (tabular data that is actually cubical,
>> multidimensional) I have proposed JSON-stat
>>
>> http://json-stat.org/doc/
>>
>
> I've read this previously in detail and think it is really good (we should
> really make sure it is added to
> http://www.dataprotocols.org/en/latest/data-formats.html).
>
> json-stat of course is both the data itself plus a schema. The relevant
> part here therefore would be the schema language. The main differences
> AFAICT between json-stat and the proposal here at the moment are:
>
> * dimensions key rather than fields key
> * json-stat lists the field ids in an array called ids (rather than making
> dimensions an array)
> * json-stat has quite a lot on enumerating values
> * ability to point to a schema for a given dimension using uri keyword
> (BTW: I think it would be nice to be able to do this for the whole schema
> ...)
> * json-stat requires a cube style setup (possibly a very good thing!) in
> that every dataset AFAICT has one and only one metric ("fact").
> * (?) json-stat requires a time and geo dimension
>
> One of my feelings at the moment is that partitioning a bit could be
> useful - that is separating the schema from the data from the general
> metadata. One reason for suggesting this spec was, for example, that you
> could use it with an existing CSV file (that may not even be under your
> control). I'd really like ways that we could progressively enhance existing
> data(sets) without always needing to entirely transform the underlying data
> in some format.
>
>
>> Besides, the statistical community uses the SDMX standard for expressing
>> statistics and is currently working on a JSON façade (SDMX-JSON). I'm a
>> member of the SDMX-JSON group. JSON-stat is used in that group as a
>> starting point.
>>
>> Probably we could benefit from some of your ideas.
>>
>
> Likewise :-)
>
> Rufus
>