[data-protocols] Extending the JSON Table Schema for scientific applications

Aldcroft, Tom taldcroft at gmail.com
Tue Apr 15 23:47:55 UTC 2014


On Tue, Apr 15, 2014 at 12:13 PM, Rufus Pollock <rufus.pollock at okfn.org> wrote:

> First off: welcome and great to hear from you!
>
> On 15 April 2014 14:25, Tom Aldcroft <taldcroft at gmail.com> wrote:
>
>> Hi -
>>
>> I am working on a standard for text data tables for use in science
>> applications, in particular in the context of the Python astropy package (
>> http://astropy.org).  This includes support for reading and writing
>> ASCII tables in various formats that are common in astronomy (
>> http://astropy.readthedocs.org/en/latest/io/ascii/index.html).
>>
>> The draft proposal I submitted baselines an approach that is very similar
>> to the Tabular Data Package standard in data-protocols.  After discussion
>> we are very interested in adopting the JSON Table Schema from Data
>> Protocols.  See:
>>
>>  https://github.com/taldcroft/astropy-APEs/blob/ape6/APE6.rst
>>  https://github.com/astropy/astropy-APEs/pull/7
>>
>> The question I have is to what extent your organization would be
>> interested in extending the JSON Table Schema standard to include more
>> optional elements that would be common in science applications.  As a rough
>> outline, we would like to see:
>>
>
> Just to be clear are you talking about JSON Table Schema or Tabular Data
> Package? As you know Tabular Data Package is basically "Data Package" +
> JSON Table Schema (for describing the CSVs) + CSV (for the data)
>

The initial focus is on the JSON Table Schema.  We are also considering a
variant of the Tabular Data Package that embeds the Table Schema header
into the CSV data file using # comment characters.  This is still being
debated.  There are good reasons not to create this non-standard CSV, but
there is also an advantage in having one file per dataset.  In science and
engineering it is quite common to embed comments in CSV files, and most
science-oriented parsers support this.
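For concreteness, here is a sketch (purely illustrative, not part of any
current spec) of what such an embedded-header file might look like and how a
reader could split it apart.  The function name and the embedding convention
are hypothetical:

```python
import csv
import io
import json

def read_embedded_schema_csv(text):
    """Split a CSV whose leading '#' comment lines embed a JSON Table
    Schema (a hypothetical convention under discussion, not a standard).

    Returns (schema_dict, list_of_rows)."""
    header_lines = []
    data_lines = []
    for line in text.splitlines():
        if line.startswith("#"):
            # Strip the comment marker; the remainder is schema JSON.
            header_lines.append(line.lstrip("#"))
        else:
            data_lines.append(line)
    schema = json.loads("\n".join(header_lines))
    rows = list(csv.reader(io.StringIO("\n".join(data_lines))))
    return schema, rows

example = """\
# {"fields": [{"name": "time", "type": "number"},
#             {"name": "rate", "type": "number"}]}
time,rate
1.0,3.2
2.0,3.5
"""
schema, rows = read_embedded_schema_csv(example)
```

A plain CSV parser that honors # comments would simply skip the header and
see an ordinary two-column table, which is the main appeal of this layout.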


>
>> At the top level:
>>
>> - "schema-name": optional, name (or URL) of detailed schema which allows
>> for interpretation and validation of the table and header content
>>
>
> Our instinct at the moment is to support this via "profiles":
> https://github.com/dataprotocols/dataprotocols/issues/87
>
> We're a little bit vague about how this would exactly work but the idea is
> that you'd register a profile name and then it would be up to tools to do
> something if they recognize a given profile (so in your case you could have
> a profile "astropy" or similar and then your tools would recognize that and
> do some extra validation).
>

That would work.


>
>
>> - "keywords": optional, list of keyword structures, each of the form
>> {"name" (req'd), "value" (req'd), "unit" (optional), "description"
>> (optional)}
>>
>
> Not sure I get this. Would this be on JSON Table Schema or the Data
> Package level?
>

JSON Table Schema.  Maybe an example would help.  Here is some of the
metadata associated with a photon event table for a Chandra X-ray
Observatory observation:

 --  COMMENT   / Configuration control block --------------------
 --  COMMENT   This FITS file may contain long string keyword values that
 --  COMMENT   are continued over multiple keywords.  The HEASARC convention
 --  COMMENT   uses the & character at the end of each substring which is
 --  COMMENT   then continued on the next keyword which has the name CONTINUE.
0001 ASOLFILE   pcadf223285894N002_asol1.fits    String
0002 THRFILE    acisD1996-11-01evtspltN0002.fits String
0003 ORIGIN     ASC                              String  Source of FITS file
0004 CREATOR    destreak - CIAO 3.4              String  tool that created this output
0005 ASCDSVER   CIAO 3.4                         String  ASCDS version number
0006 MJD_OBS    53398.3176048140                 Real8   Modified Julian date of observation
0007 DS_IDENT   ADS/Sa.CXO#obs/05759             String  dataset identifier
0008 TLMVER     P008                             String  Telemetry revision number (IP&CL)
0009 REVISION   2                                Int4    Processing version of data
0010 CHECKSUM   3MOF5LMC3LMC3LMC                 String  HDU checksum updated 2007-01-04T17:46:24
0011 DATASUM    3082206520                       String  data unit checksum updated 2007-01-04T17:46:24

The details are irrelevant, but this shows the use of "keyword" structures
(e.g. "name": "ASOLFILE", "value": "pcadf223285894N002_asol1.fits", etc.).
It also shows "comments".  In general it would be possible to replace some
of the comments with a reference to a profile, but not entirely.
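As a sketch (purely illustrative, not part of any current spec), a couple of
the keywords and comments above might be encoded in a JSON Table Schema
fragment like this:

```json
{
  "fields": [{"name": "time", "type": "number"}],
  "keywords": [
    {"name": "ORIGIN", "value": "ASC",
     "description": "Source of FITS file"},
    {"name": "MJD_OBS", "value": 53398.3176048140, "unit": "d",
     "description": "Modified Julian date of observation"}
  ],
  "comments": [
    "This FITS file may contain long string keyword values ..."
  ]
}
```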


>
>> - "comments": optional, list of general string comments
>>
>
> Again at what level is this wanted and what's the planned usage? Is this
> comments on particular data fields or columns?
>

Comments on the entire data table.  Having comments on particular fields
could also be useful, and there is one common ASCII table format in
astronomy that supports this.


>
>
>> - "history": optional, list of records indicating processing history
>> (providing provenance of the data in the file).
>>
>
> This is definitely interesting but my concern is what exactly its
> interpretation would be. I note there is a proposal around adding a
> "scripts" field: https://github.com/dataprotocols/dataprotocols/issues/114
>

The interpretation would need to be driven by the associated profile.  Here
"history" is necessarily vague, but it has very wide usage in astronomy,
particularly in pipeline processing outputs.
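As a sketch of what "history" records might look like (the field names and
record contents here are hypothetical, roughly modeled on pipeline outputs
like the Chandra example above):

```json
{
  "history": [
    {"date": "2007-01-04T17:46:24",
     "tool": "destreak - CIAO 3.4",
     "record": "removed readout streak events"},
    {"date": "2007-01-04T17:50:02",
     "tool": "acis_process_events",
     "record": "recomputed event grades"}
  ]
}
```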


>
> BTW: I should emphasize that the Data Package spec in theory allows one to
> add fields as one likes. That said, there is a preference not to
> unnecessarily extend (or to get items into core spec) - and even an
> argument <https://github.com/dataprotocols/dataprotocols/issues/103> that
> we should not allow extension at all ...
>

Yes, I understand this and it makes sense.  If the preference is to keep
the core Data Package spec more minimal, then we will happily define a
meta-standard that starts from Data Package and then specifies the
extension pattern.  What we want is a well-documented standard for encoding
the metadata we need.


>
> In the standard "fields" specification:
>>
>>  - "unit": optional, specifies physical unit of data (e.g m/s)
>>
>
> As you probably know there is a units spec here
> http://dataprotocols.org/units/
>

Yes, I would need to check how that standard fits with the conventions in
wide use in astronomy.  In general, table readers need to accept almost
anything for units because of the many legacy data sets out in the wild.
Writing can be fussier.
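That read/write asymmetry can be sketched as follows (the function names and
the unit vocabulary here are hypothetical, for illustration only): a lenient
reader stores whatever unit string it finds, while a fussy writer validates
against a known vocabulary before emitting.

```python
# Illustrative subset of a recognized-unit vocabulary (hypothetical).
KNOWN_UNITS = {"m", "s", "m/s", "erg", "angstrom"}

def read_unit(unit_string):
    # Lenient: legacy files may contain anything, so keep it verbatim
    # (modulo surrounding whitespace).
    return unit_string.strip()

def write_unit(unit_string):
    # Fussy: refuse to emit a unit we cannot interpret.
    if unit_string not in KNOWN_UNITS:
        raise ValueError(f"unrecognized unit: {unit_string!r}")
    return unit_string
```

This keeps old data readable without letting unparseable unit strings
propagate into newly written files.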


> Would definitely be possible to allow this as an optional enhancement.
> Alternative would be something linked-data-y in terms of the type field -
> see
>

Was there a link missing here?


>
>
>>  - "dtype": optional, detailed data type which indicates a specific
>> binary representation (e.g. float32, int8, uint8, float64).  This can be
>> important in numerical applications and is required to round-trip a data
>> structure to file and back with minimal information loss.
>>
>
> What's the exact use case for dtype beyond current type. How important is
> it (nowadays) to distinguish different types of floats or ints?
>

In numerical work it's quite important.  The basis of most science /
engineering analysis is still C / C++ and Fortran libraries.  That's true
of Python (numpy) as well, so if a user has declared a particular type like
float32 or uint8 then it's frequently important to get that type back.  The
goal is high-fidelity data storage and interchange.
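The representational difference is easy to demonstrate with just the
standard library (a minimal sketch using struct, standing in for the
float32/float64 columns a numpy-based reader would produce):

```python
import struct

# Python floats are 64-bit.  Packing a value as 32-bit ('f') and
# unpacking it shows the representation a float32 column actually
# carries after a round-trip through storage.
value = 0.1
as_float32 = struct.unpack("f", struct.pack("f", value))[0]

# 0.1 is not exactly representable in binary; the 32-bit round-trip
# yields a different (less precise) value...
assert as_float32 != value

# ...whereas the 64-bit ('d') round-trip is exact.
assert struct.unpack("d", struct.pack("d", value))[0] == value
```

So a reader that silently widens a declared float32 column to float64 (or
vice versa) changes the bit-level representation, which is exactly what a
"dtype" field would guard against.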


>
>
>> At this point I'm mostly interested in general discussion of whether it's
>> worth opening a pull request to extend the JSON Table Schema in the
>> direction I've outlined, with details TBD.
>>
>
> Sounds good. I also note a lot of discussion goes on in the issue tracker
> at https://github.com/dataprotocols/dataprotocols/issues
>

Sounds good.  I'll take this over to the issue tracker, perhaps with a more
specific proposal, and we'll see where that goes.

Cheers,
Tom


>
> Rufus
>
>