[data-protocols] Extending the JSON Table Schema for scientific applications

Rufus Pollock rufus.pollock at okfn.org
Wed Apr 16 14:51:40 UTC 2014


On 16 April 2014 00:47, Aldcroft, Tom <taldcroft at gmail.com> wrote:
>
>
> On Tue, Apr 15, 2014 at 12:13 PM, Rufus Pollock <rufus.pollock at okfn.org>wrote:
>
>> First off: welcome and great to hear from you!
>>
>> On 15 April 2014 14:25, Tom Aldcroft <taldcroft at gmail.com> wrote:
>>
>>> Hi -
>>>
>>> I am working on a standard for text data tables for use in science
>>> applications, in particular in the context of the Python astropy package (
>>> http://astropy.org).  This includes support for reading and writing
>>> ASCII tables in various formats that are common in astronomy (
>>> http://astropy.readthedocs.org/en/latest/io/ascii/index.html).
>>>
>>> The draft proposal I submitted baselines an approach that is very
>>> similar to the Tabular Data Package standard in data-protocols.  After
>>> discussion we are very interested in adopting the JSON Table Schema from
>>> Data Protocols.  See:
>>>
>>>  https://github.com/taldcroft/astropy-APEs/blob/ape6/APE6.rst
>>>  https://github.com/astropy/astropy-APEs/pull/7
>>>
>>> The question I have is to what extent your organization would be
>>> interested in extending the JSON Table Schema standard to include more
>>> optional elements that would be common in science applications.  As a rough
>>> outline, we would like to see:
>>>
>>
>> Just to be clear are you talking about JSON Table Schema or Tabular Data
>> Package? As you know Tabular Data Package is basically "Data Package" +
>> JSON Table Schema (for describing the CSVs) + CSV (for the data)
>>
>
> The initial focus is on the JSON Table Schema.  We are considering a
> variant on the Tabular Data Package that embeds the Table Schema header
> into the CSV data file using comment # characters.  This is being debated.
>  There are good reasons not to create this non-standard CSV, but there is
> also an advantage in having one file per dataset.  In science and engineering
> it is quite common to embed comments into CSV files, and most science-based
> parsers support this.
>

The original version of Tabular Data Package did have a serialization into
the CSV, but my feeling is that the costs definitely outweigh the benefits
(basically you "break" the data file for everyone who isn't using the
metadata, you can't use other standard tools on the CSV without hassle, and
metadata and data can and should be separate).
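
For concreteness, here is a rough sketch of the one-file variant being
debated (a '#'-prefixed JSON Table Schema header on top of the CSV) plus a
tolerant reader. The layout and the names are made up for illustration, not
any agreed convention:

    # events.csv -- hypothetical one-file layout (illustrative only):
    #
    #   # {"fields": [{"name": "mjd", "type": "number"},
    #   #             {"name": "counts", "type": "integer"}]}
    #   mjd,counts
    #   53398.31,42

    import csv
    import io
    import json

    def read_embedded_table(path):
        """Split a '#'-prefixed JSON Table Schema header from the CSV body."""
        schema_lines, body = [], []
        with open(path) as f:
            for line in f:
                if line.startswith("#"):
                    schema_lines.append(line.lstrip("# ").rstrip("\n"))
                else:
                    body.append(line)
        schema = json.loads("".join(schema_lines)) if schema_lines else None
        rows = list(csv.DictReader(io.StringIO("".join(body))))
        return schema, rows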


>>> At the top level:
>>>
>>> - "schema-name": optional, name (or URL) of detailed schema which allows
>>> for interpretation and validation of the table and header content
>>>
>>
>> Our instinct at the moment is to support this via "profiles":
>> https://github.com/dataprotocols/dataprotocols/issues/87
>>
>> We're a little bit vague about how this would exactly work but the idea
>> is that you'd register a profile name and then it would be up to tools to
>> do something if they recognize a given profile (so in your case you could
>> have a profile "astropy" or similar and then your tools would recognize
>> that and do some extra validation).
>>
>
> That would work.
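
For the record, assuming the profile idea lands as a top-level key (the name
"profile" is not settled; see issue #87 above), the schema might carry
something like:

    {
      "profile": "astropy",
      "fields": [
        {"name": "MJD_OBS", "type": "number"},
        {"name": "counts", "type": "integer"}
      ]
    }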
>
>
>>
>>
>>> - "keywords": optional, list of keyword structures which includes
>>> {"name" (req'd), "value" (req'd), "unit" (optional), "description"
>>> (optional)}
>>>
>>
>> Not sure I get this. Would this be on JSON Table Schema or the Data
>> Package level?
>>
>
> JSON Table Schema.  Maybe an example would help.  Here is some of the
> metadata associated with a photon event table for a Chandra X-ray
> Observatory observation:
>
>  --  COMMENT   / Configuration control block --------------------
>  --  COMMENT   This FITS file may contain long string keyword values that
>  --  COMMENT   are continued over multiple keywords.  The HEASARC convention
>  --  COMMENT   uses the & character at the end of each substring which is
>  --  COMMENT   then continued on the next keyword which has the name CONTINUE.
> 0001 ASOLFILE   pcadf223285894N002_asol1.fits     String
> 0002 THRFILE    acisD1996-11-01evtspltN0002.fits  String
> 0003 ORIGIN     ASC                               String  Source of FITS file
> 0004 CREATOR    destreak - CIAO 3.4               String  tool that created this output
> 0005 ASCDSVER   CIAO 3.4                          String  ASCDS version number
> 0006 MJD_OBS    53398.3176048140                  Real8   Modified Julian date of observation
> 0007 DS_IDENT   ADS/Sa.CXO#obs/05759              String  dataset identifier
> 0008 TLMVER     P008                              String  Telemetry revision number (IP&CL)
> 0009 REVISION   2                                 Int4    Processing version of data
> 0010 CHECKSUM   3MOF5LMC3LMC3LMC                  String  HDU checksum updated 2007-01-04T17:46:24
> 0011 DATASUM    3082206520                        String  data unit checksum updated 2007-01-04T17:46:24
>
> The details are irrelevant but this shows the use of "keyword" structures
> (e.g. "name": "ASOLFILE", "value": "pcadf223285894N002_asol1.fits", etc).
>  It also shows "comments".  In general it would be possible to replace some
> of the comments with a reference to a profile, but not entirely.
>

Makes sense.

For me, none of this would go in JSON Table Schema; it would go one level up,
on the resource object in the (Tabular) Data Package. That's partly why I
asked about JSON Table Schema vs Data Package above.
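
To make that concrete, here is one possible JSON rendering of a couple of
the entries above, using the proposed optional "keywords" and "comments"
elements (the "unit" value is a guess, and which object carries these is
exactly the open question):

    {
      "keywords": [
        {"name": "ORIGIN", "value": "ASC",
         "description": "Source of FITS file"},
        {"name": "MJD_OBS", "value": 53398.3176048140, "unit": "d",
         "description": "Modified Julian date of observation"}
      ],
      "comments": ["This FITS file may contain long string keyword values ..."]
    }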


>
>
>>
>>> - "comments": optional, list of general string comments
>>>
>>
>> Again, at what level is this wanted, and what's the planned usage? Is this
>> comments on particular data fields or columns?
>>
>
> Comments on the entire data table.  Having comments on particular fields
> could also be useful, and there is one common ASCII table format in
> astronomy that supports this.
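
Presumably field-level comments would just hang off the field object, e.g.
something like this (content made up for illustration):

    {"name": "counts", "type": "integer",
     "comments": ["background-subtracted", "see caveat in the observation log"]}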
>
>
>>
>>
>>> - "history": optional, list of records indicating processing history
>>> (providing provenance of the data in the file).
>>>
>>
>> This is definitely interesting but my concern is what exactly its
>> interpretation would be. I note there is a proposal around adding a
>> "scripts" field:
>> https://github.com/dataprotocols/dataprotocols/issues/114
>>
>
> The interpretation would need to be driven by the associated profile.
>  Here "history" is necessarily vague, but has very wide usage in astronomy,
> particularly in pipeline process outputs.
>

ditto.
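
A sketch of what a profile-interpreted "history" list might look like, with
made-up entries shaped like typical pipeline log records:

    {
      "history": [
        "2007-01-04T17:46:24 destreak (CIAO 3.4): removed readout streaks",
        "2007-01-04T17:46:24 acis_process_events: applied gain corrections"
      ]
    }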

[snip]

>> Would definitely be possible to allow this as an optional enhancement.
>> An alternative would be something linked-data-y in terms of the type field -
>> see
>>
>
> Was there a link missing here?
>

https://github.com/dataprotocols/dataprotocols/issues/89


>
>>
>>>  - "dtype": optional, detailed data type which indicates a specific
>>> binary representation (e.g. float32, int8, uint8, float64).  This can be
>>> important in numerical applications and is required to round-trip a data
>>> structure to file and back with minimal information loss.
>>>
>>
>> What's the exact use case for dtype beyond the current type? How important
>> it (nowadays) to distinguish different types of floats or ints?
>>
>
> In numerical work it's quite important.  The basis of most science /
> engineering analysis is still C / C++ and Fortran libraries.  That's true
> of Python (numpy) as well, so if a user has declared a particular type like
> float32 or uint8 then it's frequently important to get back that type.  The
> goal is high-fidelity data storage and interchange.
>
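
A small numpy illustration of the round-trip point, purely for the record:

    import numpy as np

    # A column declared float32: a generic reader that only knows "number"
    # hands back float64, which doubles the memory and changes equality.
    col = np.array([0.1, 0.2, 0.3], dtype=np.float32)
    widened = col.astype(np.float64)             # lossless, but the type is gone
    print(widened == np.array([0.1, 0.2, 0.3]))  # [False False False]

    # Only because the original dtype was recorded can it be restored exactly:
    print(np.array_equal(widened.astype(np.float32), col))  # True

    # Same story for integers: uint8 (1 byte each) silently becomes int64.
    counts = np.array([0, 128, 255], dtype=np.uint8)
    print(counts.astype(np.int64).dtype)         # int64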

Hmmm, I hear you. I guess opening an issue on the dataprotocols tracker
listing the set of expanded field types you would want would be useful. It
would be wonderful if we could keep things as simple as just "integer" and
"number", but that may not be possible!

Rufus