[data-protocols] Extending the JSON Table Schema for scientific applications
Rufus Pollock
rufus.pollock at okfn.org
Wed Apr 16 14:51:40 UTC 2014
On 16 April 2014 00:47, Aldcroft, Tom <taldcroft at gmail.com> wrote:
>
>
> On Tue, Apr 15, 2014 at 12:13 PM, Rufus Pollock <rufus.pollock at okfn.org>wrote:
>
>> First off: welcome and great to hear from you!
>>
>> On 15 April 2014 14:25, Tom Aldcroft <taldcroft at gmail.com> wrote:
>>
>>> Hi -
>>>
>>> I am working on a standard for text data tables for use in science
>>> applications, in particular in the context of the Python astropy package (
>>> http://astropy.org). This includes support for reading and writing
>>> ASCII tables in various formats that are common in astronomy (
>>> http://astropy.readthedocs.org/en/latest/io/ascii/index.html).
>>>
>>> The draft proposal I submitted baselines an approach that is very
>>> similar to the Tabular Data Package standard in data-protocols. After
>>> discussion we are very interested in adopting the JSON Table Schema from
>>> Data Protocols. See:
>>>
>>> https://github.com/taldcroft/astropy-APEs/blob/ape6/APE6.rst
>>> https://github.com/astropy/astropy-APEs/pull/7
>>>
>>> The question I have is to what extent your organization would be
>>> interested in extending the JSON Table Schema standard to include more
>>> optional elements that would be common in science applications. As a rough
>>> outline, we would like to see:
>>>
>>
>> Just to be clear: are you talking about JSON Table Schema or Tabular Data
>> Package? As you know, Tabular Data Package is basically "Data Package" +
>> JSON Table Schema (for describing the CSVs) + CSV (for the data).
>>
>
> The initial focus is on the JSON Table Schema. We are considering a
> variant on the Tabular Data Package that embeds the Table Schema header
> into the CSV data file using comment # characters. This is being debated.
> There are good reasons not to create this non-standard CSV, but there is
> also an advantage in having one file per dataset. In science and engineering
> it is quite common to embed comments into CSV files, and most science-based
> parsers support this.
>
The original version of Tabular Data Package did have a serialization into
the CSV itself, but my feeling is that the costs definitely outweigh the
benefits: you "break" the data file for everyone who is not using the
metadata, you can't use other standard tools on the CSV without hassle, and
metadata and data can and should be kept separate.
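Just so we're all picturing the same thing, the embedded-header variant you
describe would presumably look something like this (a hypothetical sketch
with invented column names, not anything in either spec):

# {"fields": [
#    {"name": "time", "type": "number"},
#    {"name": "counts", "type": "integer"}
# ]}
time,counts
53398.31,12
53398.32,15

i.e. the JSON Table Schema serialized into # comment lines above the CSV
data, which is exactly the part that generic CSV tooling would choke on.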
>>> At the top level:
>>>
>>> - "schema-name": optional, name (or URL) of detailed schema which allows
>>> for interpretation and validation of the table and header content
>>>
>>
>> Our instinct at the moment is to support this via "profiles":
>> https://github.com/dataprotocols/dataprotocols/issues/87
>>
>> We're still a little vague about exactly how this would work, but the idea
>> is that you'd register a profile name and then it would be up to tools to
>> do something if they recognize a given profile (so in your case you could
>> have a profile "astropy" or similar and then your tools would recognize
>> that and do some extra validation).
>>
>
> That would work.
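For concreteness, we're imagining something like the following in the schema
(just a sketch; the key name and mechanics are exactly what's being discussed
in issue 87, so don't hold me to it):

{
  "profile": "astropy",
  "fields": [...]
}

A single declared name that tools can recognize and key extra validation off.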
>
>
>>
>>
>>> - "keywords": optional, list of keyword structures which includes
>>> {"name" (req'd), "value" (req'd), "unit" (optional), "description"
>>> (optional)
>>>
>>
>> Not sure I get this. Would this be at the JSON Table Schema or the Data
>> Package level?
>>
>
> JSON Table Schema. Maybe an example would help. Here is some of the
> metadata associated with a photon event table for a Chandra X-ray
> Observatory observation:
>
> -- COMMENT / Configuration control block --------------------
> -- COMMENT This FITS file may contain long string keyword values that are
> -- COMMENT continued over multiple keywords. The HEASARC convention uses the &
> -- COMMENT character at the end of each substring which is then continued
> -- COMMENT on the next keyword which has the name CONTINUE.
> 0001 ASOLFILE pcadf223285894N002_asol1.fits String
> 0002 THRFILE acisD1996-11-01evtspltN0002.fits String
> 0003 ORIGIN ASC String Source of FITS file
> 0004 CREATOR destreak - CIAO 3.4 String tool that created this output
> 0005 ASCDSVER CIAO 3.4 String ASCDS version number
> 0006 MJD_OBS 53398.3176048140 Real8 Modified Julian date of observation
> 0007 DS_IDENT ADS/Sa.CXO#obs/05759 String dataset identifier
> 0008 TLMVER P008 String Telemetry revision number (IP&CL)
> 0009 REVISION 2 Int4 Processing version of data
> 0010 CHECKSUM 3MOF5LMC3LMC3LMC String HDU checksum updated 2007-01-04T17:46:24
> 0011 DATASUM 3082206520 String data unit checksum updated 2007-01-04T17:46:24
>
> The details are irrelevant but this shows the use of "keyword" structures
> (e.g. "name": "ASOLFILE", "value": "pcadf223285894N002_asol1.fits", etc).
> It also shows "comments". In general it would be possible to replace some
> of the comments with a reference to a profile, but not entirely.
>
Makes sense.
For me none of this would go in JSON Table Schema but would go one level up
on the resource object in the (tabular) data package. That's partly why I
asked about JSON Table Schema vs Data Package above.
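Concretely, I'd picture something like the following in datapackage.json
(a sketch only: "keywords" and "comments" are your proposed names, not part
of the spec today, and the package name is made up):

{
  "name": "chandra-obs",
  "resources": [
    {
      "path": "events.csv",
      "keywords": [
        {"name": "ASOLFILE", "value": "pcadf223285894N002_asol1.fits"},
        {"name": "MJD_OBS", "value": 53398.3176048140,
         "description": "Modified Julian date of observation"}
      ],
      "comments": ["This FITS file may contain long string keyword values ..."],
      "schema": {"fields": [...]}
    }
  ]
}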
>
>
>>
>>> - "comments": optional, list of general string comments
>>>
>>
>> Again, at what level is this wanted and what's the planned usage? Is this
>> comments on particular data fields or columns?
>>
>
> Comments on the entire data table. Having comments on particular fields
> could also be useful, and there is one common ASCII table format in
> astronomy that supports this.
>
>
>>
>>
>>> - "history": optional, list of records indicating processing history
>>> (providing provenance of the data in the file).
>>>
>>
>> This is definitely interesting but my concern is what exactly its
>> interpretation would be. I note there is a proposal around adding a
>> "scripts" field:
>> https://github.com/dataprotocols/dataprotocols/issues/114
>>
>
> The interpretation would need to be driven by the associated profile.
> Here "history" is necessarily vague, but has very wide usage in astronomy,
> particularly in pipeline process outputs.
>
ditto.
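(Sketching it the same way, that would be something like the following on the
resource, with the record structure left to the profile:

"history": [
  "destreak - CIAO 3.4",
  "HDU checksum updated 2007-01-04T17:46:24"
]

where the strings are just lifted from your example above.)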
[snip]
Would definitely be possible to allow this as an optional enhancement.
>> Alternative would be something linked-data-y in terms of the type field -
>> see
>>
>
> Was there a link missing here?
>
https://github.com/dataprotocols/dataprotocols/issues/89
>
>>
>>> - "dtype": optional, detailed data type which indicates a specific
>>> binary representation (e.g. float32, int8, uint8, float64). This can be
>>> important in numerical applications and is required to round-trip a data
>>> structure to file and back with minimal information loss.
>>>
>>
>> What's the exact use case for dtype beyond the current type? How important
>> is it (nowadays) to distinguish different types of floats or ints?
>>
>
> In numerical work it's quite important. The basis of most science /
> engineering analysis is still C / C++ and Fortran libraries. That's true
> of Python (numpy) as well, so if a user has declared a particular type like
> float32 or uint8 then it's frequently important to get back that type. The
> goal is high-fidelity data storage and interchange.
>
Hmmm, I hear you. I guess opening an issue on the dataprotocols tracker
listing the set of expanded field types you would want would be useful. It
would be wonderful if we could just stick with integer and number, but that
may not be possible!
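To make that concrete, I'd picture something like this in the issue
(illustrative only; none of these dtype values exist in the spec today):

{"name": "counts", "type": "integer", "dtype": "uint8"},
{"name": "rate", "type": "number", "dtype": "float32"}

so that "type" stays coarse and portable while "dtype" carries the exact
binary representation needed for round-tripping.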
Rufus