[okfn-labs] Calendar fields (date/time)
Stefan Urbanek
stefan.urbanek at gmail.com
Wed Feb 19 22:58:03 UTC 2014
Hi there,
I’ve started adding datapackage support to bubbles and I’m trying that on the datasets from https://github.com/datasets. I’ve noticed one issue with the data, their metadata and the data package specification.
PROBLEM
In the bond-yields-uk-10y dataset[1] there is:
{
"id": "Year",
"type": "date",
"format": "yyyy"
},
What is the problem? The field in the data file contains an integer. The metadata says it is “date”.
DESCRIPTION
From parsing and validating point of view, the field metadata as specified above are correct. We (humans) know that it is date (in fact, part of a date). We also might have a validator (of one type) for the format.
From data processing it is not easy to handle this case:
1. there is no standard way of specifying not-specific date (at higher granularity than a day). There are many ways of doing it, but that's another topic.
2. the metadata says that it is a date, but it can not be converted to a valid system date datatype – neither in Python nor in SQL and I doubt about other environments as well. Reason is the point number 1.
3. what format is the format? ... but that's another issue, I'm not going into more depth here.
From machine process-ability it is just an integer or a string. As '2012-04' is just a string, not april. `2012` is not `January 1st 2012`, `2012-04` is not `April 1st 2012`. There are datasets with mixed date granularity: month level and daily levels for example – all in the same dataset, with same fields. If we just converted every date-like string to a full date, we would not be able to distinguish "whole year 2012" from "April 2012" from "April 1st 2012".
Example practice from data warehouses: the quite common way of modelling date dimension is to use integer to specify a date: `YYYYMMDD`. If the `MMDD` part is 0000 as in “20140000" then the value corresponds either to whole year or unspecified part of the year (we can’t tell without more information). Similar for `DD` == `00`. The values for "unspecified" might also be `9999` instead of `0000` for sorting purposes, as in “20149999”.
There might be other formats as well, such as adding one more digit denoting the distinction between "whole" and "unspecified" or adding hourly or ½ day granularity such as `2012040101` for "morning of April 1st 2014".
It is not as simple as stating "date format" when dealing with heterogenous datasources with different granularities and reporting needs...
PROPOSAL
Here is a proposal for specification and for the data package creators.
Specification should say that all `date`, `time` and `datetime` fields should conform to an ISO standard or some other well defined standard. Current spec says:
• date: a date. This MUST be in ISO6801 format YYYY-MM-DD or, if not, a format field must be provided describing the structure.
• time: a time without a date
• datetime: a date-time. This MUST be in ISO 8601 format of YYYY-MM-DDThh:mm:ssZ in UTC time or, if not, a format field must be provided.
I would highly recommend to drop the "... if not, a format field must be provided describing the structure." as it will cause many automated processing complications and will result in serious data quality issues.
Information that a field contains a year or any other combination of calendar units should be in some kind of analytical metadata. We don't have to define them yet, as they might be very case-specific.
CONCLUSION
* “type” in the specification should denote machine-processable data type with clear representation in existing systems (languages, databases, …)
* there are many ways of representing dates
* there is no common way of representing incomplete/higher granularity dates
* metadata can have multiple levels (ETL, analytical processing, human, …)
[1] https://github.com/datasets/bond-yields-uk-10y/blob/master/datapackage.json#L20
What do you think?
Cheers,
Stefan Urbanek
p.s.: This is not a rant, it is just recommendation from experience and from potential complexity of conformation to the spec.
Twitter: @Stiivi
Personal: stiivi.com
Data Brewery: databrewery.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.okfn.org/pipermail/okfn-labs/attachments/20140219/381690bd/attachment-0003.html>
More information about the okfn-labs
mailing list