[okfn-labs] Calendar fields (date/time)
Alioune Dia
dia.aliounes at gmail.com
Thu Mar 13 14:15:57 UTC 2014
Stefan
Maybe you are right :He should use standart iso format , But It is hard to
be sure that all tools wish need to deal with datapackages will deal with
iso , this is a big challenge to normalize things .
--Ad
2014-02-19 23:58 GMT+01:00 Stefan Urbanek <stefan.urbanek at gmail.com>:
> Hi there,
>
> I've started adding datapackage support to bubbles and I'm trying that on
> the datasets from https://github.com/datasets. I've noticed one issue
> with the data, their metadata and the data package specification.
>
> PROBLEM
>
> In the bond-yields-uk-10y dataset[1] there is:
>
> {
> "id": "Year",
> "type": "date",
> "format": "yyyy"
> },
>
> What is the problem? The field in the data file contains an integer. The
> metadata says it is "date".
>
>
> DESCRIPTION
>
> From parsing and validating point of view, the field metadata as specified
> above are correct. We (humans) know that it is date (in fact, part of a
> date). We also might have a validator (of one type) for the format.
>
> From data processing it is not easy to handle this case:
>
> 1. there is no standard way of specifying not-specific date (at higher
> granularity than a day). There are many ways of doing it, but that's
> another topic.
>
> 2. the metadata says that it is a date, but it can not be converted to a *valid
> system date datatype* - neither in Python nor in SQL and I doubt about
> other environments as well. Reason is the point number 1.
>
> 3. what format is the format? ... but that's another issue, I'm not going
> into more depth here.
>
> From machine process-ability it is just an integer or a string. As
> '2012-04' is just a string, not april. `2012` is not `January 1st 2012`,
> `2012-04` is not `April 1st 2012`. There are datasets with mixed date
> granularity: month level and daily levels for example - all in the same
> dataset, with same fields. If we just converted every date-like string to a
> full date, we would not be able to distinguish "whole year 2012" from
> "April 2012" from "April 1st 2012".
>
> Example practice from data warehouses: the quite common way of modelling
> date dimension is to use integer to specify a date: `YYYYMMDD`. If the
> `MMDD` part is 0000 as in "20140000" then the value corresponds either to
> whole year or unspecified part of the year (we can't tell without more
> information). Similar for `DD` == `00`. The values for "unspecified" might
> also be `9999` instead of `0000` for sorting purposes, as in "20149999".
>
> There might be other formats as well, such as adding one more digit
> denoting the distinction between "whole" and "unspecified" or adding hourly
> or 1/2 day granularity such as `2012040101` for "morning of April 1st 2014".
>
> It is not as simple as stating "date format" when dealing with
> heterogenous datasources with different granularities and reporting needs...
>
>
> PROPOSAL
>
> Here is a proposal for specification and for the data package creators.
>
> Specification should say that all `date`, `time` and `datetime` fields
> should conform to an ISO standard or some other well defined standard.
> Current spec says:
>
> * *date*: a date. This MUST be in ISO6801 format YYYY-MM-DD or, if not, a
> format field must be provided describing the structure.
> * *time*: a time without a date
> * *datetime*: a date-time. This MUST be in ISO 8601 format of
> YYYY-MM-DDThh:mm:ssZ in UTC time or, if not, a format field must be
> provided.
>
> I would highly recommend to drop the *"... if not, a format field must be
> provided describing the structure."* as it will cause many automated
> processing complications and will result in serious data quality issues.
>
> Information that a field contains a year or any other combination of
> calendar units should be in some kind of analytical metadata. We don't have
> to define them yet, as they might be very case-specific.
>
>
> CONCLUSION
>
> * "type" in the specification should denote machine-processable *data
> type with clear representation* in existing systems (languages,
> databases, ...)
> * there are many ways of representing dates
> * there is no common way of representing incomplete/higher granularity
> dates
> * metadata can have multiple levels (ETL, analytical processing, human, ...)
>
> [1]
> https://github.com/datasets/bond-yields-uk-10y/blob/master/datapackage.json#L20
>
> What do you think?
>
> Cheers,
>
> Stefan Urbanek
>
> p.s.: This is not a rant, it is just recommendation from experience and
> from potential complexity of conformation to the spec.
>
> *Twitter:* @Stiivi
> *Personal:* stiivi.com
> * Data Brewery:* databrewery.org
>
>
> _______________________________________________
> okfn-labs mailing list
> okfn-labs at lists.okfn.org
> https://lists.okfn.org/mailman/listinfo/okfn-labs
> Unsubscribe: https://lists.okfn.org/mailman/options/okfn-labs
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.okfn.org/pipermail/okfn-labs/attachments/20140313/90f7f9c7/attachment-0003.html>
More information about the okfn-labs
mailing list