[okfn-labs] Calendar fields (date/time)

Stefan Urbanek stefan.urbanek at gmail.com
Thu Mar 13 16:43:16 UTC 2014


Many datasets need treatment (ETL) anyway in their early maturity stages. Cleaning up values is one of the first steps, and the problem will hit only those producing very raw data. I would say that it is a basic data-quality hygiene requirement rather than a big challenge.

It does not have to be the ISO format, but it has to be one format to be accepted as a valid date format. There MIGHT be tools that MIGHT handle conversion from anything to anything else; in that case the date might be provided as any kind of string, integer, you name it. Those are mostly cleaning tools or the data-preparation components of larger tools. However, tools DON’T HAVE to do this, and that should be stated explicitly so as not to create false expectations.
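
To make "one format" concrete, here is a minimal Python sketch (the function name is mine, it is not part of any spec): a value is accepted only if it parses under the single declared format; everything else is rejected, and a tool does not have to guess:

    from datetime import datetime

    def parse_date_strict(value):
        """Accept only the one declared format (here ISO 8601 YYYY-MM-DD)."""
        return datetime.strptime(value, "%Y-%m-%d").date()

    parse_date_strict("2014-03-13")   # -> datetime.date(2014, 3, 13)
    parse_date_strict("13/03/2014")   # -> raises ValueError; a tool is
                                      #    not obliged to guess this format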

My first concern is: making the standard too broad, with too many options, will result in an unimplementable standard: no tool will be able to conform to it fully.

The approach should be (not only regarding the date format):

1. Allow users to dump anything they have, in any form/format they have, because it is good to have a lot of it and you don’t want to get in the data provider’s way.
2. Treat all source data as strings from the very beginning, with no other expectations about the content.
3. The standard describes metadata, and that description should be unambiguous, implementable, and easy to handle.
4. Assign a field data type only if all values fully conform to the standard. Otherwise keep the field as a string and hope that someone downstream in the data processing and analysis will handle the case (see the sketch below).
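
A rough sketch of step 4 in Python (a hypothetical helper of my own, not from any existing tool): the type is assigned only if every single value parses under the declared format, otherwise the field stays a plain string:

    from datetime import datetime

    def infer_field_type(values, declared_format="%Y-%m-%d"):
        """Assign the 'date' type only if *every* value conforms;
        otherwise keep the field as plain strings (step 4 above)."""
        try:
            for value in values:
                datetime.strptime(value, declared_format)
        except (ValueError, TypeError):
            return "string"  # leave it to someone downstream
        return "date"

    infer_field_type(["2014-01-01", "2014-02-01"])  # -> 'date'
    infer_field_type(["2014-01-01", "1984"])        # -> 'string'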

My second concern is: too much freedom will hurt data quality.

Parsing values according to their declared data type is one of the very first data quality verifications. If this is difficult to implement, or implementations stay incomplete (due to the complexity of the standard), then we might introduce more errors into the data than we started with. We might even amplify existing data issues instead of identifying and, ideally, correcting them.
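
To illustrate the amplification problem, consider a lenient parser that guesses among several formats (a made-up example of my own, not any particular library):

    from datetime import datetime

    GUESSED_FORMATS = ["%Y-%m-%d", "%d-%m-%Y", "%m-%d-%Y"]

    def parse_lenient(value):
        """Try several formats and return the first one that matches."""
        for fmt in GUESSED_FORMATS:
            try:
                return datetime.strptime(value, fmt)
            except ValueError:
                pass
        return value  # silently kept as a string, no error reported

    # "04-01-2012" matches both %d-%m-%Y and %m-%d-%Y; the first guess
    # wins and a possibly wrong date enters the data without any warning.
    parse_lenient("04-01-2012")  # -> 2012-01-04... or was it April 1st?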

As much as we would love to, we can’t avoid data cleaning. We have to find a balance between our natural impatience (measured as the time between raw data and animated bubble charts) and the quality of the output.

That’s my humble opinion based on my experience with data. Maybe I am just too worried :-)

Stefan

Twitter: @Stiivi
Personal: stiivi.com
Data Brewery: databrewery.org

On 13 Mar 2014, at 15:15, Alioune Dia <dia.aliounes at gmail.com> wrote:

> Stefan
> Maybe you are right: we should use the standard ISO format. But it is hard to be sure that all tools which need to deal with data packages will handle ISO; normalizing things is a big challenge.
> --Ad
> 
> 
> 2014-02-19 23:58 GMT+01:00 Stefan Urbanek <stefan.urbanek at gmail.com>:
> Hi there,
> 
> I’ve started adding datapackage support to bubbles and I’m trying it out on the datasets from https://github.com/datasets. I’ve noticed one issue that involves the data, their metadata, and the data package specification.
> 
> PROBLEM
> 
> In the bond-yields-uk-10y dataset[1] there is:
> 
>           {
>             "id": "Year",
>             "type": "date",
>             "format": "yyyy"
>           },
> 
> What is the problem? The field in the data file contains an integer. The metadata says it is “date”.
> 
> 
> DESCRIPTION
> 
> From a parsing and validation point of view, the field metadata specified above is correct. We (humans) know that it is a date (in fact, part of a date). We might also have a validator (of one kind) for the format.
> 
> From a data processing point of view, this case is not easy to handle:
> 
> 1. there is no standard way of specifying a non-specific date (at a coarser granularity than a day). There are many ways of doing it, but that's another topic.
> 
> 2. the metadata says that it is a date, but it cannot be converted to a valid system date datatype, neither in Python nor in SQL, and I doubt other environments do better. The reason is point number 1.
> 
> 3. what format is the format? ... but that's another issue; I'm not going into more depth here.
> 
> From a machine-processability point of view, it is just an integer or a string. '2012-04' is just a string, not April. `2012` is not `January 1st 2012`, and `2012-04` is not `April 1st 2012`. There are datasets with mixed date granularity, month level and day level for example, all in the same dataset, in the same fields. If we just converted every date-like string to a full date, we would not be able to distinguish "whole year 2012" from "April 2012" from "April 1st 2012".
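> 
> To make this concrete, here is what Python's standard parser does with the three granularities (just an illustration):
> 
>     from datetime import datetime
> 
>     datetime.strptime("2012", "%Y")              # 2012-01-01 00:00:00
>     datetime.strptime("2012-04", "%Y-%m")        # 2012-04-01 00:00:00
>     datetime.strptime("2012-04-01", "%Y-%m-%d")  # 2012-04-01 00:00:00
>     # Once parsed, "April 2012" and "April 1st 2012" are identical;
>     # the granularity information is gone.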
> 
> An example practice from data warehouses: a quite common way of modelling a date dimension is to use an integer of the form `YYYYMMDD` to specify a date. If the `MMDD` part is `0000`, as in `20140000`, then the value corresponds either to the whole year or to an unspecified part of the year (we can’t tell without more information). Similarly when `DD` == `00`. The value for "unspecified" might also be `9999` instead of `0000` for sorting purposes, as in `20149999`.
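> 
> A sketch of decoding such a key (a hypothetical helper of my own, assuming the `0000` convention described above; conventions differ between warehouses):
> 
>     def decode_date_key(key):
>         """Split a YYYYMMDD integer key; 00 parts mean 'unspecified'."""
>         year, rest = divmod(key, 10000)
>         month, day = divmod(rest, 100)
>         if month == 0:
>             return (year, None, None)   # whole or unspecified year
>         if day == 0:
>             return (year, month, None)  # whole or unspecified month
>         return (year, month, day)
> 
>     decode_date_key(20140000)  # -> (2014, None, None)
>     decode_date_key(20140401)  # -> (2014, 4, 1)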
> 
> There might be other formats as well, such as adding one more digit to distinguish "whole" from "unspecified", or adding hourly or half-day granularity, such as `2012040101` for "the morning of April 1st 2012".
> 
> It is not as simple as stating a "date format" when dealing with heterogeneous data sources with different granularities and reporting needs...
> 
> 
> PROPOSAL
> 
> Here is a proposal for specification and for the data package creators.
> 
> Specification should say that all `date`, `time` and `datetime` fields should conform to an ISO standard or some other well defined standard. Current spec says:
> 
> • date: a date. This MUST be in ISO6801 format YYYY-MM-DD or, if not, a format field must be provided describing the structure.
> • time: a time without a date
> • datetime: a date-time. This MUST be in ISO 8601 format of YYYY-MM-DDThh:mm:ssZ in UTC time or, if not, a format field must be provided.
> 
> I would highly recommend to drop the "... if not, a format field must be provided describing the structure." as it will cause many automated processing complications and will result in serious data quality issues.
> 
> Information that a field contains a year or any other combination of calendar units should be in some kind of analytical metadata. We don't have to define them yet, as they might be very case-specific.
> 
> 
> CONCLUSION
> 
> * “type” in the specification should denote machine-processable data type with clear representation in existing systems (languages, databases, …)
> * there are many ways of representing dates
> * there is no common way of representing incomplete/higher granularity dates
> * metadata can have multiple levels (ETL, analytical processing, human, …)
> 
> [1] https://github.com/datasets/bond-yields-uk-10y/blob/master/datapackage.json#L20
> 
> What do you think?
> 
> Cheers,
> 
> Stefan Urbanek
> 
> p.s.: This is not a rant, it is just recommendation from experience and from potential complexity of conformation to the spec.
> 
> Twitter: @Stiivi
> Personal: stiivi.com
> Data Brewery: databrewery.org
> 
> 
