[okfn-labs] Calendar fields (date/time)
Stefan Urbanek
stefan.urbanek at gmail.com
Sun Feb 23 09:21:16 UTC 2014
On 21 Feb 2014, at 17:32, Rufus Pollock <rufus.pollock at okfn.org> wrote:
> More great feedback questions :-) - it's really appreciated so please keep it coming.
>
> Also thanks to Michael's and Alioune's comments too!
>
> On 19 February 2014 22:58, Stefan Urbanek <stefan.urbanek at gmail.com> wrote:
> [snip]
>
[snip]
> I like this proposal a lot but there are certain questions I have:
>
> - What do data packagers do with time series data they get that is just a year (e.g. as in your example of bond yields). Personally I've found myself more and more just converting these to actual dates e.g. 2004 => 2004-01-01, 2004 Q3 => 2004-10-01 etc. This means you get a real date but at the cost of adding some precision that wasn't really there.
>
Why not the last day of the year, because the yearly ETL runs on January 1st at 3 a.m.? Why not the 1st of April, because it represents a fiscal year that is not aligned with the calendar year? Why not the middle of the month? What about "2004-03" - is that March? Is that the third week? The beginning of the third week, or the last day of that week? The latter is more common. You see my point.
The year → date conversion is very context-specific and should not be assumed at any point in the data management process. What if you want to merge datasets that used different kinds of conversions?
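Just to illustrate how arbitrary the choice is, a minimal sketch in plain Python (the three conventions below are made-up examples, not proposals):

    # The same "2004" converted to a date under three equally plausible,
    # mutually incompatible conventions.
    from datetime import date

    year = 2004
    calendar_start = date(year, 1, 1)    # first day of the calendar year
    calendar_end   = date(year, 12, 31)  # last day of the calendar year
    fiscal_start   = date(year, 4, 1)    # fiscal year starting in April

    # Merging datasets that used different conventions silently shifts
    # values by months.
    print(calendar_start, calendar_end, fiscal_start)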
I think we are trying to skip several steps in the process here, but I am afraid that it is not trivially possible.
"What the data pacakgers do”? They should provide a piece of another metadata with the information, how that “2004 integer” or “2004-01 string” should be interpreted. It will be up to the ETL and visualisation tool to recognise and use that metadata.
Worst-case scenario with a year: just plot it as an integer on the X axis. The month scenario is a bit worse, as there are multiple ways of representing a month. In that case you cannot avoid a conversion, but you cannot expect tools (ETL/viz/analytical) to apply that conversion unless it is a well-defined standard.
Even if you do the ETL with some kind of conversion, for the sake of data provenance you should include the conversion metadata in the output. Otherwise you might spend quite a lot of time finding out why your results do not match reality/expectations.
Date metadata modelling is a bit more complicated than it seems.
Since we don't have control over some kind of "Open Data Date Dimension", we can't put any date/time assumptions into the standard. The moment we put an assumption in there, the next dataset we process will have different needs, just to prove us wrong.
Since date/time is such a frequently used kind of data, one possible solution might be to add metadata describing the partial date/time content of a field, for example a "calendar_unit" or "time_granularity" attribute. A date/time format string, on the other hand, would require tool creators to write their own parsers, since the standard ones assume a full date and do not provide tokenized output of the format.
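To make it a bit more concrete, a rough sketch of what such a field descriptor might look like, written here as a Python dict (the "calendar_unit" attribute is purely hypothetical, it is not part of JSON Table Schema):

    # Hypothetical field descriptors; "calendar_unit" is an illustration
    # only, not an existing JSON Table Schema attribute.
    year_field = {
        "name": "year",
        "type": "integer",
        "calendar_unit": "year",    # value is a calendar year, not a full date
    }

    month_field = {
        "name": "period",
        "type": "string",           # e.g. "2004-03"
        "calendar_unit": "month",
    }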
Example from Cubes: we introduced a more generic piece of metadata called "role", which currently can be "time"; "geo" is planned as well. It tells the front-end tools that a dimension is a time dimension and that, by default, the level attributes represent the calendar units named by the levels (this can be overridden by explicitly specifying a level's role). Again, it is up to the tool to use that metadata or ignore it.
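Roughly, the dimension metadata looks like this (sketched from memory as a Python dict, so treat the exact keys as approximate rather than canonical Cubes model syntax):

    # Sketch of a Cubes dimension using the "role" metadata described above.
    time_dimension = {
        "name": "date",
        "role": "time",              # tells front-end tools this is a time dimension
        "levels": [
            {"name": "year"},        # level name implies the calendar unit...
            {"name": "quarter"},
            {"name": "month"},
            # ...unless a level states its role explicitly, e.g.:
            # {"name": "fiscal_year", "role": "year"},
        ],
    }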
> - the date type did have some use in, for example, affecting how the graph and grid views work. For example, setting a field to date means that the graphing library would interpret that field as a date. This is somewhat minor and isn't a problem if you have already converted "2004" to "2004-01-01"
>
Additional metadata, and interpretation of that metadata on the tool's side, can do the trick.
> So, to summarize: I agree with making the change you propose re types - are you happy to open an appropriate issue re JSON Table Schema on https://github.com/dataprotocols/dataprotocols/issues
Sure.
>
>> Information that a field contains a year or any other combination of calendar units should be in some kind of analytical metadata. We don't have to define them yet, as they might be very case-specific.
>
> I think string plus pattern/format attribute should be ok for this. I also note we have a units spec http://dataprotocols.org/units/ (not much used yet and not formally integrated into JSON Table Schema but it could be)
>
Format is very ambiguous. Is it the format in which the data are stored? Is it the format in which the data should be displayed? I can have a number field and set the format to display 12345.67 as "12 345,67"; the reverse might be the case as well.

Another problem with format is that the format string needs to be parsed, so many custom parsers would have to be created. Just think of how many date/time formatting languages there are... Are we going to define one in the data package standard? What are we going to base it on? C/Python functions ("%Y-%m-%d")? MS Excel/SQL ("yyyy-mm-dd")? What about the others? What about units that can be formatted in the output (week number) but are not parseable by some systems (some SQL dialects)? People might just dump whatever format they used in their ETL tool into the "format" field, and it can be in any of these languages. That greatly increases the complexity and the requirements on tool writers, which I don't think is worth the cost.
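A small illustration of the parsing side, in Python (the Excel-style string is mentioned only for comparison; strptime would not accept it):

    # Parsing a partial date with a C/Python-style format string: the missing
    # day is silently filled in with 1, so precision is added that was never
    # in the data.
    from datetime import datetime

    value = "2004-03"
    parsed = datetime.strptime(value, "%Y-%m")
    print(parsed)   # 2004-03-01 00:00:00

    # The "same" format in Excel/SQL style would be written "yyyy-mm", which
    # strptime does not understand. Each tool family has its own format
    # language, so a bare "format" string is not self-describing.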
I think this kind of metadata should be in a separate field and served in a verbose, explicit form that will not require parsing in most cases. We might start with just time-specific metadata and later add more generic field-role metadata, if that turns out to be desirable.
This is one of the reasons why I've suggested "Encapsulated Extensibility" in #103 [1]: we can incubate metadata in a separate structure and then move it into the top-level standard once it is stable/evolved/well-definable.
[1] https://github.com/dataprotocols/dataprotocols/issues/103
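Just as an illustration of the idea (all key names below are hypothetical, nothing here is proposed syntax):

    # Sketch of "encapsulated extensibility": experimental metadata lives in
    # its own namespace until it stabilises, then moves to the top level.
    field = {
        "name": "period",
        "type": "string",
        "x-analytical": {                # incubating, non-standard namespace
            "calendar_unit": "month",
        },
    }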
It might seem like a complication, as we suddenly need to do a bit more than we are used to. But if this problem is not handled well, it may cause data quality issues. Proper handling at this point might simply mean recognising the problem and mentioning it explicitly, so that others are aware of it.
Stefan
> Rufus
>
> [snip]
>
>