[ckan-dev] Offering the same resource in multiple file formats

Mon Mar 27 14:48:06 UTC 2017

Hi Florian

I can see we come from different backgrounds and have different expectations.

First off, my database reference was meant only as an analogy, not as a reference to actual data usage in CKAN.  In a relational database one tries to "normalise" the data so it's kept in one-and-only-one place; different views can then combine and join data to create the final dataset.

But my point remains: keeping multiple, derived versions of the same dataset is going to be problematic from a management point of view.  There is a good reason why, in legal terms, there is just one original of a contract. As soon as you have three or more, then questions can arise as to what the "real" one is.  I am not sure of CKAN's capabilities in terms of treating a set of datasets as exactly the same thing (and managing this set via a single metadata record, for instance); if it can't already do so, then saving these new copies (not versions) is going to be very hard to manage.

You say that creating different formats programmatically cannot be done through web requests.  I think it can.  It obviously depends on the size of the dataset; a 10TB file is not going to be transformable in a short space of time but a 1-10MB file can be.  It also depends on your client/user expectations - and these can be managed.  It's a choice for them - 'do you want this format right now?' or 'do you want to wait/come back in a few minutes for this other format?'
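
To make that choice concrete, here is a minimal sketch in plain Python. It is not CKAN code: the 10MB cut-off and the names `choose_strategy` and `csv_to_json` are assumptions made up for this illustration. Small files are converted within the request, larger ones are flagged for a later job.

```python
import csv
import io
import json

# Hypothetical cut-off: files up to this size are converted during the request.
SYNC_LIMIT_BYTES = 10 * 1024 * 1024


def choose_strategy(size_bytes, limit=SYNC_LIMIT_BYTES):
    """Return 'now' for small files and 'later' for files that should be queued."""
    return "now" if size_bytes <= limit else "later"


def csv_to_json(csv_text):
    """Convert CSV text with a header row into a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows)
```

For a request answered with 'later', the user would be told to come back in a few minutes while a background job does the conversion.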

Not sure that this will help you; but if my boss asked me about this problem, the above is the route I would suggest taking.

Derek

>>> <Florian.Brucker at it.karlsruhe.de> 03/27/17 1:53 PM >>>
Hi Derek, 

thanks for your input! 

> Speaking with my 'relational database' hat on, I would think that 
 > keeping multiple copies of the same dataset is very problematic. 

I don't see a problem with multiple "copies" in different formats per se, but I do think that these copies need to be managed well. That's why I'm currently looking for ideas and experiences ;) 

 Regarding relational databases: not all of our data is relational, so I'd prefer a solution that doesn't make too many assumptions about the data. 

 > Given how powerful the tools in the Python data science libraries 
 > are, I would argue for keeping one version (preferably the original)
 > of the source data, and putting tools or data processing chains into
 > place (these can be run in the background; for example, as Celery 
 > processes) that allow on-the-fly conversion to formats such as Excel
 > or JSON. 

The problem is that background tasks won't work for answering web requests. Hence I either need to do the conversion at request time in the foreground or do it before the actual request (this can be done in the background). Doing it in the foreground only works if the conversion is quick and doesn't require many resources, so this depends very much on the size of the data and the format it is stored in.
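
The two options (convert ahead of time, or convert in the foreground on demand) can share one cache. The following is only a sketch under assumptions: `ConversionCache` is a hypothetical in-memory class, not part of CKAN, and the converters are plain functions from bytes to bytes.

```python
import hashlib


class ConversionCache:
    """In-memory cache of converted files, keyed by source content and format."""

    def __init__(self, converters):
        # converters maps a format name to a function bytes -> bytes
        self._converters = converters
        self._store = {}

    def _key(self, source, fmt):
        return (hashlib.sha256(source).hexdigest(), fmt)

    def pregenerate(self, source):
        """Run every converter once, e.g. from a background job after upload."""
        for fmt, convert in self._converters.items():
            self._store[self._key(source, fmt)] = convert(source)

    def get(self, source, fmt):
        """Return the cached conversion, converting in the foreground on a miss."""
        key = self._key(source, fmt)
        if key not in self._store:
            self._store[key] = self._converters[fmt](source)
        return self._store[key]
```

Because the key includes a hash of the source, a modified original never serves a stale copy. A real setup would more likely key on a resource id plus modification time to avoid reading the whole file on each request.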

> A longer term option would be to renegotiate with the upstream data 
 > provider to supply data in the format that most users seem to be 
 > asking for (or perhaps one that is most amenable for many 
 > transformation options). 

In our case the problem is not so much the data we get from upstream but the fact that different use cases of the same data often require different formats. For example, if you want to "manually" explore tabular data then Excel is a good choice, but for anything automated CSV is much nicer. Unfortunately Excel doesn't like the usual CSV format, so either we or some of the users will need to do some conversion. In such cases I'd love to make it as easy as possible for the users since there's no point in publishing data if it's difficult to use ;) 
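
As an example of such a conversion, here is a hedged sketch of an "Excel-friendly" CSV variant. It assumes the trouble is encoding and delimiter (Excel generally detects UTF-8 only when a byte order mark is present, and the expected delimiter depends on the locale); the thread does not say which problems actually occur. The function name `to_excel_friendly_csv` and the semicolon default are made up for this illustration.

```python
import csv
import io


def to_excel_friendly_csv(csv_text, delimiter=";"):
    """Re-encode CSV text as UTF-8 with a BOM, using the given delimiter."""
    rows = csv.reader(io.StringIO(csv_text))
    out = io.StringIO(newline="")
    csv.writer(out, delimiter=delimiter).writerows(rows)
    # The "utf-8-sig" codec prepends the byte order mark.
    return out.getvalue().encode("utf-8-sig")
```

This only covers the CSV side; producing real XLSX files would need a third-party library.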

Regards, 
Florian 

"Derek Hohls" <dhohls at csir.co.za> wrote on 27.03.2017 12:30:23:

 > From: "Derek Hohls" <dhohls at csir.co.za> 
 > To: <Florian.Brucker at it.karlsruhe.de>, <ckan-dev at lists.okfn.org> 
 > Date: 27.03.2017 12:51 
 > Subject: Re: [ckan-dev] Offering the same resource in multiple file formats 
> 
 > Hi Florian
 > 
 > I am speaking somewhat "from the side" - our group is involved in a
 > CKAN implementation but I am not at the core level. Nonetheless,  a 
 > large part of my work does deal with data ingestion and processing. 
 > Speaking with my 'relational database' hat on, I would think that 
 > keeping multiple copies of the same dataset is very problematic. 
 > Given how powerful the tools in the Python data science libraries 
 > are, I would argue for keeping one version (preferably the original)
 > of the source data, and putting tools or data processing chains into
 > place (these can be run in the background; for example, as Celery 
 > processes) that allow on-the-fly conversion to formats such as Excel
 > or JSON.  Add in some caching and the more common requests should be
 > able to be handled fairly efficiently.  
 > 
 > A longer term option would be to renegotiate with the upstream data 
 > provider to supply data in the format that most users seem to be 
 > asking for (or perhaps one that is most amenable for many 
 > transformation options).
 > 
 > Just some ideas for discussion.
 > 
 > Derek 
> 
 > >>> <Florian.Brucker at it.karlsruhe.de> 03/24/17 5:00 PM >>>
 > I've been thinking a bit about how to present the same resource in 
 > multiple formats to the user from a UI perspective. 
 > 
 > The obvious way is to create a separate copy of the resource for 
 > each secondary format (say, an XLSX-copy of each CSV-resource). This
 > has the benefit that the secondary resource is, from a UI 
 > perspective, just another resource, and all of CKAN's features 
 > (search, facets, API access, ...) work as expected. A disadvantage, 
 > however, is that we now have two (or even more) copies of the same 
 > resource that only differ in their format. Not only do all of 
 > those copies need to be kept in sync (can be automated, but still), but 
 > it might confuse users who now wonder if there are any differences 
 > between these resources. 
 > 
 > A second possibility would therefore be to somehow "augment" the 
 > original resource with the other formats. There are multiple ways of
 > doing this (e.g. injecting conversion links via the templates), but 
 > all of these will break many CKAN features. 
 > 
 > Finally, one could use a hybrid approach by creating full-blown 
 > resources as in the first approach but combining them into a single 
 > pseudo-resource for display purposes in the templates. 
 > 
 > Honestly I'm not happy with any of these approaches, so I'd love 
 > to hear some other ideas on how to tackle this. 
 > 
 > 
 > Regards, 
 > Florian
 > 
 > 
 > "ckan-dev" <ckan-dev-bounces at lists.okfn.org> wrote on 07.03.2017 14:19:24:
 > 
 > > From: Florian.Brucker at it.karlsruhe.de 
 > > To: ckan-dev at lists.okfn.org 
 > > Date: 07.03.2017 14:19 
 > > Subject: [ckan-dev] Offering the same resource in multiple file formats 
 > > Sent by: "ckan-dev" <ckan-dev-bounces at lists.okfn.org> 
 > > 
 > > Hi everybody, 
 > > 
 > > I often would like to offer the same resource in multiple file 
 > > formats. For example, Excel's auto-import for CSV is rather broken, 
 > > so instead of mangling all our CSV-files to suit Excel's needs I'd 
 > > rather just offer XLSX-files of the same data in addition to 
 > > "standard"-compliant CSV-files for everybody else. 
 > > 
 > > However, I definitely don't want to manually maintain the separate 
 > > versions. Has anybody set up automated ways of doing this? Off the 
 > > top of my head, I could imagine 
 > > 
 > > 1. Generating converted copies when the original resource is 
 > >    created/modified 
 > > 2. Generating converted copies when they are requested 
 > > 
 > > Both have their pros and cons, so I'd love to hear some real-world 
 > > experiences. 
 > > 
 > > In addition I'm wondering about the best way to present this choice 
 > > to the user. 
 > > 
 > > 
 > > Regards, 
 > > Florian
 > > _______________________________________________
 > > ckan-dev mailing list
 > > ckan-dev at lists.okfn.org
 > > https://lists.okfn.org/mailman/listinfo/ckan-dev
 > > Unsubscribe: https://lists.okfn.org/mailman/options/ckan-dev 
> This message is subject to the CSIR's copyright terms and 
 > conditions, e-mail legal notice, and implemented Open Document 
 > Format (ODF) standard. 
 > The full disclaimer details can be found at http://www.csir.co.za/
 > disclaimer.html. 
 > 
 > Please consider the environment before printing this email. 
