[ckan-dev] Offering the same resource in multiple file formats

Florian.Brucker at it.karlsruhe.de Florian.Brucker at it.karlsruhe.de
Mon Mar 27 11:53:29 UTC 2017


Hi Derek,

thanks for your input!

> Speaking with my 'relational database' hat on, I would think that 
> keeping multiple copies of the same dataset is very problematic. 

I don't see a problem with multiple "copies" in different formats per se, 
but I do think that these copies need to be managed well. That's why I'm 
currently looking for ideas and experiences ;)

Regarding relational databases: not all of our data is relational, so I'd 
prefer a solution that doesn't make too many assumptions about the data.

> Given how powerful the tools in the Python data science libraries 
> are, I would argue for keeping one version (preferably the original)
> of the source data, and putting tools or data processing chains into
> place (these can be run in the background; for example, as Celery 
> processes) that allow on-the-fly conversion to formats such as Excel
> or JSON.

The problem is that a background task cannot produce the response to a web 
request that is already waiting. Hence I either need to do the conversion 
at request time in the foreground or do it before the actual request (this 
can be done in the background). Doing it in the foreground only works if 
the conversion is quick and doesn't require many resources, so this depends 
very much on the size of the data and the format it is stored in.
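
A minimal sketch of the "convert once, then serve the stored result" idea, 
using only the Python standard library. Everything named here is a 
hypothetical illustration and not part of CKAN: the function names 
`csv_to_json` and `cached_conversion`, the cache directory layout, and the 
choice of CSV to JSON as the example conversion. The cache key is a hash of 
the source file's content, so a modified source gets a fresh conversion.

```python
import csv
import hashlib
import json
from pathlib import Path


def csv_to_json(src: Path, dst: Path) -> None:
    """Convert a CSV file with a header row into a JSON list of objects."""
    with src.open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    dst.write_text(json.dumps(rows, ensure_ascii=False), encoding="utf-8")


def cached_conversion(src: Path, cache_dir: Path) -> Path:
    """Return the path of the JSON version of `src`, converting only if needed.

    Hypothetical helper: the cache key is the SHA-256 of the source content,
    so an unchanged source is never converted twice.
    """
    digest = hashlib.sha256(src.read_bytes()).hexdigest()
    dst = cache_dir / f"{digest}.json"
    if not dst.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        csv_to_json(src, dst)
    return dst
```

The same function could be called from a background job when a resource is 
created or modified (so the request only has to serve the stored file) or 
lazily on the first request. Hashing reads the whole source file, so for 
large files a cheaper key, such as a stored modification timestamp, might be 
preferable.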

> A longer term option would be to renegotiate with the upstream data 
> provider to supply data in the format that most users seem to be 
> asking for (or perhaps one that is most amenable for many 
> transformation options).

In our case the problem is not so much the data we get from upstream but 
the fact that different use cases of the same data often require different 
formats. For example, if you want to "manually" explore tabular data then 
Excel is a good choice, but for anything automated CSV is much nicer. 
Unfortunately Excel doesn't like the usual CSV format, so either we or 
some of the users will need to do some conversion. In such cases I'd love 
to make it as easy as possible for the users since there's no point in 
publishing data if it's difficult to use ;)
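
As an illustration of such a conversion, here is a hedged sketch that 
rewrites a CSV file into a variant that Excel tends to open more reliably. 
The thread does not say what exactly Excel dislikes, so the two adjustments 
below are assumptions: a UTF-8 byte order mark (so Excel detects the 
encoding) and a configurable delimiter (Excel's expected separator depends 
on the locale, and `;` is common in German locales). The function name 
`write_excel_friendly_csv` is hypothetical.

```python
import csv
from pathlib import Path


def write_excel_friendly_csv(src: Path, dst: Path, delimiter: str = ";") -> None:
    """Rewrite a UTF-8, comma-separated CSV file for opening in Excel.

    Assumptions (not from the thread): the source is UTF-8 and
    comma-separated, and the target Excel locale expects `delimiter`.
    """
    with src.open(newline="", encoding="utf-8") as fin, \
         dst.open("w", newline="", encoding="utf-8-sig") as fout:
        # "utf-8-sig" writes a byte order mark at the start of the file.
        writer = csv.writer(fout, delimiter=delimiter)
        # Re-parse and re-emit every row so quoting stays correct
        # for the new delimiter.
        writer.writerows(csv.reader(fin))
```

Producing a real XLSX file would need a third-party library, which is left 
out here to keep the sketch self-contained.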


Regards,
Florian


"Derek Hohls" <dhohls at csir.co.za> wrote on 27.03.2017 12:30:23:

> From: "Derek Hohls" <dhohls at csir.co.za>
> To: <Florian.Brucker at it.karlsruhe.de>, <ckan-dev at lists.okfn.org>
> Date: 27.03.2017 12:51
> Subject: Re: [ckan-dev] Offering the same resource in multiple file formats
> 
> Hi Florian
> 
> I am speaking somewhat "from the side" - our group is involved in a
> CKAN implementation but I am not at the core level. Nonetheless, a 
> large part of my work does deal with data ingestion and processing. 
> Speaking with my 'relational database' hat on, I would think that 
> keeping multiple copies of the same dataset is very problematic. 
> Given how powerful the tools in the Python data science libraries 
> are, I would argue for keeping one version (preferably the original)
> of the source data, and putting tools or data processing chains into
> place (these can be run in the background; for example, as Celery 
> processes) that allow on-the-fly conversion to formats such as Excel
> or JSON.  Add in some caching and the more common requests should be
> able to be handled fairly efficiently. 
> 
> A longer term option would be to renegotiate with the upstream data 
> provider to supply data in the format that most users seem to be 
> asking for (or perhaps one that is most amenable for many 
> transformation options).
> 
> Just some ideas for discussion.
> 
> Derek
> 
> >>> <Florian.Brucker at it.karlsruhe.de> 03/24/17 5:00 PM >>>
> I've been thinking a bit about how to present the same resource in 
> multiple formats to the user from a UI perspective. 
> 
> The obvious way is to create a separate copy of the resource for 
> each secondary format (say, an XLSX-copy of each CSV-resource). This
> has the benefit that the secondary resource is, from a UI 
> perspective, just another resource, and all of CKAN's features 
> (search, facets, API access, ...) work as expected. A disadvantage, 
> however, is that we now have two (or even more) copies of the same 
> resource that only differ in their format. Not only do all of 
> those copies need to be kept in sync (this can be automated, but 
> still), but it might also confuse users who now wonder if there are 
> any differences between these resources. 
> 
> A second possibility would therefore be to somehow "augment" the 
> original resource with the other formats. There are multiple ways of
> doing this (e.g. injecting conversion links via the templates), but 
> all of these will break many CKAN features. 
> 
> Finally, one could use a hybrid approach by creating full-blown 
> resources as in the first approach but combining them into a single 
> pseudo-resource for display purposes in the templates. 
> 
> Honestly, I'm not happy with any of these approaches, so I'd love 
> to hear some other ideas on how to tackle this. 
> 
> 
> Regards, 
> Florian
> 
> 
> "ckan-dev" <ckan-dev-bounces at lists.okfn.org> wrote on 07.03.2017 14:19:24:
> 
> > From: Florian.Brucker at it.karlsruhe.de 
> > To: ckan-dev at lists.okfn.org 
> > Date: 07.03.2017 14:19 
> > Subject: [ckan-dev] Offering the same resource in multiple file formats 
> > Sent by: "ckan-dev" <ckan-dev-bounces at lists.okfn.org> 
> > 
> > Hi everybody, 
> > 
> > I often would like to offer the same resource in multiple file 
> > formats. For example, Excel's auto-import for CSV is rather broken, 
> > so instead of mangling all our CSV-files to suit Excel's needs I'd 
> > rather just offer XLSX-files of the same data in addition to 
> > "standard"-compliant CSV-files for everybody else. 
> > 
> > However, I definitely don't want to manually maintain the separate 
> > versions. Has anybody set up automated ways of doing this? Off the 
> > top of my head, I could imagine 
> > 
> > 1. Generating converted copies when the original resource is 
> >    created/modified 
> > 2. Generating converted copies when they are requested 
> > 
> > Both have their pros and cons, so I'd love to hear some real-world 
> > experiences. 
> > 
> > In addition I'm wondering about the best way to present this choice 
> > to the user. 
> > 
> > 
> > Regards, 
> > Florian