[ckan-dev] Offering the same resource in multiple file formats

Mon Mar 27 18:24:50 UTC 2017

Florian,

Greg here at Link has been refactoring the spatial ingestor extension we
use on data.gov.au to work in a similar way to datapusher. It operates
during create or update events to generate related spatial resources. The
zip preview/extractor also has a few things worth looking at.
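
As a rough illustration only (this is not the ingestor's code, and every
name in it is made up for the example), the create/update pattern can be
sketched like this, with CSV to JSON standing in for the real conversions:

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Convert CSV text (first row = header) to a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)


# Hypothetical registry: source format -> {derived format: converter}.
CONVERTERS = {"csv": {"json": csv_to_json}}


def derive_resources(resource):
    """Build derived resources for one original resource.

    `resource` is a plain dict with "id", "format" and "content" keys. That
    shape is an assumption for this sketch, not CKAN's resource schema. The
    function is meant to be called from whatever hook fires on create/update.
    """
    derived = []
    for fmt, convert in CONVERTERS.get(resource["format"].lower(), {}).items():
        derived.append({
            "derived_from": resource["id"],  # link back to the original
            "format": fmt,
            "content": convert(resource["content"]),
        })
    return derived
```

Keeping the original's id on each derived copy is what would let a later
update find and regenerate the copies.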

I'll ask him to share some thoughts when he has time :)

Cheers,
Steven

On Tue, Mar 28, 2017 at 1:48 AM Derek Hohls <dhohls at csir.co.za> wrote:

> Hi Florian
>
> I can see we come from different backgrounds and have different
> expectations.
>
> First off, my database reference was more just to create an analogy and
> not a reference to actual data usage in CKAN.  In a relational database one
> tries to "normalise" the data so it's kept in one-and-only-one place;
> different views can then combine and join data to create the final dataset.
>
> But my point remains: keeping multiple, derived versions of the same
> dataset is going to be problematic from a management point of view.  There
> is a good reason why, in legal terms, there is just one original of a
> contract. As soon as you have three or more, then questions can arise as to
> what the "real" one is.  I am not sure of CKAN's capabilities in terms of
> treating a set of datasets as exactly the same thing (and managing this set
> via a single metadata record, for instance); if it can't already do so,
> then saving these new copies (not versions) is going to be very hard to
> manage.
>
> You say that creating different formats programmatically cannot be done
> through web requests.  I think it can.  It obviously depends on the size of
> the dataset; a 10TB file is not going to be transformable in a short space
> of time but a 1-10MB file can be.  It also depends on your client/user
> expectations - and these can be managed.  It's a choice for them - 'do you
> want this format right now?' or 'do you want to wait/come back in a few
> minutes for this other format?'
>
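
The "right now" versus "come back in a few minutes" choice could be
sketched roughly as below. The 10 MB cut-off and the function name are
assumptions for illustration, not anything CKAN provides:

```python
# Assumed cut-off, borrowed from the "1-10MB" figure in the paragraph above.
SYNC_LIMIT_BYTES = 10 * 1024 * 1024


def plan_conversion(size_bytes, already_converted):
    """Decide how to answer a request for a derived format.

    Returns "serve" if a converted copy already exists, "convert_now" if the
    source is small enough to convert within the request, and "queue" if the
    user should be asked to come back later.
    """
    if already_converted:
        return "serve"
    if size_bytes <= SYNC_LIMIT_BYTES:
        return "convert_now"
    return "queue"
```

In practice the threshold would probably need tuning per format, since
conversion cost does not depend on file size alone.
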
> Not sure that this will help you; but if my boss asked me about this
> problem, the above is the route I would suggest taking.
>
> Derek
>
>
> >>> <Florian.Brucker at it.karlsruhe.de> 03/27/17 1:53 PM >>>
> Hi Derek,
>
> thanks for your input!
>
> > Speaking with my 'relational database' hat on, I would think that
> > keeping multiple copies of the same dataset is very problematic.
>
> I don't see a problem with multiple "copies" in different formats per se,
> but I do think that these copies need to be managed well. That's why I'm
> currently looking for ideas and experiences ;)
>
> Regarding relational databases: not all of our data is relational, so I'd
> prefer a solution that doesn't make too many assumptions about the data.
>
> > Given how powerful the tools in the Python data science libraries
> > are, I would argue for keeping one version (preferably the original)
> > of the source data, and putting tools or data processing chains into
> > place (these can be run in the background; for example, as Celery
> > processes) that allow on-the-fly conversion to formats such as Excel
> > or JSON.
>
> The problem is that background tasks won't work for answering web
> requests. Hence I either need to do the conversion at request time in the
> foreground or do it before the actual request (this can be done in the
> background). Doing it in the foreground only works if the conversion is
> quick and doesn't require many resources, so this depends very much on the
> size of the data and the format it is stored in.
>
> > A longer term option would be to renegotiate with the upstream data
> > provider to supply data in the format that most users seem to be
> > asking for (or perhaps one that is most amenable for many
> > transformation options).
>
> In our case the problem is not so much the data we get from upstream but
> the fact that different use cases of the same data often require different
> formats. For example, if you want to "manually" explore tabular data then
> Excel is a good choice, but for anything automated CSV is much nicer.
> Unfortunately Excel doesn't like the usual CSV format, so either we or some
> of the users will need to do some conversion. In such cases I'd love to
> make it as easy as possible for the users since there's no point in
> publishing data if it's difficult to use ;)
>
>
> Regards,
> Florian
>
>
> "Derek Hohls" <dhohls at csir.co.za> wrote on 27.03.2017 12:30:23:
>
> > From: "Derek Hohls" <dhohls at csir.co.za>
> > To: <Florian.Brucker at it.karlsruhe.de>, <ckan-dev at lists.okfn.org>,
> > Date: 27.03.2017 12:51
> > Subject: Re: [ckan-dev] Offering the same resource in multiple file
> > formats
> >
> > Hi Florian
> >
> > I am speaking somewhat "from the side" - our group is involved in a
> > CKAN implementation but I am not at the core level. Nonetheless,  a
> > large part of my work does deal with data ingestion and processing.
> > Speaking with my 'relational database' hat on, I would think that
> > keeping multiple copies of the same dataset is very problematic.
> > Given how powerful the tools in the Python data science libraries
> > are, I would argue for keeping one version (preferably the original)
> > of the source data, and putting tools or data processing chains into
> > place (these can be run in the background; for example, as Celery
> > processes) that allow on-the-fly conversion to formats such as Excel
> > or JSON.  Add in some caching and the more common requests should be
> > able to be handled fairly efficiently.
> >
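
A minimal sketch of the caching idea, assuming an in-memory store and a
caller-supplied converter (the class and method names are hypothetical):

```python
import hashlib


class ConversionCache:
    """In-memory cache of converted output, keyed by source content and format."""

    def __init__(self):
        self._store = {}

    def get_or_convert(self, source_bytes, target_format, convert):
        # Hashing the content means an updated source never hits a stale entry.
        key = (hashlib.sha256(source_bytes).hexdigest(), target_format)
        if key not in self._store:
            self._store[key] = convert(source_bytes)
        return self._store[key]
```

Keying on the content hash rather than the resource id avoids explicit
invalidation, at the cost of hashing the source on every request.
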
> > A longer term option would be to renegotiate with the upstream data
> > provider to supply data in the format that most users seem to be
> > asking for (or perhaps one that is most amenable for many
> > transformation options).
> >
> > Just some ideas for discussion.
> >
> > Derek
> >
> > >>> <Florian.Brucker at it.karlsruhe.de> 03/24/17 5:00 PM >>>
> > I've been thinking a bit about how to present the same resource in
> > multiple formats to the user from a UI perspective.
> >
> > The obvious way is to create a separate copy of the resource for
> > each secondary format (say, an XLSX-copy of each CSV-resource). This
> > has the benefit that the secondary resource is, from a UI
> > perspective, just another resource, and all of CKAN's features
> > (search, facets, API access, ...) work as expected. A disadvantage,
> > however, is that we now have two (or even more) copies of the same
> > resource that only differ in their format. Not only do all of those
> > copies need to be kept in sync (can be automated, but still), but
> > it might confuse users who now wonder if there are any differences
> > between these resources.
> >
> > A second possibility would therefore be to somehow "augment" the
> > original resource with the other formats. There are multiple ways of
> > doing this (e.g. injecting conversion links via the templates), but
> > all of these will break many CKAN features.
> >
> > Finally, one could use a hybrid approach by creating full-blown
> > resources as in the first approach but combining them into a single
> > pseudo-resource for display purposes in the templates.
> >
> > Honestly I'm not happy with any of these approaches, so I'd love
> > to hear some other ideas on how to tackle this.
> >
> >
> > Regards,
> > Florian
> >
> >
> > "ckan-dev" <ckan-dev-bounces at lists.okfn.org> wrote on 07.03.2017
> > 14:19:24:
> >
> > > From: Florian.Brucker at it.karlsruhe.de
> > > To: ckan-dev at lists.okfn.org,
> > > Date: 07.03.2017 14:19
> > > Subject: [ckan-dev] Offering the same resource in multiple file
> > > formats
> > > Sent by: "ckan-dev" <ckan-dev-bounces at lists.okfn.org>
> > >
> > > Hi everybody,
> > >
> > > I often would like to offer the same resource in multiple file
> > > formats. For example, Excel's auto-import for CSV is rather broken,
> > > so instead of mangling all our CSV-files to suit Excel's needs I'd
> > > rather just offer XLSX-files of the same data in addition to
> > > "standard"-compliant CSV-files for everybody else.
> > >
> > > However, I definitely don't want to manually maintain the separate
> > > versions. Has anybody set up automated ways of doing this? Off the
> > > top of my head, I could imagine
> > >
> > > 1. Generating converted copies when the original resource is
> > >    created/modified
> > > 2. Generating converted copies when they are requested
> > >
> > > Both have their pros and cons, so I'd love to hear some real-world
> > > experiences.
> > >
> > > In addition I'm wondering about the best way to present this choice
> > > to the user.
> > >
> > >
> > > Regards,
> > > Florian
> > > _______________________________________________
> > > ckan-dev mailing list
> > > ckan-dev at lists.okfn.org
> > > https://lists.okfn.org/mailman/listinfo/ckan-dev
> > > Unsubscribe: https://lists.okfn.org/mailman/options/ckan-dev
> > This message is subject to the CSIR's copyright terms and
> > conditions, e-mail legal notice, and implemented Open Document
> > Format (ODF) standard.
> > The full disclaimer details can be found at
> > http://www.csir.co.za/disclaimer.html.
> >
> > Please consider the environment before printing this email.
-- 
STEVEN DE COSTA | EXECUTIVE DIRECTOR
www.linkdigital.com.au