[ckan-dev] linking data in private S3 buckets

Anton Lundin anton at dohi.se
Wed Dec 17 15:21:45 UTC 2014


On 17 December, 2014 - Ian Ward wrote:

> On Wed, Dec 17, 2014 at 5:50 AM, Anton Lundin <anton at dohi.se> wrote:
> > On 24 February, 2014 - Nigel Babu wrote:
> >
> >> We don't have a timetable yet. It's still in the planning stage. We will
> >> definitely ask for comments on the list when we have a solid plan of how
> >> it's going to be implemented. You'll also be able to follow the bug when we
> >> start work.
> >
> > Started to take a look at this again. Are there any plans on how to
> > re-introduce other file storages than the local filesystem?
> ...
> > Another use case that this move to local-only storage breaks is if you
> > would like to have a scaling farm of webservers. Then you would need to
> > involve a network filesystem to keep the file storage consistent across
> > all webservers.
> > It also moves the heavy lifting, actually sending the files, to the
> > webserver, away from the storage solution optimized for this exact task.
> 
> I agree. Passing files through the web server isn't ideal.
> 
> Unfortunately, when users are uploading files to a private dataset
> they have an expectation that that file will be kept private. I don't
> know how to solve that when the files are stored on S3 or another
> service.
> 

For either S3 or GCS[1], it's simple: put a private ACL on the file and
it's private; put a public ACL on it and it's public. This can be decided
on a per-upload basis, and changed later by an API call to the storage
service.
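
As a rough sketch of that with boto3 (the bucket and key names here are
made up for illustration):

    import boto3

    s3 = boto3.client('s3')

    # Resource uploaded into a private dataset: keep the object private.
    s3.put_object_acl(Bucket='my-ckan-bucket',
                      Key='resources/abc123/data.csv',
                      ACL='private')

    # Dataset made public later: a single API call flips the ACL.
    s3.put_object_acl(Bucket='my-ckan-bucket',
                      Key='resources/abc123/data.csv',
                      ACL='public-read')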

When downloading private files, the CKAN code generates a signed URL to
the file, and with that the client can download the file for a certain
time window. For an example, see ckanext-s3archive[2].
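
A minimal sketch of that pattern (the extension itself uses the older
boto library; this uses boto3, and the names are invented):

    import boto3

    s3 = boto3.client('s3')

    # Signed URL that lets the client fetch the private object
    # straight from S3 for the next 15 minutes, then expires.
    url = s3.generate_presigned_url(
        'get_object',
        Params={'Bucket': 'my-ckan-bucket',
                'Key': 'resources/abc123/data.csv'},
        ExpiresIn=900)

CKAN can then answer the download request with a redirect to that URL,
so the file bytes never pass through the web server.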

> When the old code was removed I remember it was suggested we could
> add a plugin interface that would allow moving the file to a remote
> service as a queued background task, then when that is complete update
> the link in the resource. That approach should still work, and allow
> things like the datapusher to continue to work as well. Uploading to a
> remote service would have to be disabled for non-public datasets,
> though.
> 

It's sort of what ckanext-s3archive does, except for the queueing part:
it polls for files to archive instead.
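
Roughly, such a polling archiver amounts to something like this (a
hypothetical sketch, not the extension's actual code; the storage path
and bucket name are placeholders):

    import os

    import boto3

    s3 = boto3.client('s3')
    STORAGE_DIR = '/var/lib/ckan/resources'
    BUCKET = 'my-ckan-bucket'

    def archive_local_uploads():
        # Walk the local filestore and push anything found to S3.
        for dirpath, _dirs, files in os.walk(STORAGE_DIR):
            for name in files:
                path = os.path.join(dirpath, name)
                key = os.path.relpath(path, STORAGE_DIR)
                s3.upload_file(path, BUCKET, key)
                os.remove(path)
                # ...then point the resource URL in CKAN at S3.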

As for non-public datasets, see the answer above.

> Even better would be to allow uploads directly to that remote service.
> That would be a trickier interface to build (recognising incomplete
> uploads, etc), and it's not clear how to support private datasets, but
> it should perform much better and be much simpler on the web server
> side.
> 

For S3 storage, you can configure S3 to generate an event for you (SQS
or SNS in this case) when a new object is completed.
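
Setting that up with boto3 might look something like this (the bucket
name and queue ARN are placeholders):

    import boto3

    s3 = boto3.client('s3')

    # Ask S3 to post a message to an SQS queue whenever an object
    # upload completes anywhere in the bucket.
    s3.put_bucket_notification_configuration(
        Bucket='my-ckan-bucket',
        NotificationConfiguration={
            'QueueConfigurations': [{
                'QueueArn': 'arn:aws:sqs:eu-west-1:123456789012:ckan-uploads',
                'Events': ['s3:ObjectCreated:*'],
            }],
        })

A background worker can then consume that queue and update the resource
record in CKAN once the object exists.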

The other way to architect it is to not consider an upload completed
until the uploading client has signalled to CKAN that it is.
If a client aborts the upload mid-way, or completes it but never signals
that to CKAN, just clean those leftover bits out of the storage bucket.
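
A periodic cleanup job for that scheme could look roughly like this,
assuming a hypothetical confirmed_keys set fed from CKAN's database:

    from datetime import datetime, timedelta

    import boto3

    s3 = boto3.client('s3')
    BUCKET = 'my-ckan-bucket'
    GRACE = timedelta(hours=24)

    def clean_unconfirmed_uploads(confirmed_keys):
        # Delete objects that are older than the grace period and
        # that no client ever signalled to CKAN as complete.
        cutoff = datetime.utcnow() - GRACE
        listing = s3.list_objects_v2(Bucket=BUCKET)
        for obj in listing.get('Contents', []):
            stale = obj['LastModified'].replace(tzinfo=None) < cutoff
            if stale and obj['Key'] not in confirmed_keys:
                s3.delete_object(Bucket=BUCKET, Key=obj['Key'])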

> Are you interested in contributing development in one of these directions?
> 

Yes, we are. I'm currently investigating different strategies for how to
solve this problem for us.

The only downside is that, by my guesstimates, I probably can't justify
the engineering cost of cleaning this up and abstracting it into a good
interface on our own.

But if more people are interested in solving this, we can pool our
resources and solve it together.


//Anton


1. Google Cloud Storage.
2. https://github.com/ckan/ckanext-s3archive/blob/master/ckanext/s3archive/controller.py#L71

-- 
Anton Lundin

anton at dohi.se
+46702-161604

http://www.dohi.se/


