[ckan-dev] DataStore seems to grow indefinitely when harvesting

Stefan Oderbolz stefan.oderbolz at liip.ch
Wed Feb 1 21:03:54 UTC 2017


In the meantime I created an issue (
https://github.com/ckan/ckan/issues/3422) to tackle the deletion of
DataStore tables when resources are deleted. But looking at the DataStore,
and especially the state of the DataPusher, led me to believe that no one
really uses these components in production at the moment. Or am I wrong?
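
To make the idea of such a cleanup concrete, here is a minimal sketch of
its core step: working out which DataStore tables no longer belong to any
active resource. It assumes that each DataStore table is named after the ID
of the resource it was created for, and that names starting with an
underscore are internal and must be left alone. The function name
find_orphaned_tables is hypothetical and not part of CKAN. How the two
input lists are obtained, and how the orphans are then dropped (for
example via the datastore_delete action), is deliberately left open.

```python
from typing import Iterable, Set


def find_orphaned_tables(datastore_tables: Iterable[str],
                         active_resource_ids: Iterable[str]) -> Set[str]:
    """Return DataStore table names that no active resource refers to.

    Assumption: every DataStore table is named after a resource ID.
    Assumption: names starting with "_" are internal and never orphans.
    """
    active = set(active_resource_ids)
    return {
        table
        for table in datastore_tables
        if not table.startswith("_") and table not in active
    }


if __name__ == "__main__":
    # Made-up IDs for illustration only.
    tables = ["res-1", "res-2", "res-3", "_table_metadata"]
    active = ["res-1", "res-3"]
    print(sorted(find_orphaned_tables(tables, active)))
```

Keeping this step free of database or API calls means it can be tried out
against exported lists before anything is actually deleted.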

Also, the performance problems we encounter when harvesting ~1500 datasets
are a sign that this has not yet been tested under real load (and I mean:
compared to others, this is nothing). We are currently investigating on
several fronts to see why we have these problems, but I'm curious to hear
from others. We currently suspect the DataPusher (which runs into
timeouts). But maybe we even have problems on the PostgreSQL end and need
to spend some time improving our database server setup.

In case this all comes across as very negative, please don't get me wrong: we
are happy to contribute and will definitely give back our fixes and share
our experience!

Best regards, Stefan

On Fri, Jan 27, 2017 at 1:43 PM, Stefanie Taepke <stefanie.taepke at liip.ch>
wrote:

> Hey all!
>
> I would like to discuss how DataPusher and DataStore work for you in
> production, and I hope this is the right place for that.
>
> Right now we have set it up for our testing environment. Every time we
> harvest, the DataPusher is triggered and loads everything it can into the
> DataStore. I could see that every time we harvest, the datastore_default
> database grows, even though the original data did not change (much). Each
> harvest run for a single harvest job added 2 GB to the DataStore. I assume
> this is because resources are re-harvested every time something on a
> dataset changes.
>
> If I am not mistaken, there is no logic whatsoever for when data is deleted
> from the DataStore? In part this is great, as it means the endpoint of
> the data does not change if I want to build something with the data from
> the DataStore.
>
> I understand that, as explained in
> https://github.com/ckan/ckan/issues/3268, it is hard to determine whether
> the data itself has changed. Sure, we can implement it.
>
> What I am wondering is: how are you dealing with this if you use it in
> production? Does your database grow indefinitely, or am I missing something
> trivial? Is there something like a cleanup task that we can run (or
> implement), and are there any plans yet for how to tackle this if you have
> similar problems?
>
>
> Cheers and thank you for your thoughts and input,
> Stef


-- 
Liip AG  // Limmatstrasse 183 //  CH-8005 Zürich
Tel +41 43 500 39 80 // GnuPG 0x7B588C67 // www.liip.ch