[ckan-dev] DataStore seems to grow indefinitely when harvesting

Matthew Fullerton matt.fullerton at gmail.com
Thu Feb 2 09:25:03 UTC 2017


Hi Stefan,
Although I am an intensive DataStore user, I would also have a really bad
feeling about letting the DataPusher run automatically for harvested datasets.
So in my view: DataStore and DataPusher are important for production (yes, they
are used in production), and harvesting is important for production (obviously),
but I haven't yet come across both used together.
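
As a rough illustration of the cleanup task Stef asks about below, here is a
minimal sketch in Python. It assumes that each DataStore table is named after
the id of the resource it holds, and that you have already collected the list
of table names and the list of active resource ids yourself. The function name
and the example ids are hypothetical, not part of CKAN.

```python
from typing import Iterable, List


def find_orphaned_tables(datastore_tables: Iterable[str],
                         active_resource_ids: Iterable[str]) -> List[str]:
    """Return DataStore table names that match no active resource.

    Assumption: every DataStore table is named after the id of the
    resource whose data it stores.
    """
    active = set(active_resource_ids)
    # Deduplicate the table names and sort them for a stable result.
    return sorted(t for t in set(datastore_tables) if t not in active)


if __name__ == "__main__":
    # Hypothetical ids, for illustration only.
    tables = ["res-a", "res-b", "res-c"]
    active = ["res-a", "res-c"]
    print(find_orphaned_tables(tables, active))
```

Each table reported this way is a candidate for removal, for example through
the DataStore's datastore_delete action, but I would check the list by hand
before deleting anything.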

Best,
Matt

On 1 February 2017 at 22:03, Stefan Oderbolz <stefan.oderbolz at liip.ch>
wrote:

> In the meantime I created an issue
> (https://github.com/ckan/ckan/issues/3422) to tackle the deletion of
> DataStore tables when resources are deleted. But looking at the DataStore,
> and especially the state of the DataPusher, led me to believe that no one
> really uses these components in production at the moment. Or am I wrong?
>
> Also, the performance problems we encounter when harvesting ~1500 datasets
> are a sign that this has not yet been tested under real load (and compared
> to others, 1500 datasets is nothing). We are currently investigating from
> several angles to see why we have these problems, but I'm curious to hear
> from others. We currently suspect the DataPusher (which runs into
> timeouts). But maybe we also have problems on the PostgreSQL end and need
> to spend some time improving our database server setup.
>
> In case this all comes across as very negative, please don't get me wrong:
> we are happy to contribute and will definitely give back our fixes and
> share our experience!
>
> Best regards Stefan
>
> On Fri, Jan 27, 2017 at 1:43 PM, Stefanie Taepke <stefanie.taepke at liip.ch>
> wrote:
>
>> Hey all!
>>
>> I would like to discuss how DataPusher and DataStore work for you in
>> production, and I hope this is the right place for that.
>>
>> Right now we have set it up for our testing environment. Every time we
>> harvest, the DataPusher is triggered and loads everything it can into the
>> DataStore. I could see that every time we harvest, the
>> datastore_default database grows, even though the original data did not
>> change (much). Each harvest run for a single harvest job added 2GB to the
>> DataStore. I assume this is a result of resources being re-harvested
>> every time something on a dataset changes.
>>
>> If I am not mistaken, there is no logic whatsoever for when data is
>> deleted from the DataStore? In part this is great, as it means the
>> endpoint of the data does not change if I want to build something with the
>> data from the DataStore.
>>
>> I understand, as explained in https://github.com/ckan/ckan/issues/3268,
>> that it is hard to detect whether there have been changes to the data
>> itself. Sure, we can implement it.
>>
>> What I am wondering is: how are you dealing with this if you use it in
>> production? Does your database grow indefinitely, or am I missing something
>> trivial? Is there something like a cleanup task that we can run (or
>> implement), and are there any plans yet for how to tackle this if you have
>> similar problems?
>>
>>
>> Cheers and thank you for your thoughts and input,
>> Stef
>>
>> _______________________________________________
>> ckan-dev mailing list
>> ckan-dev at lists.okfn.org
>> https://lists.okfn.org/mailman/listinfo/ckan-dev
>> Unsubscribe: https://lists.okfn.org/mailman/options/ckan-dev
>>
>>
>
>
> --
> Liip AG  // Limmatstrasse 183 //  CH-8005 Zürich
> Tel +41 43 500 39 80 // GnuPG 0x7B588C67 //
> www.liip.ch