[ckan-dev] Harvest with multiple consumers / improving performance?

Thu Aug 23 15:21:36 UTC 2018

Hi,

Does anyone use multiple consumers to speed up the harvesting process?
We have a setup with around 30 harvest sources with varying numbers of
datasets (basically one source for each publisher). Most of these harvesters
run daily, and in our setup we have 1 gather_consumer and 1 fetch_consumer
(as systemd daemons) plus a cronjob that runs regularly to start jobs or
mark them as finished.

So far this is more or less the setup as described in ckanext-harvest (
https://github.com/ckan/ckanext-harvest#setting-up-the-harvesters-on-a-production-server
).

We regularly have the problem that one harvester clogs the queue because it
has several thousand datasets, which keeps the other harvesters from
finishing earlier (i.e. they can only finish once the large ~3h harvest job
is done).

Has anyone already made performance optimizations in such a scenario?
I was wondering if it's a good idea to simply start multiple consumers, so
that datasets could be imported in parallel. I'm not sure whether this is
supported or whether it leads to problems down the road (e.g. competing
commits to Solr?). Or are you aware of other things that could be done to
improve the overall performance?
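
To make the question more concrete, here is a toy Python model of what I
have in mind. It is only a sketch and not ckanext-harvest code: the function
name finish_times, the source names and the fixed cost per dataset are all
made up for illustration, and it assumes the fetch queue is strictly
first-in-first-out with every dataset taking the same time.

```python
import heapq


def finish_times(jobs, consumers, seconds_per_dataset=1.0):
    """Toy model of a FIFO fetch queue shared by `consumers` workers.

    jobs: list of (source_name, dataset_count) in the order they were queued.
    Returns a dict mapping each source to the time its last dataset is done.
    Hypothetical helper for illustration, not part of ckanext-harvest.
    """
    # Each heap entry is the time at which one consumer becomes free.
    free_at = [0.0] * consumers
    heapq.heapify(free_at)
    done = {}
    for name, count in jobs:
        for _ in range(count):
            # The next queued dataset goes to whichever consumer frees up first.
            start = heapq.heappop(free_at)
            end = start + seconds_per_dataset
            heapq.heappush(free_at, end)
            done[name] = max(done.get(name, 0.0), end)
    return done


if __name__ == "__main__":
    # One large source queued ahead of a small one (made-up numbers).
    queue = [("large_source", 9000), ("small_source", 100)]
    for n in (1, 4):
        print(n, "consumer(s):", finish_times(queue, consumers=n))
```

In this simplified model the small source still finishes after the large
one, but with four consumers the wait shrinks to roughly a quarter. Whether
the real consumers behave like this (and whether they can safely run in
parallel) is exactly what I'd like to find out.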

Any hints would be greatly appreciated!

- Stefan

-- 
Liip AG  // Limmatstrasse 183 //  CH-8005 Zürich
Tel +41 43 500 39 80 // GnuPG 0x7B588C67 // www.liip.ch