[ckan-dev] Harvest with multiple consumers / improving performance?

2rupnow at informatik.uni-hamburg.de
Fri Aug 24 15:10:31 UTC 2018


Hey Stefan,

 

For Hamburg, we use multiple consumers to harvest multiple sources at the same time.

 

Some (outdated!) code is publicly available on GitHub <https://github.com/transparenzportalhamburg/ckanext-distributed-harvester>.
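
To illustrate the general idea of running several consumers side by side, here is a minimal sketch. It is not the Hamburg implementation from the repository above. It only builds one command line per consumer process, using the paster commands documented in the ckanext-harvest README. The helper name consumer_commands, the consumer counts, and the config path are assumptions made for illustration, and whether parallel consumers behave well in a given setup still needs to be tested.

```python
import shlex


def consumer_commands(config_path, n_gather=1, n_fetch=3):
    """Build one paster command line per harvest consumer process.

    config_path, n_gather and n_fetch are illustrative assumptions;
    adjust them to your own installation.
    """
    base = ["paster", "--plugin=ckanext-harvest", "harvester"]
    commands = []
    for kind, count in (("gather_consumer", n_gather), ("fetch_consumer", n_fetch)):
        for _ in range(count):
            commands.append(base + [kind, "--config=" + config_path])
    return commands


if __name__ == "__main__":
    # Hypothetical config path; each printed line could back one
    # systemd unit or supervisor program entry.
    for cmd in consumer_commands("/etc/ckan/default/production.ini"):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Each resulting command would typically be run as its own daemon, in the same way as the single gather_consumer and fetch_consumer in the standard setup.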

 

Your guess about Solr is correct. We set “ckan.search.solr_commit = false” in our CKAN configuration and use the following in our solrconfig.xml:

    <autoSoftCommit>
      <maxTime>1800000</maxTime>
    </autoSoftCommit>

    <autoCommit>
      <maxTime>15000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
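
The maxTime values in these Solr settings are given in milliseconds. The small sketch below converts them to seconds to make the intervals easier to read. The wrapping updateHandler element is added here only so that the fragment parses as well-formed XML, and the function name commit_intervals is made up for this example.

```python
import xml.etree.ElementTree as ET

# The two settings from this mail, wrapped in a root element so that
# the fragment is well-formed XML (the wrapper is an assumption).
SNIPPET = """
<updateHandler>
  <autoSoftCommit>
    <maxTime>1800000</maxTime>
  </autoSoftCommit>
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
"""


def commit_intervals(xml_text):
    """Return the maxTime of each auto-commit setting, in seconds."""
    root = ET.fromstring(xml_text)
    intervals = {}
    for tag in ("autoSoftCommit", "autoCommit"):
        node = root.find(tag + "/maxTime")
        if node is not None:
            intervals[tag] = int(node.text) / 1000.0
    return intervals


intervals = commit_intervals(SNIPPET)
print(intervals)
```

In other words, Solr performs a hard commit at most every 15 seconds without opening a new searcher, and a soft commit at most every 30 minutes. As far as Solr's commit semantics go, this means newly harvested datasets can take up to 30 minutes to show up in search results, in exchange for far fewer expensive commits during harvesting.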

 

Regards,

Dennis

 

 

From: ckan-dev <ckan-dev-bounces at lists.okfn.org> On Behalf Of Stefan Oderbolz
Sent: Thursday, 23 August 2018 17:22
To: CKAN Development Discussions <ckan-dev at lists.okfn.org>
Subject: [ckan-dev] Harvest with multiple consumers / improving performance?

 

Hi,

 

Does anyone use multiple consumers to speed up the harvesting process?

We have a setup with around 30 harvest sources with different numbers of datasets (basically one source for each publisher). Most of these harvesters run daily. In our setup we have 1 gather_consumer and 1 fetch_consumer (as systemd daemons) and a cron job that runs regularly to start jobs or mark them as finished.

 

So far this is more or less the setup as described in ckanext-harvest (https://github.com/ckan/ckanext-harvest#setting-up-the-harvesters-on-a-production-server).

 

We regularly have the problem that one harvester with several thousand datasets clogs the queue and keeps the other harvesters from finishing earlier (i.e. they can only finish once the large ~3h harvest job is done).

 

Has anyone already made performance optimizations in such a scenario? I was wondering whether it's a good idea to simply start multiple consumers, so that datasets could be imported in parallel. I'm not sure whether this is supported or whether it leads to problems down the road (e.g. competing commits to Solr?). Or are you aware of other things that could be done to improve the overall performance?

 

 

Any hints would be greatly appreciated!

 

- Stefan




 

-- 

Liip AG  // Limmatstrasse 183 //  CH-8005 Zürich
Tel +41 43 500 39 80 // GnuPG 0x7B588C67 // www.liip.ch <http://www.liip.ch> 
