[ckan-discuss] Harvester questions/issues

Adrià Mercader adria.mercader at okfn.org
Wed Oct 30 10:25:22 GMT 2013


Hi Stéphane,

I recommend clearing the source, restarting all the consumers and
trying again. Clearing the source needs to be done via the web
interface (in the Admin section of the source), and will delete all
jobs, datasets etc. for that particular source. (The purge_queue
command only removes items from the actual queue.)

If everything went fine, you should see some logging in the gather
consumer console straight away after running the run command.
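
To make the order explicit, here is a small sketch that only prints
the commands, in the order I would run them, rather than executing
them. It reuses the commands and the ini path from your message; the
function names (harvester_cmd, restart_sequence) are just made up for
the example.

```shell
#!/bin/sh
# Sketch only: prints the commands in order, it does not run them.
# CONFIG is the ini path from the quoted message; adjust for your install.
CONFIG=/etc/ckan/default/production.ini

harvester_cmd() {
    # $1 is the harvester subcommand (gather_consumer, fetch_consumer, run)
    echo "paster --plugin=ckanext-harvest harvester $1 --config=$CONFIG"
}

restart_sequence() {
    # Consumers first, each in its own console, then the run command.
    harvester_cmd gather_consumer
    harvester_cmd fetch_consumer
    harvester_cmd run
}

restart_sequence
```

The two consumers need to be up and listening before the run command
sends the job to the gather queue.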

If problems persist it might be worth checking whether switching to
Redis solves the issue; we've found it to be more reliable.
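
As far as I remember, the backend is selected with a
ckan.harvest.mq.type setting in the ini file (set to redis), but
please check the name against the ckanext-harvest README for your
version. A quick way to see what an ini file currently says, assuming
that setting name (harvest_mq_type is a made-up helper):

```shell
# Print the harvest queue backend configured in a CKAN ini file.
# Assumption: the setting is called ckan.harvest.mq.type; verify this
# against the ckanext-harvest README before relying on it.
harvest_mq_type() {
    # $1 is the path to the ini file
    sed -n 's/^[[:space:]]*ckan\.harvest\.mq\.type[[:space:]]*=[[:space:]]*//p' "$1"
}
```

For example, harvest_mq_type /etc/ckan/default/production.ini prints
nothing if the setting is absent from the file.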

For CKAN harvesters you need to set the root CKAN URL (e.g.
http://demo.ckan.org), and for CSW servers it shouldn't matter whether
you point to the root of the server or to the GetCapabilities request.
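
To illustrate the two CSW forms, this sketch strips the query string
from a GetCapabilities URL to get the bare server endpoint. It only
shows how the two URLs relate; I am not claiming this is what the
harvester does internally, and csw_endpoint is a made-up name.

```shell
# Reduce a CSW URL to its bare endpoint by dropping the query string.
csw_endpoint() {
    # Remove everything from the first "?" onwards; URLs without a
    # query string are returned unchanged.
    echo "${1%%\?*}"
}

csw_endpoint 'http://www.donnees.gouv.qc.ca/geonetwork/srv/eng/csw?SERVICE=CSW&request=GetCapabilities&AcceptVersion=2.0.2'
```

Either the full GetCapabilities URL or the endpoint printed here
should be usable as the source URL.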

Let us know how it works for you,

Adrià




On 29 October 2013 16:18, Stéphane Guidoin <stephane at opennorth.ca> wrote:
> Hello,
>
> I am currently evaluating how to automate the maintenance of datasets for a
> CKAN implementation (City of Montréal).
>
> I am looking at the harvester and I am not able to get it working.
>
> So I installed both CKAN and CSW harvesters following the only guide
> (https://github.com/okfn/ckanext-harvest) using RabbitMQ.
>
> I am able to access http://my_instance:5000/harvest, the harvester CLI basic
> commands work (I am able to add sources, list them, etc.)
>
> In order to run a manual job, I did the following:
>
> Add the source (CSW):
> paster --plugin=ckanext-harvest harvester source mysource
> "http://www.donnees.gouv.qc.ca/geonetwork/srv/eng/csw?SERVICE=CSW&request=GetCapabilities&AcceptVersion=2.0.2"
> csw "My Source" --config=/etc/ckan/default/production.ini
>
> Add a job for this source
> paster --plugin=ckanext-harvest harvester job
> 01aa2038-678b-4ce5-972e-ad4eaf9198f6
> --config=/etc/ckan/default/production.ini
>
> (If I ask for the list of jobs, this one appears as "new")
>
> Then, in different consoles, I start the gather and the fetch:
> paster --plugin=ckanext-harvest harvester gather_consumer
> --config=/etc/ckan/default/production.ini
> paster --plugin=ckanext-harvest harvester fetch_consumer
> --config=/etc/ckan/default/production.ini
>
> And in a third console, I launch the run command:
> paster --plugin=ckanext-harvest harvester run
> --config=/etc/ckan/default/production.ini
>
> After this, I receive a message telling me the job has been sent to the
> gather queue
> 2013-10-28 14:13:40,550 INFO  [ckanext.harvest.logic.action.update] Sent job
> 97bd1013-9160-4359-9b5a-243fbd8bc30a to the gather queue
>
> But nothing happens: the gather consumer does nothing and no process runs,
> even if I leave this running for hours. The job appears as running.
>
> If I try to delete the job (purge_queue), the job remains there, running. If
> I add the job again, it tells me there is already a job for that source, but
> if I try to start the "run" again, it tells me there is no new job.
>
> How is it supposed to behave? Did I miss something? Any idea if/where I
> could find relevant logs? (RabbitMQ's log in /var/log/rabbitmq does not show
> anything...)
>
> Slightly related question: what is the URL that should be configured for
> both the CKAN and CSW harvesters? For CKAN, should I link the CKAN root page
> or directly the API? For CSW, do I point to the basic CSW resource or to the
> "GetCapabilities" request?
>
> Thank you
>
> Steph
>
>
> _______________________________________________
> ckan-discuss mailing list
> ckan-discuss at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/ckan-discuss
> Unsubscribe: http://lists.okfn.org/mailman/options/ckan-discuss
>


