[ckan-dev] Harvest cron removal

Adrià Mercader adria.mercader at okfn.org
Thu Aug 1 09:39:46 UTC 2013


Hi,

I agree that the current behaviour is confusing (especially the 15
minutes message, as the actual wait is rarely 15 minutes), but there
are a number of related issues here.

Given the async nature of the harvesting, there needs to be a task
that regularly checks the status of the current jobs and performs
certain actions (all code for these in [1]):

- Mark jobs as finalized (this concept was not present in CKAN 1.8).
Basically, if a job has finished the gather stage and all its objects
have a status of "complete" or "error", the job is flagged as
finished and the finish time is stored, allowing new jobs to be created
(refreshed) on the source.
- Reindex the harvest source dataset when the job finishes, so that it
has the latest status and the updated counts, etc. can be shown on the
frontend
- Schedule jobs that have non-manual frequency (daily, weekly, etc)
- Resend previously failed jobs and objects on the Redis backend

These tasks need to be run regularly, and right now the "run" command,
run as a cron job, seems like the natural place because it needs to be
run anyway.
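
As a rough illustration of the first task in the list above (marking
jobs as finalized), here is a minimal Python sketch. It is not the code
in [1]: the HarvestJob and HarvestObject classes, their field names and
the finalize_if_done function are hypothetical stand-ins, and the
lowercase state names simply mirror the wording used above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

# Object states that count as "done" (names assumed from the description above).
FINAL_OBJECT_STATES = {"complete", "error"}


@dataclass
class HarvestObject:  # hypothetical stand-in for a harvest object
    state: str


@dataclass
class HarvestJob:  # hypothetical stand-in for a harvest job
    gather_finished: Optional[datetime] = None
    objects: List[HarvestObject] = field(default_factory=list)
    status: str = "Running"
    finished: Optional[datetime] = None


def finalize_if_done(job: HarvestJob, now: Optional[datetime] = None) -> bool:
    """Flag the job as finished once gathering is over and every object
    is either complete or in error. Returns True if the job was finalized."""
    if job.status == "Finished":
        return False  # already finalized on a previous run
    if job.gather_finished is None:
        return False  # the gather stage is still in progress
    if any(obj.state not in FINAL_OBJECT_STATES for obj in job.objects):
        return False  # at least one object is still being processed
    job.status = "Finished"
    job.finished = now or datetime.now(timezone.utc)
    return True
```

A periodic task would call something like finalize_if_done on every
running job, and only then reindex the source and allow a new job to be
created for it.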

Having said that, we could think of a new behaviour for the "Refresh"
button where instead of just creating a new job (if there isn't
another one pending or running) and waiting for the run command to be
run, we instead send the job immediately to the gather queue, thus
avoiding the 3 to 15 minute wait and the confusing notice.
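
The proposed "Refresh" behaviour could look roughly like the Python
sketch below. Everything in it is an assumption made for illustration:
the Job class, the status names, the refresh_source function and the
in-memory deque standing in for the gather queue are hypothetical, not
the ckanext-harvest API.

```python
from collections import deque
from dataclasses import dataclass
from itertools import count

# Statuses that mean a job is pending or running (names are assumptions).
ACTIVE_STATUSES = {"New", "Running"}

_job_ids = count(1)  # simple id generator for the sketch


@dataclass
class Job:  # hypothetical stand-in for a harvest job
    id: int
    source_id: str
    status: str = "New"


def refresh_source(source_id, jobs, gather_queue):
    """Create a job and send it to the gather queue immediately,
    unless the source already has a pending or running job."""
    for job in jobs:
        if job.source_id == source_id and job.status in ACTIVE_STATUSES:
            return None  # do not create a second job for this source
    job = Job(id=next(_job_ids), source_id=source_id)
    jobs.append(job)
    gather_queue.append(job.id)  # queued right away, no wait for the cron
    job.status = "Running"
    return job


jobs = []
gather_queue = deque()
first = refresh_source("my-source", jobs, gather_queue)
second = refresh_source("my-source", jobs, gather_queue)  # ignored, job active
```

Here the second click on "Refresh" does nothing because the first job is
still active, which is the guard that the cron interval provides
indirectly today.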

I haven't thought through all the implications of this, but I think it
should be fine. The cron job, though, will still need to be run to
perform the previous tasks (maybe with a different name, "check_status"
or similar). We will still need to keep the current setup for
compatibility as well.

Hope this makes sense,

Adrià


[1] https://github.com/okfn/ckanext-harvest/blob/master/ckanext/harvest/logic/action/update.py#L277

On 1 August 2013 09:10, David Raznick <david.raznick at okfn.org> wrote:
> Hello
>
> For data.gov we lowered the time to 3 mins.
> The Redis consumer already notices duplicates on the gather queue, which is why
> we lowered the time.  It could do a better job though of also making sure it
> does not re-add tasks that are currently running.
>
> The other concern with adding items to the gather stage immediately is that
> it is quite common for the harvesters and queue to be on a different server
> than the web server.  It is also common for these to have no communication
> with each other as all they do is look at the same db.  So whatever happens
> here we will have to look at keeping the old behaviour too.
>
> Thanks
>
> David
>
>
> On Tue, Jul 30, 2013 at 2:24 PM, David Read <david.read at hackneyworkshop.com>
> wrote:
>>
>> Cheers Ross,
>>
>> David
>>
>> On 30 July 2013 13:37, Ross Thompson <ross.thompson.ca at gmail.com> wrote:
>> > We (Canada/data.gc.ca) are not using the harvester at the moment, so
>> > this
>> > does not have an impact on us.
>> >
>> > Thanks.
>> >
>> >
>> >
>> > On 30 July 2013 04:56, David Read <david.read at hackneyworkshop.com>
>> > wrote:
>> >>
>> >> I'm keen to make the CKAN harvester start immediately when you hit the
>> >> 'refresh source' button. Currently it has to wait for the 'harvester
>> >> run' cron which is usually configured to occur only every 15 minutes.
>> >>
>> >> I get plenty of support calls from organisations who are setting up a
>> >> harvester and are iterating through problems. They are further frustrated
>> >> by this curious wait each time. Since the wait time is somewhat
>> >> 'unknown' it also breeds distrust. And we all hate being made to
>> >> context-switch away while debugging.
>> >>
>> >> From conversations with James, the 15 minute cron was originally
>> >> designed to ensure that a harvest source couldn't be harvested more
>> >> than once simultaneously and get confused. Since the harvest job
>> >> becomes multiple gathers and further split into fetches, James was
>> >> keen to make sure a source could only have one job for each run of the
>> >> cron. BUT I'm sure we could get the backend (Redis if not RabbitMQ) to
>> >> tell us the status of the job (and its related gather/fetches) and
>> >> stop further ones being created until it is done.
>> >>
>> >> Would the various CKAN sites using the harvester (OKF, Fraunhofer,
>> >> Canada, Washington etc.) be happy about this change? It would be on
>> >> the CKAN 2.x code.
>> >>
>> >> Dave
>> >>
>> >> _______________________________________________
>> >> ckan-dev mailing list
>> >> ckan-dev at lists.okfn.org
>> >> http://lists.okfn.org/mailman/listinfo/ckan-dev
>> >> Unsubscribe: http://lists.okfn.org/mailman/options/ckan-dev