[ckan-dev] Harvest cron removal

David Read david.read at hackneyworkshop.com
Thu Aug 1 10:32:35 UTC 2013


Thanks David and Adria. I'll bear in mind these constraints and hope
to get some time to try something out in the next couple of months.

David

On 1 August 2013 10:39, Adrià Mercader <adria.mercader at okfn.org> wrote:
> Hi,
>
> I agree that the current behaviour is confusing (especially the 15
> minute message, as the actual wait is rarely 15 minutes), but there
> are a number of related issues here.
>
> Given the async nature of the harvesting, there needs to be a task
> that regularly checks the status of the current jobs and performs
> certain actions (all code for these in [1]):
>
> - Mark jobs as finalized (this concept was not present in ckan 1.8).
> Basically if a job has finished the gather stage and all its objects
> have a status of "complete" or "error", the job is flagged as
> finished, and the finish time stored, allowing new jobs to be created
> (refreshed) on the source.
> - Reindex the harvest source dataset when the job finishes so that
> the latest status, the updated counts, etc. can be shown on the
> frontend
> - Schedule jobs that have non-manual frequency (daily, weekly, etc)
> - Resend previously failed jobs and objects on the Redis backend
>
> These tasks need to be run regularly, and right now the "run" command
> run as a cron job seems like the natural place because it needs to be
> run anyway.
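
The "mark jobs as finalized" rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the code linked at [1]: the `HarvestJob` and `HarvestObject` classes, their field names, and the "Running"/"Finished" status strings are hypothetical stand-ins for the real models.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

# Object states that count as "done", as described in the message above.
FINAL_OBJECT_STATES = {"complete", "error"}


@dataclass
class HarvestObject:
    # Hypothetical stand-in for a harvest object; only its state matters here.
    state: str


@dataclass
class HarvestJob:
    # Hypothetical stand-in for a harvest job.
    gather_finished: Optional[datetime]  # None while the gather stage is running
    objects: List[HarvestObject] = field(default_factory=list)
    status: str = "Running"
    finished: Optional[datetime] = None


def finalize_if_done(job: HarvestJob, now: Optional[datetime] = None) -> bool:
    """Flag the job as finished if gathering is over and every object is done.

    Returns True only when this call changed the job's status.
    """
    if job.status == "Finished":
        return False
    if job.gather_finished is None:
        return False
    if any(obj.state not in FINAL_OBJECT_STATES for obj in job.objects):
        return False
    job.status = "Finished"
    job.finished = now or datetime.now(timezone.utc)
    return True
```

A periodic task would call `finalize_if_done` on each unfinished job and, for those that return True, go on to reindex the source dataset.
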
>
> Having said that, we could think of a new behaviour for the "Refresh"
> button where instead of just creating a new job (if there isn't
> another one pending or running) and waiting for the run command to be
> run, we instead send the job immediately to the gather queue, thus
> avoiding the 3 to 15 minute wait and the confusing notice.
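
A rough sketch of that "Refresh" behaviour, assuming the web process can reach the gather queue directly (which, as noted further down the thread, is not true of every deployment). The `refresh_source` function, the dict-based job records, and the in-memory `deque` standing in for the real queue are all hypothetical.

```python
from collections import deque

# Statuses that block a new job on the same source (names are assumptions).
ACTIVE_STATES = {"New", "Running"}


def refresh_source(source_id, jobs, gather_queue):
    """Create a job and queue it for gathering straight away.

    `jobs` is a list of dicts with "id", "source_id" and "status" keys.
    Returns the new job, or None if the source already has an active job.
    """
    for job in jobs:
        if job["source_id"] == source_id and job["status"] in ACTIVE_STATES:
            return None  # a pending or running job exists; do nothing
    job = {"id": len(jobs) + 1, "source_id": source_id, "status": "Running"}
    jobs.append(job)
    gather_queue.append(job["id"])  # sent immediately, no wait for the cron
    return job


jobs, gather_queue = [], deque()
refresh_source("source-a", jobs, gather_queue)  # queued at once
refresh_source("source-a", jobs, gather_queue)  # ignored while the first job is active
```
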
>
> I haven't thought about all the implications of this but I think it
> should be fine. The cron job, though, will still need to be run to
> perform the previous tasks (maybe with a different name, "check_status"
> or similar). We will still need to keep the current setup for
> compatibility as well.
>
> Hope this makes sense,
>
> Adrià
>
>
> [1] https://github.com/okfn/ckanext-harvest/blob/master/ckanext/harvest/logic/action/update.py#L277
>
> On 1 August 2013 09:10, David Raznick <david.raznick at okfn.org> wrote:
>> Hello
>>
>> For data.gov we lowered the time to 3 mins.
>> The Redis consumer already notices duplicates on the gather queue, which
>> is why we could lower the time.  It could do a better job, though, of also
>> making sure it does not re-add tasks that are currently running.
>>
>> The other concern with adding items to the gather stage immediately is that
>> it is quite common for the harvesters and queue to be on a different server
>> than the web server.  It is also common for these to have no communication
>> with each other as all they do is look at the same db.  So whatever happens
>> here we will have to look at keeping the old behaviour too.
>>
>> Thanks
>>
>> David
>>
>>
>> On Tue, Jul 30, 2013 at 2:24 PM, David Read <david.read at hackneyworkshop.com>
>> wrote:
>>>
>>> Cheers Ross,
>>>
>>> David
>>>
>>> On 30 July 2013 13:37, Ross Thompson <ross.thompson.ca at gmail.com> wrote:
>>> > We (Canada/data.gc.ca) are not using the harvester at the moment, so
>>> > this
>>> > does not have an impact on us.
>>> >
>>> > Thanks.
>>> >
>>> >
>>> >
>>> > On 30 July 2013 04:56, David Read <david.read at hackneyworkshop.com>
>>> > wrote:
>>> >>
>>> >> I'm keen to make the CKAN harvester start immediately when you hit the
>>> >> 'refresh source' button. Currently it has to wait for the 'harvester
>>> >> run' cron which is usually configured to occur only every 15 minutes.
>>> >>
>>> >> I get plenty of support calls from organisations who are setting up a
>>> >> harvester and iterating through problems. They are further frustrated
>>> >> by this curious wait each time. Since the wait time is somewhat
>>> >> 'unknown' it also breeds distrust. And we all hate being made to
>>> >> context-switch away while debugging.
>>> >>
>>> >> From conversations with James, the 15 minute cron was originally
>>> >> designed to ensure that a harvest source couldn't be harvested more
>>> >> than once simultaneously and get confused. Since the harvest job
>>> >> becomes multiple gathers and further split into fetches, James was
>>> >> keen to make sure a source could only have one job for each run of the
>>> >> cron. BUT I'm sure we could get the backend (Redis if not RabbitMQ) to
>>> >> tell us the status of the job (and its related gather/fetches) and
>>> >> stop further ones being created until it is done.
>>> >>
>>> >> Would the various CKAN sites using the harvester (OKF, Fraunhofer,
>>> >> Canada, Washington etc.) be happy about this change? It would be on
>>> >> the CKAN 2.x code.
>>> >>
>>> >> Dave
>>> >>
>>> >> _______________________________________________
>>> >> ckan-dev mailing list
>>> >> ckan-dev at lists.okfn.org
>>> >> http://lists.okfn.org/mailman/listinfo/ckan-dev
>>> >> Unsubscribe: http://lists.okfn.org/mailman/options/ckan-dev



