[ckan-dev] harvesting csw

Wed Feb 6 15:52:53 UTC 2013

Armin,

How long are the 3500 records taking, for you to consider it slow?

Yes, the misspelling "harvert" in the queue name is fine: the same
misspelling is used both when creating the queue and when getting
callbacks from it, so it still works.
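
To illustrate why the misspelling is harmless, here is a minimal in-memory
sketch. It is not the ckanext-harvest code: FakeBroker, publish and consume
are made-up names for illustration only. A queue name is just a lookup key,
so any string works as long as the publisher and the consumer agree on it:

```python
from collections import defaultdict, deque


# Hypothetical stand-in for a message broker; not the real ckanext-harvest API.
class FakeBroker:
    def __init__(self):
        self._queues = defaultdict(deque)

    def publish(self, queue_name, message):
        self._queues[queue_name].append(message)

    def consume(self, queue_name):
        queue = self._queues[queue_name]
        return queue.popleft() if queue else None


# The misspelt name from queue.py, used consistently on both sides.
FETCH_QUEUE = "ckan.harvert.fetch"

broker = FakeBroker()
broker.publish(FETCH_QUEUE, "object-id-1")
same_name = broker.consume(FETCH_QUEUE)  # delivered: both sides agree

# A mismatch would only appear if one side were "corrected" on its own.
broker.publish(FETCH_QUEUE, "object-id-2")
other_name = broker.consume("ckan.harvest.fetch")  # nothing to deliver
```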

Your issue seems to be with IDs. If Adrià's suggestion about flushing
old IDs doesn't help, all I can suggest is trying to find the problem
from the log output. It works fine for several CKANs around the world
with different teams, so I doubt there is anything fundamentally wrong
with the ckanext-harvest or ckanext-spatial code, and we've checked
you're on the same commit.
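
As a rough illustration of the stale-ID situation Adrià describes in the
quoted reply, here is a small sketch. The dict and the deque are stand-ins
I am assuming for the harvest object table and the fetch queue; they are
not the real ckanext-harvest data model:

```python
from collections import deque

# Hypothetical stand-ins: a dict for the harvest object table and a deque
# for the fetch queue.
harvest_objects = {"id-1": "<xml 1/>", "id-2": "<xml 2/>"}
fetch_queue = deque(harvest_objects)  # the gather stage queues every id

# Deleting the rows from the table does not touch the queue.
harvest_objects.clear()


def drain(queue, table):
    """Consume every queued id and return (fetched, missing) id lists."""
    fetched, missing = [], []
    while queue:
        object_id = queue.popleft()
        if object_id in table:
            fetched.append(object_id)
        else:
            missing.append(object_id)  # the "harvest object not found" case
    return fetched, missing


fetched, missing = drain(fetch_queue, harvest_objects)
```

Every id ends up in the missing list here, which is why the queue needs
clearing whenever the table is cleared.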

David

On 6 February 2013 11:06, Adrià Mercader <adria.mercader at okfn.org> wrote:
> Hi Armin,
>
> Glad to hear you are giving CKAN a go.
>
> See comments below:
>
> On 5 February 2013 07:11, Armin Retterath <armin.retterath at gmail.com> wrote:
>> hello list,
>>
>> first question:
>> i'm trying to harvest csw (geonetwork based) into a local ckan 1.8.1b
>> instance. i use the master branch of ckan and the master
>> ckanext-harvest branch and different branches for ckanext-spatial.
>> the harvesting itself seemed to cause no problems last week on thursday.
>> the gathering pulls the right ids and the fetching puts the dataset
>> into the db. only some problems with validation occur.
>> when i tried the same yesterday, the fetching queue threw errors
>> because in ckanext-harvest/ckanext/harvest/queue.py the harvest
>> object could not be found! it seems to be a problem with pulling the
>> harvest object from the database. the fetch_consumer pulls some uuids which
>> don't exist any longer (i deleted the table entries in the database
>> and reindexed solr!). is there a buggy cache of harvest object uuids
>> somewhere which has to be cleared?
> There is no cache. The gather stage will create and store harvest
> objects in the database and send their ids to the fetch queue.
> If you stop the process and delete the table entries, the object ids
> are still in the queue and when the consumer is started again it will
> receive ids which are no longer in the database, thus failing.
> If for some reason you cleared the database table, you need to clear
> the queue to get rid of the old ids (you can do this with rabbitmqctl;
> in CKAN 2.0 we have added a command to make it easier).
>
>>
>> the harvesting is very slow. we need to harvest nearly 3500 different
>> iso19139 xml files for webmapservices and want to show them in ckan
>> ;-) .
> The harvesting time, as David pointed out, will depend on the
> requesting of documents, parsing of the XML files and creation of CKAN
> datasets.
> Surely the harvesters can be improved in some ways, and for CKAN 2.0 we
> have refactored them to make them faster and more efficient, but 3500
> datasets should work fine, even if they take a while to be harvested.
>
>>
>> second question:
>> alternatively i thought about going the other (better) way and using a
>> push to publish datasets from our registry to ckan via the api (2+3).
>> creation and deletion seem to be possible via the different apis (2
>> and 3-action for delete).
> You should be able to perform all operations with just the version 3
> API, if not it is a bug.
>
>> some further problems exist: if i delete a
>> package via the action interface the package is not really deleted but the
>> attribute state is set to "deleted". when i want to recreate the object
>> with the same name (maybe the local uuid of our registry) i get a 403!
>> maybe the package is not owned by my user any longer? how can this be
>> prevented?
> You are not allowed to create a dataset with an existing name.
> Datasets are revisioned and changes can be rolled back, so we need to
> keep track of them.
>
>> or can i delete the whole package so that i can recreate
>> it afterwards?
> You can delete and recreate, or update the existing one.
> Maybe a purge API call would indeed be useful.
>
>
> On 5 February 2013 19:48, Armin Retterath <armin.retterath at gmail.com> wrote:
>> Hello David,
>> thank you for the information, i will test it tomorrow. But a little problem:
>> https://github.com/okfn/ckanext-harvest/blob/master/ckanext/harvest/queue.py
>> See line 153: consumer = get_consumer('ckan.harvert.fetch','harvest_object_id')
>> It should read 'ckan.harvest.fetch' I think - or is this the wrong
>> file? Did I check out the wrong branch?
>> No one will fetch any record with this error :-( .
> Although unfortunate, this typo (which will be fixed in future
> versions of CKAN) should not affect the harvesting, as it has
> been in the codebase for a long time.
>
>> What is with the cache idea? Are there some uuids which are cached in
>> the filesystem?
> No, see above.
>
>
>> Something about GeoNetwork's CSW: it works with a Lucene index and serves
>> really fast. We use it for distributed metadata search (in the central
>> geo-metadata catalogue for Germany, 117,000 records). I think that the
>> harvesting of CKAN has a performance problem.
> CKAN also uses Solr as search index, sometimes with a large number of
> datasets, and it is indeed very fast.
> But searching is not harvesting. As mentioned before, harvesting
> involves more time-consuming tasks, which we are doing our best to
> improve.
>
>
> Hope this helps,
>
>
> Adrià
>
>
>> Thanx a lot and have a nice evening,
>> Armin
>>
>> 2013/2/5 David Read <david.read at hackneyworkshop.com>:
>>> Armin,
>>>
>>> 1. On ckanext-spatial there have been some model incompatibilities on
>>> the master branch last week which you may have caught. Update to
>>> latest (c6ac949) and retry. If it is not that, then check your fetch
>>> paster command-line specifies the correct CKAN config file. If still
>>> not working, send us the exception, providing more info.
>>>
>>> 3500 XML files shouldn't take more than an hour or so, I'd have thought.
>>> Each one requires one request to the GeoNetwork server, storing in the
>>> database, retrieval from the database, XML parsing and validation, and
>>> storing as a CKAN package in the database. The limiting factor is most
>>> likely the CSW get from your server.
>>>
>>> 2. All changes are tracked, as in any wiki, to enable you to
>>> 'undelete' a deleted dataset, so the name is reserved going forward.
>>> Options:
>>> * Try doing an undelete (with package_update, I guess)
>>> * Purge the old one and recreate - not possible since purge API has
>>> not been made (shame) http://trac.ckan.org/ticket/1832
>>> * Before deleting the old one, rename it out of the way.
>>>
>>> David
>>>
>>> On 5 February 2013 07:11, Armin Retterath <armin.retterath at gmail.com> wrote:
>>>> hello list,
>>>>
>>>> first question:
>>>> i'm trying to harvest csw (geonetwork based) into a local ckan 1.8.1b
>>>> instance. i use the master branch of ckan and the master
>>>> ckanext-harvest branch and different branches for ckanext-spatial.
>>>> the harvesting itself seemed to cause no problems last week on thursday.
>>>> the gathering pulls the right ids and the fetching puts the dataset
>>>> into the db. only some problems with validation occur.
>>>> when i tried the same yesterday, the fetching queue threw errors
>>>> because in ckanext-harvest/ckanext/harvest/queue.py the harvest
>>>> object could not be found! it seems to be a problem with pulling the
>>>> harvest object from the database. the fetch_consumer pulls some uuids which
>>>> don't exist any longer (i deleted the table entries in the database
>>>> and reindexed solr!). is there a buggy cache of harvest object uuids
>>>> somewhere which has to be cleared?
>>>>
>>>> the harvesting is very slow. we need to harvest nearly 3500 different
>>>> iso19139 xml files for webmapservices and want to show them in ckan
>>>> ;-) .
>>>>
>>>> second question:
>>>> alternatively i thought about going the other (better) way and using a
>>>> push to publish datasets from our registry to ckan via the api (2+3).
>>>> creation and deletion seem to be possible via the different apis (2
>>>> and 3-action for delete). some further problems exist: if i delete a
>>>> package via the action interface the package is not really deleted but the
>>>> attribute state is set to "deleted". when i want to recreate the object
>>>> with the same name (maybe the local uuid of our registry) i get a 403!
>>>> maybe the package is not owned by my user any longer? how can this be
>>>> prevented? or can i delete the whole package so that i can recreate
>>>> it afterwards? How can i get only those packages that i have created
>>>> by my own via the api?
>>>>
>>>> i think the push way is better to hold the information in sync :-)
>>>>
>>>> thanx in advance
>>>>
>>>> armin
>>>>
>>>> _______________________________________________
>>>> ckan-dev mailing list
>>>> ckan-dev at lists.okfn.org
>>>> http://lists.okfn.org/mailman/listinfo/ckan-dev
>>>> Unsubscribe: http://lists.okfn.org/mailman/options/ckan-dev