[ckan-dev] harvesting csw

Armin Retterath armin.retterath at gmail.com
Tue Feb 5 19:48:48 UTC 2013


Hello David,
thank you for the information; I will test it tomorrow. One small problem, though:
https://github.com/okfn/ckanext-harvest/blob/master/ckanext/harvest/queue.py
See line 153: consumer = get_consumer('ckan.harvert.fetch','harvest_object_id')
I think it should read 'ckan.harvest.fetch' - or is this the wrong
file? Did I check out the wrong branch?
With this error, no record will ever be fetched :-( .
What about the cache idea? Are any uuids cached in
the filesystem?
Regarding GeoNetwork's CSW: it works with a Lucene index and responds
really fast. We use it for distributed metadata search (in the central
geo-metadata catalogue for Germany - 117,000 records). So I think it is
the CKAN harvesting that has the performance problem.
Thanx a lot and have a nice evening,
Armin

2013/2/5 David Read <david.read at hackneyworkshop.com>:
> Armin,
>
> 1. On ckanext-spatial there have been some model incompatibilities on
> the master branch last week which you may have caught. Update to
> latest (c6ac949) and retry. If it is not that, then check your fetch
> paster command-line specifies the correct CKAN config file. If still
> not working, send us the exception, providing more info.
>
> 3500 XML files shouldn't take more than an hour or so, I'd have thought.
> Each one requires one request to the GeoNetwork server, a store to the
> database, a retrieval from the database, XML parsing and validation, and
> storage as a CKAN package in the database. The limiting factor is most
> likely the CSW get from your server.
>
> 2. All changes are tracked, as in any wiki, partly so that you can
> 'undelete' a deleted dataset, so the name stays reserved going forward.
> Options:
> * Try doing an undelete (with package_update, I guess)
> * Purge the old one and recreate - not possible, since no purge API has
> been written yet (shame) http://trac.ckan.org/ticket/1832
> * Before deleting the old one, rename it out of the way.
>
> David
>
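A minimal sketch of the last option (rename the old package out of the
way, then delete it), written as a plan of action-API calls rather than
real HTTP requests. The helper name plan_rename_then_delete and the
"-old" suffix are hypothetical, and the sketch assumes that
package_update accepts the full package dict and identifies the package
by its id, so it should be checked against the CKAN version in use.

```python
import copy


def plan_rename_then_delete(package, suffix="-old"):
    """Return the ordered (action, payload) calls that free up package['name'].

    Hypothetical helper: it only builds the payloads, it sends nothing.
    `package` is assumed to be the dict returned by package_show.
    """
    renamed = copy.deepcopy(package)  # leave the caller's dict untouched
    renamed["name"] = package["name"] + suffix
    return [
        # 1. rename first, so the original name is no longer reserved
        ("package_update", renamed),
        # 2. then mark the renamed package as deleted
        ("package_delete", {"id": package["id"]}),
    ]


steps = plan_rename_then_delete({"id": "abc", "name": "wms-1", "title": "T"})
for action, payload in steps:
    print(action, payload["id"])
```

Each step would then be sent as a POST to the action API with the
user's API key; after that a new package can be created under the
original name.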
> On 5 February 2013 07:11, Armin Retterath <armin.retterath at gmail.com> wrote:
>> hello list,
>>
>> first question:
>> i'm trying to harvest csw (geonetwork based) into a local ckan 1.8.1b
>> instance. i use the master branch of ckan and the master
>> ckanext-harvest branch and different branches for ckanext-spatial.
>> the harvesting itself seemed to cause no problems last thursday.
>> the gathering pulled the right ids and the fetching put the datasets
>> into the db. only some validation problems occurred.
>> when i tried the same yesterday, the fetch queue threw errors
>> because in ckanext-harvest/ckanext/harvest/queue.py the harvest
>> object was not found! it seems to be a problem with pulling the
>> harvest object from the database. the fetch_consumer pulls some uuids
>> which no longer exist (i deleted the table entries in the database
>> and reindexed solr!). is there a buggy cache of harvest object uuids
>> somewhere which has to be cleared?
>>
>> the harvesting is very slow. we need to harvest nearly 3500 different
>> iso19139 xml files for webmapservices and want to show them in ckan
>> ;-) .
>>
>> second question:
>> alternatively i thought about going the other (better) way and using a
>> push to publish datasets from our registry to ckan via the api (2+3).
>> creation and deletion seem to be possible via the different apis (2
>> and 3-action for delete). some further problems exist: if i delete a
>> package via the action interface the package is not really deleted;
>> only its state attribute is set to "deleted". when i want to recreate
>> the object with the same name (maybe the local uuid of our registry) i
>> get a 403! maybe the package is no longer owned by my user? how can
>> this be prevented? or can i delete the whole package so that i can
>> recreate it afterwards? how can i get only those packages that i have
>> created myself via the api?
>>
>> i think the push way is better for keeping the information in sync :-)
>>
>> thanx in advance
>>
>> armin
>>
>> _______________________________________________
>> ckan-dev mailing list
>> ckan-dev at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/ckan-dev
>> Unsubscribe: http://lists.okfn.org/mailman/options/ckan-dev



