[ckan-dev] avoid harvesting duplicated datasets from different instances

Tue Oct 1 16:27:23 UTC 2019

Hello all,

I would like to harvest two different CSW instances:
- National scale:
http://gdk.gdi-de.org/gdi-de/srv/ger/csw?service=CSW&request=GetCapabilities
- State scale:
http://geoportal.bayern.de/csw/bvv?service=CSW&request=GetCapabilities

The tricky point is that, in theory, all the state data is harvested by the national instance; in practice, however, this is not the case.
So I need to harvest both instances, but I do not want to store/harvest the duplicated datasets. This means that the "gmd:fileIdentifier", which is the same for a given record in both instances, should be checked before the datasets are copied into the CKAN database.
Look at these two examples:

- National instance:
http://gdk.gdi-de.org/gdi-de/srv/ger/csw?service=CSW&request=GetRecordById&elementSetName=full&service=CSW&version=2.0.2&OutputSchema=csw:IsoRecord&id=e0eddd10-007a-11e0-be74-0000779eba3a

- State instance (Bavaria):
http://geoportal.bayern.de/csw/bvv?service=CSW&request=GetRecordById&elementSetName=full&service=CSW&version=2.0.2&OutputSchema=csw:IsoRecord&id=e0eddd10-007a-11e0-be74-0000779eba3a

Now my question is: is there any way to avoid harvesting redundant datasets?
For example, by adding a condition in the harvest source configuration, or by adding a few lines directly in the relevant part of the code?
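
To illustrate the kind of check I have in mind, here is a minimal standalone Python sketch. It is not CKAN harvester code: the function names file_identifier and skip_duplicates are made up for this example, and it assumes each record is available as an ISO 19139 XML string. It only shows the idea of remembering every "gmd:fileIdentifier" already seen and skipping records whose identifier comes up again.

```python
import xml.etree.ElementTree as ET

# Standard ISO 19139 namespaces used by gmd:fileIdentifier.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}


def file_identifier(iso_xml):
    """Return the gmd:fileIdentifier of an ISO record, or None if absent."""
    root = ET.fromstring(iso_xml)
    node = root.find(".//gmd:fileIdentifier/gco:CharacterString", NS)
    if node is None or not node.text:
        return None
    return node.text.strip()


def skip_duplicates(records, seen=None):
    """Keep only the first record for each fileIdentifier.

    records: iterable of (source_name, iso_xml) pairs (hypothetical shape).
    seen:    optional set of identifiers that were already harvested.
    Returns a list of (source_name, identifier) pairs that should be imported.
    """
    seen = set() if seen is None else seen
    kept = []
    for source_name, iso_xml in records:
        ident = file_identifier(iso_xml)
        if ident is not None and ident in seen:
            # Same record was already taken from another instance.
            continue
        if ident is not None:
            seen.add(ident)
        kept.append((source_name, ident))
    return kept
```

With the two example records above, which share the same identifier, only the one from whichever instance is processed first would be kept. Records without an identifier are always kept in this sketch, since they cannot be compared.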

Best regards
Mani