[ckan-dev] CSW harvesting update

William Waites ww at styx.org
Sun Feb 6 13:43:16 UTC 2011


Just an update re: feature 885, using owslib to talk to
a CSW server instead of our own custom client. I've done
a partial merge in the feature-885-owslib branch. This
removes ckan/lib/cswclient.py and its paraphernalia and
changes the harvesting code to use the owslib implementation.

This is "partial" because though it uses the owslib client
to talk to the server, it still uses John's code to parse
the result and turn it into a CKAN package. This parsing
and transforming code should probably be replaced but
needs to be done carefully because the UKLII requirements
are stricter than generic ISO19139. This means that we
cannot use this to harvest from just any CSW service:
the Dutch national registry, for example, serves things
that we would consider invalid. Ideally we could handle
this gracefully and just have some stricter checking
for UK purposes which will come from the schematron 
validation step, but the way things are laid out this
is difficult to do directly. So as the validation hasn't
been implemented yet, we parse the document twice as an
interim measure.

The immediate benefit of using the owslib client is that
it lets us page through results, which are not all returned
in one request. This is, of course, critical as without
it we would get some fixed server-specific number of 
records and miss the rest. This now works, but can still
be improved. At the moment the code makes one request to
get the brief record descriptions for their identifiers
and then a separate request for the detail of each
record. It would be better to fetch the details in
larger batches, page by page, since it would be nicer of
us to make fewer requests to the services that we are
aggregating.
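
To make the paging idea concrete, here is a rough sketch of the
loop, not the code in the branch. The names iter_records,
fetch_page and make_fake_server are made up for illustration.
fetch_page stands in for whatever wraps the real client call and
is assumed to return the records of one page, the server's
nextRecord value and the total number of matches. The stopping
rule assumes the server reports nextRecord as 0 once everything
has been returned.

```python
def iter_records(fetch_page, page_size=10):
    """Yield every record from a paged service, one page per request.

    fetch_page(start, page_size) is a hypothetical callable that
    returns (records, next_record, matches) for a single request.
    """
    start = 1  # CSW start positions are 1-based
    while True:
        records, next_record, matches = fetch_page(start, page_size)
        for record in records:
            yield record
        # Stop on an empty page, on nextRecord == 0 (assumed to mean
        # "nothing left"), or if the server would not move us forward.
        if not records or next_record == 0:
            break
        if next_record <= start or next_record > matches:
            break
        start = next_record


def make_fake_server(identifiers):
    """Build an in-memory stand-in for a CSW server (illustration only)."""
    def fetch_page(start, page_size):
        chunk = identifiers[start - 1:start - 1 + page_size]
        next_record = start + len(chunk)
        if next_record > len(identifiers):
            next_record = 0  # all records have been returned
        return chunk, next_record, len(identifiers)
    return fetch_page


if __name__ == "__main__":
    ids = ["rec-%d" % i for i in range(1, 26)]
    harvested = list(iter_records(make_fake_server(ids), page_size=10))
    print(len(harvested))
```

With a loop like this, fetching full records rather than brief
ones in each page would cost one request per page instead of one
request per record.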

James, I looked in some detail at your patch to owslib
and we didn't seem to need it, apart from a non-critical
passage that changes owslib's idea of which etree
implementation to use. They make a different choice from
us, preferring, in order, the external elementtree, the
internal elementtree and then lxml's, whereas we use lxml
unconditionally. I should note that while the APIs are
similar they are not fully compatible: some things, like
the pretty_print argument to etree.tostring(), are only
supported by lxml. I attach this small patch to this
message, and am trying to track down Sean Gillies to see
what his opinion of it is. If you could test this branch
in your environment today, before your meetings tomorrow,
to make sure it does what you expect, it would be appreciated.
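
For anyone wondering why the choice of etree implementation
matters, the following is a small sketch of the incompatibility,
not the attached patch. It prefers lxml and falls back to the
standard library's ElementTree. The HAVE_LXML flag and the
to_string helper are names invented for this example.

```python
try:
    # Preferred: lxml, which accepts pretty_print in tostring().
    from lxml import etree
    HAVE_LXML = True
except ImportError:
    # Fallback: the standard library's ElementTree, which does not.
    import xml.etree.ElementTree as etree
    HAVE_LXML = False


def to_string(element):
    """Serialise an element, pretty-printing only when lxml is in use."""
    if HAVE_LXML:
        data = etree.tostring(element, pretty_print=True)
    else:
        # Passing pretty_print here would raise a TypeError.
        data = etree.tostring(element)
    return data.decode("utf-8")


if __name__ == "__main__":
    root = etree.fromstring("<record><title>example</title></record>")
    print(to_string(root))
```

Code that calls etree.tostring(..., pretty_print=True) directly
only works if the module bound to the name etree is lxml's, which
is why the import order inside owslib matters to us.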

Next on my plate here are the schematron validation and
a CSW server. The latter I think would be best implemented
as a CKAN client, essentially a standalone proxy for the
API.

Cheers,
-w
-- 
William Waites                <mailto:ww at styx.org>
http://river.styx.org/ww/        <sip:ww at styx.org>
F4B3 39BF E775 CF42 0BAB  3DF0 BE40 A6DF B06F FD45
