[ckan-dev] ckanext-spatial and pycsw synchronization workflow

Adrià Mercader adria.mercader at okfn.org
Fri Nov 22 12:22:59 UTC 2013


Hi,

Thanks Ryan for sharing your experience on this front.


On 22 November 2013 05:38, Tom Kralidis <tomkralidis at gmail.com> wrote:
> On Wed, Nov 20, 2013 at 7:32 PM, Ryan Clark <ryan.clark at azgs.az.gov> wrote:
>> Tom -
>>
>> My approach does require a sync step, where a CKAN object is read from the database, passed through the Jinja template and the result is fed into a pycsw table. That pycsw table is in the same CKAN database, fwiw, but the table that pycsw reads from is not core-CKAN.
>>
If we set aside for a moment how the CKAN datasets were created in the
first place, I think both approaches are conceptually very similar.

1. You get a CKAN dataset dict (1a. Ryan gets it from the logic layer
on the relevant extension points, 1b. the script in ckanext-spatial
gets it from the API, but the dicts are exactly the same)
2. You pass it to a template to get an ISO xml document
3. You feed this xml doc into pycsw, which stores it in its table.
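The three steps above can be sketched roughly as follows. This is only an illustration: it uses stdlib string.Template as a stand-in for the actual Jinja template, and the ISO skeleton below is a placeholder, not the real ckanext-spatial template output.

```python
from string import Template

# Stand-in for the Jinja template (the real one renders a full ISO 19139 doc).
ISO_TEMPLATE = Template(
    """<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd">
  <gmd:fileIdentifier>$id</gmd:fileIdentifier>
  <gmd:title>$title</gmd:title>
  <gmd:abstract>$notes</gmd:abstract>
</gmd:MD_Metadata>"""
)

def dataset_to_iso(dataset_dict):
    """Step 2: render a CKAN dataset dict into an ISO XML string."""
    return ISO_TEMPLATE.substitute(
        id=dataset_dict["id"],
        title=dataset_dict["title"],
        notes=dataset_dict.get("notes", ""),
    )

# Step 1: a dataset dict as you'd get it from the API or the logic layer.
dataset = {"id": "abc-123", "title": "Minimum content", "notes": "A test dataset"}

# Step 3 would feed this XML into pycsw's table; here we just render it.
iso_xml = dataset_to_iso(dataset)
```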

Basically the only difference is in step 1: in 1a Ryan's approach gets
records into pycsw in real time, while in 1b it happens asynchronously
(once a day, or whenever you've configured your cron job). 1a is of
course nicer, but 1b makes the implementation much easier at the
beginning, and we can always move to 1a later on, so I would set that
aside for now.

The main issue remains step 2: how easy it is to obtain a valid ISO
document from a CKAN dict. Looking at Ryan's template, it only seems
to use id, title, notes, metadata_modified, tags, extent and the
resources' url, name and description. Besides these, other fields are
hardcoded that of course relate to that particular project.

It would be super useful to get a list of what fields a valid ISO doc
would need from CKAN (remember that CKAN supports adding arbitrary
extra fields besides the ones defined in the model)
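Once such a list exists, a pre-export check could be as simple as the sketch below. The field list here is a guess for illustration only; it is exactly the open question above, so treat it as a placeholder.

```python
# Hypothetical minimum set of CKAN fields needed to render a valid ISO doc;
# the actual list is the open question discussed above.
REQUIRED_FOR_ISO = ["id", "title", "notes", "metadata_modified", "tags"]

def missing_iso_fields(dataset_dict):
    """Return the fields a dataset dict still lacks for ISO export."""
    return [f for f in REQUIRED_FOR_ISO if not dataset_dict.get(f)]
```

A dataset with only a title would then report notes, metadata_modified and tags as missing.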


>
>> An extension like ckanext-spatial can be configured to run that synchronization every time a dataset is updated, which makes the integration pretty seamless, even though its not quite as ideal as I think you're imagining.
>>
>
> I think a logical next step might be to support local datasets via
> synchronization, which at least gets us closer to the ultimate goal?
> The question then becomes how do we sync local records? Convert a JSON
> to XML or Ryan's approach of reading from the CKAN database?
>
> Does ckan or ckanext-spatial for that matter allow for extra fields in
> the dataset model? In the case of pycsw reading straight off the CKAN
> db, pycsw would need a few more physical columns.
Yes, you can add as many arbitrary fields as you want to a CKAN dataset.
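For reference, these arbitrary fields come through the API as an "extras" list of key/value dicts, so a template (or sync script) can look them up with a small helper like this sketch (the example keys and values are illustrative):

```python
def get_extra(dataset_dict, key, default=None):
    """Look up an arbitrary extra field on a CKAN dataset dict.

    Via the API, extras arrive as a list of {"key": ..., "value": ...} dicts.
    """
    for extra in dataset_dict.get("extras", []):
        if extra["key"] == key:
            return extra["value"]
    return default

# Example dataset dict with two illustrative extras.
dataset = {
    "name": "minimum-content",
    "extras": [
        {"key": "spatial", "value": '{"type": "Polygon", "coordinates": []}'},
        {"key": "metadata-language", "value": "eng"},
    ],
}
```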


>> One case to keep in mind: When records are harvested, the ideal CSW implementation will turn those harvested XML records around and provide them as GetRecordByID responses completely unedited. If the XML record is converted to a CKAN package during harvest, then that CKAN package is re-converted to XML on CSW request, there is pretty much no chance that the document will be the exact same as what was harvested.
The harvested records are kept untouched in the CKAN db, so you could
always access those (this is basically what is happening in the
current implementation)

>>In that case, the approach CKAN currently takes is actually ideal. Then that leads you to, what do you do if a harvested record is edited in the CKAN interface?
>>
>
> Does CKAN allow for edited harvested records? What happens when they
> are reharvested from upstream?
By default CKAN won't prevent editing harvested records; it depends on
your authorization settings. But all sites that harvest datasets
prevent editing them from CKAN, as otherwise you are opening a can of
worms.
When you reharvest from upstream the stored XML doc gets updated and
the CKAN dataset is updated as well to reflect the changes.


>> The default criteria for a CKAN package is not sufficient to generate a valid ISO metadata record. I just generated this package: http://demo.ckan.org/dataset/minimum-content. I was required to enter two pieces of information: a title and a url. Here's the JSON serialization of that content: https://gist.github.com/rclark/7573839.
>>
You actually don't need the resources if you create the packages via
the API; you really just need a name for the dataset, so this could be
even slimmer :)
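As a sketch of that slimmer case, here is a package_create request built with only a name (the API key is a placeholder; the request is constructed but deliberately not sent):

```python
import json
from urllib import request

def build_package_create_request(ckan_url, api_key, name):
    """Build (but don't send) a CKAN package_create request.

    Only `name` is required when creating a dataset through the API.
    """
    payload = json.dumps({"name": name}).encode("utf-8")
    return request.Request(
        ckan_url.rstrip("/") + "/api/3/action/package_create",
        data=payload,
        headers={"Authorization": api_key, "Content-Type": "application/json"},
    )

req = build_package_create_request("http://demo.ckan.org", "my-api-key", "minimum-content")
# urllib.request.urlopen(req) would actually create the dataset.
```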

This is probably my fault for not having mentioned it before, as I
didn't want to distract from the discussion. I don't think it is
realistic to expect all CKAN datasets to be exportable to ISO docs.
You'll of course require a set of fields, and you'll want to run
validation on those. This implies a custom dataset type and form so
users have a proper way to create and edit these records. That
requires more work, and should be implemented in a way that is easy to
adapt, as users will want to tweak the fields and add their own.

All this should be the ideal final stage, but it should be kept in
mind in the whole scenario.

Ryan, did you use a custom form and dataset type (IDatasetForm) with
special fields required for the ISO documents, or is it just the
generic CKAN form? Can we see it in action somewhere?

Adrià


p.s. can't wait for the 23rd to have a look at your implementation :p
https://github.com/azgs/development/issues/3


>> Thanks for keying me in here -- sorry that this conversation somehow slipped under my radar.
>>
>> Ryan



More information about the ckan-dev mailing list