[ckan-dev] staged schemas: extending navl schema for validation in stages

Ian Ward ian at excess.org
Tue Jun 11 15:45:12 UTC 2013


Hello,

In my CKAN deployment I have dataset validation rules that include
things like "fields X, Y and Z are required if field Q is True", and I
want to include the list of missing fields in the validation failure
message for Q because those fields might not be on the same screen
(fields in resources, for example). This kind of validation is much
easier if I can specify an ordering for my validators.  If I can
arrange to validate Q after the other fields, I don't have to repeat
within Q's validator the same validation tests that I do in each of X,
Y and Z.
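
As a rough illustration of the kind of validator I mean, here is a
minimal sketch in the navl (key, data, errors, context) style. The
field names q, x, y, z and the function name are placeholders, and
treating any falsy value as "missing" is a simplification of what navl
really does.

```python
# Hypothetical late-stage validator; field names are placeholders.
REQUIRED_WHEN_Q = ['x', 'y', 'z']


def q_requires_fields(key, data, errors, context):
    """Report on Q every dependent field that is still empty."""
    if not data.get(key):
        # Q is not set, so nothing else is required
        return
    # flattened navl data uses tuple keys such as ('x',)
    missing = [name for name in REQUIRED_WHEN_Q if not data.get((name,))]
    if missing:
        errors.setdefault(key, []).append(
            'Missing required fields: ' + ', '.join(missing))


data = {('q',): True, ('x',): 'set', ('y',): '', ('z',): None}
errors = {}
q_requires_fields(('q',), data, errors, {})
```

Because this runs after X, Y and Z have had their own validators
applied, it only has to look at the results and never repeats their
checks.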

I also notice that this pattern exists in CKAN itself.
ckan.lib.navl.dictization_functions._validate() has one run for
"__before" validators, one for the "normal" validators, one for the
"__extras" validators and one for the "__after" validators.  There are
more transformations done on the data later still. In
ckan.lib.dictization.model_dictize.resource_dictize() the url and
format fields are modified by special logic that can't be overridden.
In ckan.lib.dictization.model_save() the id field is set to a new UUID
when not provided. It would be nicer if these changes were part of a
late stage of the validation schema.  As part of the schema this
business logic would be all in one place where it can be discovered
and extended when necessary.

The current schema is not "flat": it has an (undocumented) ordering.
Why don't we make that ordering explicit?  Here is the sort of schema
I would like to be able to provide in my extension:

rocky_road_schema = [
    {
        # my normal validators: keys are field names and values are
        # validator lists
        10: { .. },
        # my late stage validators
        20: { .. },
        # assuming id is assigned at stage 50
        51: {'name': [copy_from_id_when_missing]},
    },
    toolkit['get_vanilla_schema'],
]

where the numbers represent the stage of validation and the code in
ckan.lib.navl.dictization_functions._validate() would include
something like:

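# stage_schema is the list of staged schemas passed in
# (e.g. rocky_road_schema); entries earlier in the list win
merged_stage_schema = {}
for inherit_schema in reversed(stage_schema):
    # stages completely replace those from inherited schemas
    merged_stage_schema.update(inherit_schema)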

for stage, schema in sorted(merged_stage_schema.items()):
    full_schema = make_full_schema(data, schema)
    # the rest is much like what already exists:
    for key in sorted(full_schema, key=flattened_order_key):
        for converter in full_schema[key]:
            try:
                convert(converter, key, converted_data, errors, context)
            except StopOnError:
                break
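
The merge-and-run loop above can be tried out in isolation. The
following is only a sketch under simplifying assumptions: plain dicts
instead of flattened navl data, validators that take (key, data,
errors), and no make_full_schema or StopOnError handling. The helper
names merge_stages, run_stages and set_id are made up for this
example.

```python
def merge_stages(stage_schemas):
    """Merge a list of staged schemas; earlier entries override later ones."""
    merged = {}
    for inherited in reversed(stage_schemas):
        # a stage completely replaces the same stage from inherited schemas
        merged.update(inherited)
    return merged


def run_stages(stage_schemas, data):
    """Apply each stage's validators in ascending stage order."""
    errors = {}
    for stage, schema in sorted(merge_stages(stage_schemas).items()):
        for key in sorted(schema):
            for validator in schema[key]:
                validator(key, data, errors)
    return data, errors


def set_id(key, data, errors):
    # stand-in for "assign a new UUID when not provided"
    data.setdefault(key, 'generated-id')


def copy_from_id_when_missing(key, data, errors):
    if not data.get(key):
        data[key] = data['id']


vanilla = {10: {'name': []}, 50: {'id': [set_id]}}
mine = {51: {'name': [copy_from_id_when_missing]}}
data, errors = run_stages([mine, vanilla], {})
```

Here stage 51 can rely on id being present because stage 50 has
already run, which is the whole point of making the ordering explicit.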

Then we can move much of the custom logic in the dictization functions
into the 'vanilla schema', which we can document along with the stage
numbers used.

This change could be made while maintaining backwards compatibility.
The new schema is a list of "staged schema" dicts (each one maps a
stage number to a navl schema dict), while the current schema is a
single navl dict. We could just check the type being passed (list vs.
dict) for as long as we need to support the current schemas.
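
The type check could look something like this sketch. The function
name as_staged_schemas and the default stage number 10 are
assumptions for illustration. A fuller version would presumably also
split the existing "__before", "__extras" and "__after" keys into
their own stages rather than leaving them in one dict.

```python
def as_staged_schemas(schema, default_stage=10):
    """Return schema as a list of staged schemas.

    A plain dict is treated as a current-style navl schema and wrapped
    in a single stage; default_stage is a hypothetical choice.
    """
    if isinstance(schema, list):
        # already in the new format
        return schema
    if isinstance(schema, dict):
        return [{default_stage: schema}]
    raise TypeError('schema must be a list or a dict')


old_style = {'name': []}
new_style = [{10: {'name': []}, 20: {'title': []}}]
wrapped = as_staged_schemas(old_style)
unchanged = as_staged_schemas(new_style)
```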

Comments?


Aside:
I have an eventual goal of making these schemas representable in JSON.
One missing part is being able to use strings for the
validators/converters.  That mostly exists with
toolkit['get_validator'] / toolkit['get_converter'] but we would need
a way to register new validators from a plugin.  For inheritance to
work we would also need a way to register staged schemas and reference
them as strings in the top level list.
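
To make the JSON idea a little more concrete, here is a sketch of
looking validators up by name. The registry, register_validator and
resolve_staged_schema are hypothetical names, not existing CKAN APIs.
Note that JSON object keys are always strings, so the stage numbers
have to be converted back to integers to sort correctly.

```python
import json

# hypothetical registry that plugins would add their validators to
_validators = {}


def register_validator(name, func):
    _validators[name] = func


def resolve_staged_schema(json_text):
    """Turn a JSON staged schema into stage -> field -> validator list."""
    raw = json.loads(json_text)
    return {
        int(stage): {
            field: [_validators[name] for name in names]
            for field, names in fields.items()
        }
        for stage, fields in raw.items()
    }


def not_empty(key, data, errors, context):
    if not data.get(key):
        errors.setdefault(key, []).append('Missing value')


register_validator('not_empty', not_empty)
staged = resolve_staged_schema('{"10": {"title": ["not_empty"]}}')
```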



