[ckan-discuss] Multiple package schemas

Thu Oct 7 19:13:30 BST 2010

On 7 Oct 2010, at 17:19, David Read wrote:
> Apologies, I didn't mean to suggest you were doing anything wrong -

No need to apologise, I didn't take it that way.

> I meant to say that in those two cases you *specialise* the use of the
> field, rather than go against any guidance. CKAN has been pretty
> free-form so far, to let people establish use cases. We have an aim to
> guide people more and more as usage becomes clearer, and constrain the
> form to get more codified input, to bring us towards our aim of
> automatable use of the metadata.

I understand the approach and agree with it.

> * a button to add more 'extra' fields

Yes!

> * autocomplete in extra field name, so the user can see existing names
> and be guided to them or give him confidence in creating a new one
> and as I mentioned before:

Yes! And an obvious place (on the Wiki?) where extra fields currently  
in use can be listed/proposed/discussed. Could such a link be placed  
directly on the form, in the extra fields section?

> * better faceted browse / search by extra field keys.

Yes to search by extra field key.

> But I'm a bit stumped about how to achieve "format"="dc" *and*
> "format"="foaf" in extra fields. Could we allow multiple keys the
> same? This is once again another RDF precedent...

I don't know. Maybe I should have just proposed "format"="dc,foaf" or  
something like that. But would that be considered good CKAN practice?  
Could it mess up search? Please simply take this as an example of the  
kind of issue that drives users towards using tags -- because of their  
simple free-form nature, users don't have to grapple with questions  
like this one.
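
To make the search concern concrete, here is a minimal sketch of handling a comma-separated value such as "format"="dc,foaf". It assumes extras are available as a plain dict of strings; the helper names split_values and has_value are made up for illustration and are not CKAN API.

```python
def split_values(raw):
    """Split a comma-separated extra-field value into clean, lower-case tokens."""
    return [part.strip().lower() for part in raw.split(",") if part.strip()]


def has_value(extras, key, wanted):
    """True if `wanted` is one of the comma-separated values stored under `key`."""
    return wanted.lower() in split_values(extras.get(key, ""))


# Hypothetical package extras, using the example from this thread.
extras = {"format": "dc,foaf"}

print(has_value(extras, "format", "foaf"))  # exact token match
print(has_value(extras, "format", "oaf"))   # not a token, so no match
print("oaf" in extras["format"])            # naive substring search still matches
```

The last line shows how a plain substring search over the stored text would report a false match, which is the kind of problem a packed value could cause unless search splits the value first.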

> Predefined values for extra fields would be included in the extra
> field schema.

Sorry, I may have expressed myself poorly. I was arguing for
pre-defined values for a *standard* field, namely the Format field on
resources (not to be confused with the format-xyz tags we discussed
above).

Knowing that a resource URL is actually a SPARQL protocol endpoint  
(indicated by Format value "api/sparql") is a big deal for us, and an  
important entry point for automation. Because it's a free text field,  
we see all sorts of funny variations that prevent us from reliably  
identifying the SPARQL endpoints: "API/SPARQL", "api/sparql (doesn't  
support DESCRIBE)", "api sparql", "aip/spralq", etc etc. So having  
them pick "api/sparql" (or "SPARQL protocol endpoint (api/sparql)")  
from a list would significantly improve data quality.
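
As a rough illustration of why free text is hard to clean up after the fact, the sketch below tries to map the variations listed above onto a canonical value. The list of known formats and the function name normalise_format are assumptions made for this example, not anything CKAN provides.

```python
import re

# Assumed list of accepted values; only values mentioned in this thread.
KNOWN_FORMATS = {"api/sparql", "api/search", "application/rdf+xml"}


def normalise_format(raw):
    """Return the canonical format value, or None if it cannot be recognised."""
    value = raw.strip().lower()
    # Drop a trailing "(...)" remark such as "(doesn't support DESCRIBE)".
    value = re.sub(r"\s*\(.*\)\s*$", "", value)
    # Treat "api sparql" as "api/sparql" when no slash was typed at all.
    if "/" not in value:
        value = re.sub(r"\s+", "/", value)
    return value if value in KNOWN_FORMATS else None


for raw in ["API/SPARQL", "api/sparql (doesn't support DESCRIBE)",
            "api sparql", "aip/spralq"]:
    print(repr(raw), "->", normalise_format(raw))
```

Case, remarks in brackets and missing slashes can be repaired this way, but a typo like "aip/spralq" comes back as None. That case is only avoided if the value is picked from a list in the first place.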

But as I said, I do realize that a lot of these formats are hardly
used outside of the LOD realm, so the challenge is how to offer
pre-defined values for LOD users while not constraining/confusing users
from other communities.

That's why I was proposing that a “schema” could optionally declare a  
number of pre-defined format values, and when someone edits a package  
with that schema in effect, then these format values would be  
available via auto-complete or combo box.
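
A minimal sketch of that idea, under the assumption that a schema is just a named object carrying an optional list of format values. The Schema class and format_suggestions function are hypothetical and only meant to show the shape of the proposal.

```python
from dataclasses import dataclass, field


@dataclass
class Schema:
    """Hypothetical schema: a name plus optional pre-defined format values."""
    name: str
    format_choices: list = field(default_factory=list)


def format_suggestions(schemas, typed=""):
    """Collect format values from all schemas in effect for a package,
    keeping only those that contain what the user has typed so far."""
    suggestions = []
    for schema in schemas:
        for choice in schema.format_choices:
            if choice not in suggestions and typed.lower() in choice.lower():
                suggestions.append(choice)
    return suggestions


lod = Schema("lod", ["api/sparql", "application/rdf+xml"])
default = Schema("default")  # declares no format values

print(format_suggestions([lod], "spa"))      # LOD users are offered api/sparql
print(format_suggestions([default], "spa"))  # other users see no suggestions
```

Packages without the LOD schema get an empty list, so the form can fall back to plain free text for them.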

> As for the other current form fields, the value field
> could be represented by a combo-box, date field, check-box, multiple
> input boxes etc. However you input this field, the form will
> translate it into a corresponding text value that is actually stored
> in the db.

Ok, got it, this will be handy.
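
For illustration, a sketch of what such a translation step might look like. The function name to_stored_text and the chosen text encodings ("true"/"false", ISO dates, comma-joined lists) are assumptions for this example and do not describe how CKAN actually stores values.

```python
import datetime


def to_stored_text(value):
    """Turn a value coming from a form widget into the text kept in the db."""
    if isinstance(value, bool):            # check-box
        return "true" if value else "false"
    if isinstance(value, datetime.date):   # date field
        return value.isoformat()
    if isinstance(value, (list, tuple)):   # multiple input boxes
        return ",".join(str(item) for item in value)
    return str(value)                      # combo-box or plain text input


print(to_stored_text(True))
print(to_stored_text(datetime.date(2010, 10, 7)))
print(to_stored_text(["dc", "foaf"]))
print(to_stored_text("api/sparql"))
```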

> As for ordering these extra fields to get the author_url field near to
> the author and author_email fields, I wonder if this problem could be
> solved instead by making all of these fields compulsory? If we are
> going to allow multiple 'extra field schemas' per package, then the
> fields would have to be grouped on the package edit page, to make
> sense I think.

Sorry, you lost me here. What do you mean by "compulsory" above?

Richard

>
> David
>
> On 7 October 2010 16:57, Richard Cyganiak <richard at cyganiak.de> wrote:
>> On 7 Oct 2010, at 09:23, David Read wrote:
>>>
>>> Richard's guidance doesn't contradict any of our core field guidance,
>>
>> That's deliberate.
>>
>>> apart from in these cases:
>>>
>>> * he gives more specific instructions for a couple of the resource
>>> fields: the format field has suggested values like
>>> "application/rdf+xml" which is in fact two pieces of data - the
>>> purpose of the download (e.g. the application, an example, meta-info,
>>> download_page) and the format itself. These would be better in
>>> separate columns.
>>
>> Rufus has stated at some point that the content of the format field should
>> be an Internet Media Type [1], and he encouraged the use of made-up
>> “pseudo-types” like “api/search”. So I blame the idea on him ;-)
>>
>> I agree that having a “format” field (with media type as value where
>> possible) and a separate “purpose” or “type” field with values such as
>> “Download”, “Example”, “Schema”, “Documentation”, “API” would be good.
>>
>>> * he suggests adding a number of tags according to the properties of
>>> the package. I think these would be better stored as extra fields,
>>
>> Again, things like the “format-rdf” tag were already widely used on CKAN
>> before we started, so again I don't accept the blame ;-)
>>
>>> I think he (and others) have chosen tags over extra fields, because tags
>>> are easier to browse/search on CKAN.
>>
>> That's not the main reason. I think the main reasons for choosing tags over
>> custom fields are:
>>
>> 1. Tags are more “lightweight”. Coining a new custom field can be a bit
>> scary, because it feels like we might perhaps be “polluting” the space of
>> field names. Tags are free-form, so there is less concern about coining new
>> ones.
>>
>> 2. There is no way (as far as I can see) to check if a given custom field
>> name has already been used elsewhere, so if I use a “format” or “topic”
>> custom field I don't know if I'm stepping on someone else's toes.
>>
>> 3. Working with custom fields is quite awkward because of the three-fields
>> limitation in the form.
>>
>> 4. Custom fields are single-value, so you can't say "format"="dc" *and*
>> "format"="foaf".
>>
>> I'm not sure what this implies for the design of CKAN, just sharing
>> experience.
>>
>>> If we resolved these two points, I think the LOD use case would
>>> suggest a schema that just describes extra fields.
>>
>> Not quite. I think that some things can't be solved with just extra fields:
>>
>> 1. Pre-defined values for the format field of resources. This is very
>> important. This field is the basis for any kind of automated access to the
>> data package; free-form text just doesn't cut it. Some of the formats that
>> are commonly used in the LOD realm are virtually unknown elsewhere, so the
>> values would have to be per-schema I think.
>>
>> 2. Positioning of custom fields. The most obviously missing field is “author
>> homepage”. You wouldn't believe how many LOD packages have a homepage URL
>> stuck behind the author name, or in the email field. Having an “author
>> homepage” custom field half a screen down from the “author name/email”
>> fields doesn't feel like it would solve this; the custom field would have to
>> be located close to the name/email fields.
>>
>> These are the biggies I think. Everything else could perhaps be done via
>> extra fields.
>>
>> Richard
>>
>>
>>
>>>
>>> David
>>>
>>> On 6 October 2010 22:53, Tim McNamara  
>>> <paperless at timmcnamara.co.nz> wrote:
>>>>
>>>> On 7 October 2010 06:58, Richard Cyganiak <richard at cyganiak.de>  
>>>> wrote:
>>>>>
>>>>> On 6 Oct 2010, at 18:17, David Read wrote:
>>>>>>
>>>>>> Excellent point. Yes, maybe we want a 'schema' to merely define
>>>>>> specific 'extra' fields, with their validation and later their
>>>>>> display. Then you could have a package having several 'schemas' quite
>>>>>> simply. The core package fields then wouldn't be affected by any of
>>>>>> this.
>>>>>
>>>>> But 'schemas' still might want to modify the behaviour of some of the
>>>>> core fields:
>>>>>
>>>>> - add a note underneath the field
>>>>> - provide a selection of choices for the resource format field
>>>>> - provide a number of checkboxes to add specific tags with special
>>>>> meanings
>>>>> - ...
>>>>
>>>> Would this level of flexibility be desirable? It may make things very
>>>> difficult to build applications on the basis of CKAN's packages if they
>>>> have different structures. I prefer the idea of a common set of
>>>> information that is fixed with possible extensions. I think there should
>>>> be a strong community push to keep to the common set unless there are
>>>> compelling reasons (necessity) to add an extension.
>>>>
>>>> Tim.
>>>>
>>
>>