[ckan-discuss] Convention for recording data "quality" information

Rufus Pollock rufus.pollock at okfn.org
Thu Jul 14 02:20:29 BST 2011


Hi All,

We're currently working on quality assurance (QA) processing for
datasets, e.g. does this resource exist (i.e. not 404), is the API
up [1], does it conform to a schema (if it has one) ... (see also
this original blog post by Stefan Urbanek [2])

[1]: http://labs.mondeca.com/sparqlEndpointsStatus/index.html
[2]: http://ckan.org/2011/01/20/data-quality-what-is-it/

Leaving aside the exact process by which that happens, one question is
how and where to record the resulting info. Here's a straw-man
proposal. I'm interested in feedback, ideas based on previous
experience, etc. A summary and the basic proposal have also been
posted here:

<http://wiki.ckan.net/Data_Quality>

Regards,

Rufus

## Proposal

* Record info on the resource object metadata (Resources can have
*arbitrary* metadata)

* Field named 'qa'.

* Structure of 'qa' field:

 * last_checked: timestamp of last checked time
 * status_code: HTTP status code
 * uptime: hash/dict keyed by period with uptime percentage
 * validation: tbd (need to specify error info)
 * fivestar: 5 star rating
 * historical: array with historical versions of all of these
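
As a rough illustration only, a 'qa' value on a resource might look
something like the Python dict below. The sample values, the period keys
("week", "month"), the example URL and the shape of the entries under
"historical" are all assumptions made for the example; none of this is
settled by the proposal.

```python
import json

# Hypothetical 'qa' value stored in a resource's arbitrary metadata.
# Field names follow the proposal; every value here is made up.
qa = {
    "last_checked": "2011-07-14T02:20:29",     # timestamp of the last check
    "status_code": 200,                        # HTTP status code
    "uptime": {"week": 100.0, "month": 98.5},  # keyed by period (assumed keys)
    "validation": None,                        # tbd in the proposal
    "fivestar": 3,                             # 5 star rating
    "historical": [                            # earlier snapshots (assumed shape)
        {"last_checked": "2011-07-13T02:20:29", "status_code": 404},
    ],
}

# A resource carrying the field; the URL is a placeholder.
resource = {"url": "http://example.org/data.csv", "qa": qa}

# The structure is plain JSON, so it serialises without extra work.
serialized = json.dumps(resource)
```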

Issues:

* What about historical information? E.g. the SPARQL status checker
[1] has info from several weeks. If we only keep summary info (e.g.
uptime over the last month / year) then it's not too bad, but if we
record daily status then how do we store it?
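
One possible way to keep the stored data small, sketched here under
assumptions: keep daily results somewhere outside the resource and only
write the rolled-up percentages into 'uptime'. The function name
`uptime_summary`, the date-to-bool input format and the period lengths
are all hypothetical.

```python
from datetime import date, timedelta


def uptime_summary(daily_checks, today, periods):
    """Roll daily up/down results into per-period uptime percentages.

    daily_checks: dict mapping a date to True (up) or False (down).
    periods: dict mapping a period name to its length in days.
    Returns a dict of period name -> percentage, or None if the period
    contains no checks.
    """
    summary = {}
    for name, days in periods.items():
        start = today - timedelta(days=days)
        # Keep only checks that fall inside the period (start, today].
        results = [up for day, up in daily_checks.items() if start < day <= today]
        if results:
            summary[name] = round(100.0 * sum(results) / len(results), 1)
        else:
            summary[name] = None
    return summary


# Ten days of made-up checks with one failure three days ago.
today = date(2011, 7, 14)
checks = {today - timedelta(days=i): (i != 3) for i in range(10)}
summary = uptime_summary(checks, today, {"week": 7, "month": 30})
```

The resulting dict has the same shape as the proposed 'uptime' field, so
only the summary would need to live on the resource.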
