[ckan-discuss] Convention for recording data "quality" information

Stefan Urbanek stefan.urbanek at gmail.com
Thu Jul 14 19:21:44 BST 2011


On 14.7.2011, at 3:17, Sam Smith wrote:

> How is uptime related to the QA of a dataset's content?
> 

Agreed. This is just "nice to know" statistical information and can be computed later if necessary.

> There appear to be significantly different and unrelated concepts in a single number. 
> Or is uptime the wrong word for the concept you mention?
> 
> Service reliability is a very different issue from the data quality within that service. It should be findable via a different query, not as part of a dataset record, which is where the * rating should be.
> 

Agreed, that's why we should not mix the two together. See my next post - reply to the OP, which I am just about to write.

Stefan

> Sam
> 
> On 14 Jul 2011, at 02:20, Rufus Pollock <rufus.pollock at okfn.org> wrote:
> 
>> Hi All,
>> 
>> We're doing work at the moment to do some quality assurance (QA)
>> processing on datasets e.g. does this resource exist (i.e. not 404),
>> is the API up [1], does it conform to a schema (if it has a schema)
>> ... (see also this original blog post by Stefan Urbanek [2])
>> 
>> [1]: http://labs.mondeca.com/sparqlEndpointsStatus/index.html
>> [2]: http://ckan.org/2011/01/20/data-quality-what-is-it/
>> 
>> Leaving aside the exact process by which that happens, one question is
>> how and where to record the resulting info. Here's a straw-man
>> proposal. Interested in feedback, ideas based on previous experience
>> etc. A summary and the basic proposal has also been posted here:
>> 
>> <http://wiki.ckan.net/Data_Quality>
>> 
>> Regards,
>> 
>> Rufus
>> 
>> ## Proposal
>> 
>> * Record info on the resource object metadata (Resources can have
>> *arbitrary* metadata)
>> 
>> * Field named 'qa'.
>> 
>> * Structure of 'qa' field:
>> 
>> * last_checked: timestamp of the last check
>> * status_code: HTTP status code
>> * uptime: hash/dict keyed by period with uptime percentage
>> * validation: tbd (need to specify error info)
>> * fivestar: 5 star rating
>> * historical: array with historical versions of all of these
>> 
>> Issues:
>> 
>> * What about historical information? E.g. for the sparql status checker
>> [1] there is info from several weeks. If it is just summary info (e.g.
>> uptime over the last month / year) then not too bad, but if it is daily
>> status then how do we store it?

Stefan Urbanek

senior business intelligence consultant
http://knowerce.com





