[ckan-discuss] Convention for recording data "quality" information
Stefan Urbanek
stefan.urbanek at gmail.com
Thu Jul 14 19:21:44 BST 2011
On 14.7.2011, at 3:17, Sam Smith wrote:
> How is uptime related to the qa of a dataset content?
>
Agreed. This is just "nice to know" statistical information and can be computed later if necessary.
> There appear to be significantly different and unrelated concepts in a single number.
> Or is uptime the wrong word for the concept you mention?
>
> Service reliability is a very different issue to the data quality within that service and should be findable via a different query, not part of a dataset record, which the * rating should be
>
Agreed, that's why we should not mix the two together. See my next post - reply to the OP, which I am just about to write.
Stefan
> Sam
>
> On 14 Jul 2011, at 02:20, Rufus Pollock <rufus.pollock at okfn.org> wrote:
>
>> Hi All,
>>
>> We're doing work at the moment to do some quality assurance (QA)
>> processing on datasets e.g. does this resource exist (i.e. not 404),
>> is the API up [1], does it conform to a schema (if it has a schema)
>> ... (see also this original blog post by Stefan Urbanek [2])
>>
>> [1]: http://labs.mondeca.com/sparqlEndpointsStatus/index.html
>> [2]: http://ckan.org/2011/01/20/data-quality-what-is-it/
>>
>> Leaving aside the exact process by which that happens one question is
>> how and where to record the resulting info. Here's a straw-man
>> proposal. Interested in feedback, ideas based on previous experience
>> etc. A summary and the basic proposal has also been posted here:
>>
>> <http://wiki.ckan.net/Data_Quality>
>>
>> Regards,
>>
>> Rufus
>>
>> ## Proposal
>>
>> * Record info on the resource object metadata (Resources can have
>> *arbitrary* metadata)
>>
>> * Field named 'qa'.
>>
>> * Structure of 'qa' field:
>>
>>   * last_checked: timestamp of the last check
>>   * status_code: HTTP status code
>>   * uptime: hash/dict keyed by period with uptime percentage
>>   * validation: tbd (need to specify error info)
>>   * fivestar: 5 star rating
>>   * historical: array with historical versions of all of these
>>
>> Issues:
>>
>> * What about historical information? E.g. for the sparql status checker
>> [1] there is info from several weeks. If it is just summary info (e.g.
>> uptime over last month / year) then not too bad, but if it is daily
>> status then how do we store it?
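
A minimal sketch of what the proposed 'qa' field could look like as a Python dict, with uptime derived from stored check results rather than recorded by hand. The period names ("week", "month"), the rule that any 2xx status counts as "up", and the helper names uptime_percentage and build_qa are assumptions made for illustration; none of them come from the proposal. The historical entries here keep only the timestamp and status code of each check.

```python
from datetime import datetime, timedelta

# Hypothetical period names and lengths; the proposal does not fix them.
PERIODS = {"week": timedelta(days=7), "month": timedelta(days=30)}


def uptime_percentage(checks, since):
    """Percentage of checks at or after `since` that returned a 2xx status."""
    recent = [code for ts, code in checks if ts >= since]
    if not recent:
        return None  # no checks recorded in this period
    ok = sum(1 for code in recent if 200 <= code < 300)
    return 100.0 * ok / len(recent)


def build_qa(checks, now):
    """Build a 'qa' dict from (timestamp, status_code) pairs, oldest first."""
    last_ts, last_code = checks[-1]
    return {
        "last_checked": last_ts.isoformat(),
        "status_code": last_code,
        "uptime": {
            name: uptime_percentage(checks, now - span)
            for name, span in PERIODS.items()
        },
        "validation": None,  # tbd in the proposal
        "fivestar": None,    # rating is not computed in this sketch
        "historical": [
            {"last_checked": ts.isoformat(), "status_code": code}
            for ts, code in checks
        ],
    }


# Made-up check results, for illustration only.
now = datetime(2011, 7, 14)
checks = [
    (datetime(2011, 6, 20), 200),
    (datetime(2011, 7, 10), 404),
    (datetime(2011, 7, 13), 200),
    (datetime(2011, 7, 14), 200),
]
qa = build_qa(checks, now)
```

Keeping the raw checks in 'historical' means uptime for any period can be recomputed later, at the cost of a record that grows with every check.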
Stefan Urbanek
senior business intelligence consultant
http://knowerce.com