[ckan-discuss] Convention for recording data "quality" information

Thu Jul 14 19:45:56 BST 2011

Hi,

I am currently in GMT-7 (CET-9) with limited internet access, so I apologize for my sparse responses.

On 13.7.2011, at 18:20, Rufus Pollock wrote:

> Hi All,
> 
> We're doing work at the moment to do some quality assurance (QA)
> processing on datasets e.g. does this resource exist (i.e. not 404),
> is the API up [1], does it conform to a schema (if it has a schema)
> ... (see also this original blog post by Stefan Urbanek [2])
> 
> [1]: http://labs.mondeca.com/sparqlEndpointsStatus/index.html
> [2]: http://ckan.org/2011/01/20/data-quality-what-is-it/
> 
> Leaving aside the exact process by which that happens one question is
> how and where to record the resulting info. Here's a straw-man
> proposal. Interested in feedback, ideas based on previous experience
> etc. A summary and the basic proposal has also been posted here:
> 
> <http://wiki.ckan.net/Data_Quality>
> 
> Regards,
> 
> Rufus
> 
> ## Proposal
> 
> * Record info on the resource object metadata (Resources can have
> *arbitrary* metadata)
> 
> * Field named 'qa'.
> 
> * Structure of 'qa' field:
> 
> * last_checked: timestamp of last checked time
> * status_code: HTTP status code
> * uptime: hash/dict keyed by period with uptime percentage
> * validation: tbd (need to specify error info)
> * fivestar: 5 star rating
> * historical: array with historical versions of all of these
> 

To design what should be recorded, I would start from the other side: begin with the checking process and work towards the resource record.

First, the QA process will check for data availability and will store:

- resource_id
- resource_url - just in case it changes, the rest of this information is more related to this URL than to the resource
- checked_date
- resource_check_reason - why we are touching the resource (see below)
- resource_status - our status code, see below
- http_status
- status_message – if provided in the reply
- redirect_count – might be good to know
- redirected_resource_url (or final_resource_url) - URL where the actual resource was found after redirections

We might plug in additional HTTP information for later analysis, such as:

- resource_size
- mime_type - actual MIME type as returned by the server (what if the server says text but the file is actually xls, and that mismatch is the reason for a failure?)
- last_modified
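
A minimal sketch of one such check record, assuming Python. The class name `ResourceCheck`, the field types, and the default values are illustrative assumptions and not part of any existing CKAN API; only the field names come from the list above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ResourceCheck:
    """One QA check transaction for a resource (hypothetical structure)."""
    resource_id: str
    resource_url: str               # URL as it was when the check ran
    checked_date: datetime
    resource_check_reason: str      # why the resource was touched
    resource_status: str            # our own status code, e.g. "ok"
    http_status: Optional[int] = None
    status_message: Optional[str] = None
    redirect_count: int = 0
    final_resource_url: Optional[str] = None
    # Optional HTTP details kept for later analysis
    resource_size: Optional[int] = None
    mime_type: Optional[str] = None
    last_modified: Optional[datetime] = None


# Example values are made up for illustration.
check = ResourceCheck(
    resource_id="res-1",
    resource_url="http://example.org/data.csv",
    checked_date=datetime(2011, 7, 14, 19, 45),
    resource_check_reason="availability",
    resource_status="ok",
    http_status=200,
)
```

Keeping `resource_url` on every record means a later URL change on the resource does not rewrite the history of what was actually checked.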

Resource check reason:

- availability/ping - only whether the resource exists, just to get latest up-to-date resource information
- cache/download - we touched the remote resource, because we wanted to cache/download it

Resource status:

- ok
- invalid – invalid URL
- not_found – 404 response
- too_deep – too many redirections (we should have some limit)
- too_big (do we need this?)
- ...
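
As a rough sketch, the statuses listed above could be derived from a check result like this. The function name `classify_status`, the redirect limit of 5, and the rule that any 2xx response counts as "ok" are assumptions made for the example; results that match none of the listed statuses return `None`, since the list is still open.

```python
from typing import Optional
from urllib.parse import urlsplit

MAX_REDIRECTS = 5  # assumed limit; no concrete value has been agreed


def classify_status(url: str, http_status: Optional[int],
                    redirect_count: int = 0) -> Optional[str]:
    """Map a raw check result to one of the resource statuses listed above."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return "invalid"
    # Assumption: only absolute http(s) URLs are considered valid.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return "invalid"
    if redirect_count > MAX_REDIRECTS:
        return "too_deep"
    if http_status == 404:
        return "not_found"
    if http_status is not None and 200 <= http_status < 300:
        return "ok"
    return None  # not covered by the statuses listed so far
```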

Then we might populate the resource record with the latest information that needs to be available in the front-end, such as:
- last_checked_date = resource_availability.checked_date
- resource_check_status = resource_availability.resource_status
- last_good_date = MAX(resource_availability.checked_date) WHERE resource_status = ok
- failure_count = COUNT(1) WHERE resource_status != ok
- cached_date - date of resource that we have successfully downloaded and cached (somewhere)
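
The derived fields above can be sketched in Python as follows, assuming the check log for one resource is available as a list of dicts. The function name `summarize_checks` and the dict layout are assumptions for illustration; `cached_date` is left out because it depends on a caching step that is not specified here.

```python
from datetime import datetime


def summarize_checks(checks):
    """Derive front-end fields from one resource's check log (list of dicts)."""
    if not checks:
        return None
    latest = max(checks, key=lambda c: c["checked_date"])
    good_dates = [c["checked_date"] for c in checks
                  if c["resource_status"] == "ok"]
    return {
        "last_checked_date": latest["checked_date"],
        "resource_check_status": latest["resource_status"],
        # None when the resource has never been checked successfully
        "last_good_date": max(good_dates) if good_dates else None,
        "failure_count": sum(1 for c in checks
                             if c["resource_status"] != "ok"),
    }


# Made-up log: one good check followed by two failures.
log = [
    {"checked_date": datetime(2011, 7, 12), "resource_status": "ok"},
    {"checked_date": datetime(2011, 7, 13), "resource_status": "not_found"},
    {"checked_date": datetime(2011, 7, 14), "resource_status": "not_found"},
]
summary = summarize_checks(log)
```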

What do you think?

> Issues:
> 
> * What about historical information. E.g. for sparql status checker
> [1] there is info from several weeks. If just summary info (e.g.
> uptime over last month / year) then not too bad but if daily status
> then how do we store.

I would record all check transactions, then create the summary from them. It is always good to keep the information in the most detailed form you can get.
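
For example, an uptime summary keyed by period could be computed from the full check log along these lines. This is only a sketch: the function name `uptime_by_month`, the monthly period, and the definition of uptime as the share of checks with status "ok" are assumptions, and that share only approximates real uptime when checks run at regular intervals.

```python
from collections import defaultdict
from datetime import date


def uptime_by_month(checks):
    """Return {"YYYY-MM": percentage of checks in that month with status "ok"}."""
    totals = defaultdict(int)
    oks = defaultdict(int)
    for checked_date, status in checks:
        period = checked_date.strftime("%Y-%m")
        totals[period] += 1
        if status == "ok":
            oks[period] += 1
    return {period: 100.0 * oks[period] / totals[period] for period in totals}


# Made-up log of (checked_date, resource_status) pairs.
log = [
    (date(2011, 6, 29), "ok"),
    (date(2011, 6, 30), "not_found"),
    (date(2011, 7, 1), "ok"),
    (date(2011, 7, 2), "ok"),
]
uptime = uptime_by_month(log)
```

Because the raw transactions are kept, the same log can later be summarized by day, week, or year without running the checks again.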

Regards,

Stefan Urbanek

senior business intelligence consultant
