[open-government] Data Quality Management
Stefan Urbanek
stefan.urbanek at gmail.com
Sun Oct 3 23:20:14 UTC 2010
Hi,
Many of us here are working on, or at least working with, open data projects. Besides processing published datasets in a (kind of) raw format, there are open data/open government projects where data are scraped from unstructured sources or typed in manually. Data quality might vary, as might the requirements for what quality can be considered acceptable.
I would like to know whether you are considering data quality management in your projects. If you do, how do you approach it and in what situations?
DATA QUALITY DIMENSIONS
What data quality dimensions do you measure? Here are a couple of them that are most relevant to the domain of open government:
1. Completeness - extent to which the expected attributes of data are provided. Data do not have to be 100% complete; the dimension is measured as the degree to which the data match the user's expectations and data availability. Can be measured in an automated way.
2. Accuracy - data reflect the real-world state. For example: the company name is a real company name, the company identifier exists in the official register of companies. Can be measured in an automated way using various lists and mappings.
Data can be complete but not accurate: for example, in SK public procurements we have 99.5% completeness and 95% accuracy for suppliers. This means that almost all records have the field filled, however 5% of supplier identifications are invalid (not matching any organization in the organizations registry) and require further cleansing, special treatment or known/marked removal.
3. Credibility - extent to which the data are regarded as true and credible. It can vary from source to source, or even one source can contain both automated and manually entered data. This is not quite measurable in an automated way.
4. Timeliness - extent to which the data are sufficiently up-to-date for the task at hand. For example, data scraped from an unstructured PDF that was published today but contains contracts from three months ago would not be timely. This can be measured by comparing the publishing date (or scraping date) with the dates within the data source.
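To make the measurable dimensions above a bit more concrete, here is a minimal Python sketch of completeness, accuracy and timeliness. The function names, the "supplier_id" field and the sample records are made up for illustration and are not taken from our project. The sketch also assumes that accuracy is computed over filled values only; you might prefer all records as the denominator.

```python
from datetime import date


def is_filled(value):
    """Treat None and empty strings as missing (an assumption of this sketch)."""
    return value not in (None, "")


def completeness(records, field):
    """Share of records in which `field` is filled."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if is_filled(r.get(field)))
    return filled / len(records)


def accuracy(records, field, known_values):
    """Share of filled values of `field` found in a reference list."""
    values = [r[field] for r in records if is_filled(r.get(field))]
    if not values:
        return 0.0
    valid = sum(1 for v in values if v in known_values)
    return valid / len(values)


def timeliness_lag_days(published, record_dates):
    """Days between the publishing date and the oldest date in the data."""
    return (published - min(record_dates)).days


# Hypothetical sample: one empty identifier, one identifier not in the registry.
records = [
    {"supplier_id": "100"},
    {"supplier_id": "200"},
    {"supplier_id": "999"},
    {"supplier_id": ""},
]
registry = {"100", "200", "300"}

print(completeness(records, "supplier_id"))        # 0.75
print(accuracy(records, "supplier_id", registry))  # 2 of 3 filled values are valid
print(timeliness_lag_days(date(2010, 10, 1),
                          [date(2010, 7, 1), date(2010, 7, 15)]))  # 92
```

Credibility is left out on purpose, since it is not really measurable this way.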
Other dimensions that can be measured, mostly if you have multiple datasets describing the same objects:
5. Consistency - do the facts in multiple datasets match? (partly measurable)
6. Integrity - can multiple datasets be joined together correctly? Are all references valid? (measurable in an automated way)
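For the two multi-dataset dimensions, a rough Python sketch could look like the one below. It treats integrity as "does every reference resolve in the other dataset" and consistency as "does the same fact agree after the join". The function names, field names and data are hypothetical.

```python
def check_references(rows, key, lookup):
    """Split rows into those whose `key` resolves in `lookup` and those that do not."""
    resolved = [r for r in rows if r.get(key) in lookup]
    dangling = [r for r in rows if r.get(key) not in lookup]
    return resolved, dangling


def inconsistent(rows, key, field, lookup):
    """Rows that join to `lookup` but disagree with it on `field`."""
    return [
        r for r in rows
        if r.get(key) in lookup and r.get(field) != lookup[r[key]].get(field)
    ]


# Hypothetical organizations registry, keyed by organization identifier.
organizations = {
    "100": {"name": "Alpha"},
    "200": {"name": "Beta"},
}

# Hypothetical contracts referring to the registry by supplier_id.
contracts = [
    {"id": 1, "supplier_id": "100", "name": "Alpha"},
    {"id": 2, "supplier_id": "200", "name": "Beta Ltd"},
    {"id": 3, "supplier_id": "999", "name": "Gamma"},
]

resolved, dangling = check_references(contracts, "supplier_id", organizations)
mismatched = inconsistent(contracts, "supplier_id", "name", organizations)

print([r["id"] for r in dangling])    # [3]
print([r["id"] for r in mismatched])  # [2]
```

An exact string comparison is the simplest possible consistency test; real names would probably need normalization first, which is why consistency is only partly measurable.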
Of course there are more data quality dimensions that can be measured.
ACCEPTABLE DATA QUALITY (THRESHOLDS)
If you are measuring data quality, how do you set acceptable thresholds?
Example 1: in Slovak public procurements we were scraping the contract title, whose completeness went from an insufficient 20%, through 50% and 66%, to >99%, which was more than sufficient (originally required 85%).
Example 2: in the same project we had 66% completeness of procurement process type, which was considered insufficient (required >85%) and indicated that there were issues with data quality somewhere in the process. However, after further analysis we found that the process type is indeed not available in the source, so the threshold had to be lowered, with the explanation "1/3 is not provided by the source".
Do you use any automated notification for automated scraping? For example, when an attribute from a weekly scraping job is not sufficiently complete.
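The kind of check I have in mind could be sketched in Python as below. This is only an illustration: the attribute names are invented, the numbers are borrowed from the two examples above, and how the warnings get delivered (e-mail, log, dashboard) is left open.

```python
def check_thresholds(measured, thresholds):
    """Return one warning per attribute whose completeness is below its threshold."""
    warnings = []
    for attribute, required in thresholds.items():
        # An attribute that was not measured at all counts as 0% complete.
        value = measured.get(attribute, 0.0)
        if value < required:
            warnings.append(
                f"{attribute}: completeness {value:.0%} is below required {required:.0%}"
            )
    return warnings


# Completeness measured after a (hypothetical) weekly scraping job.
measured = {"contract_title": 0.99, "process_type": 0.66}

# Originally required thresholds.
thresholds = {"contract_title": 0.85, "process_type": 0.85}

for warning in check_thresholds(measured, thresholds):
    print(warning)  # process_type: completeness 66% is below required 85%

# Threshold lowered after analysis: 1/3 is not provided by the source.
lowered = dict(thresholds, process_type=0.66)
print(check_thresholds(measured, lowered))  # []
```

Keeping the thresholds in a plain mapping makes it easy to store the explanation for each lowered value next to it.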
To sum it up: Do you perform data quality measurement/management? If yes: How? If no: Why?
Also: do you display data quality information to the public or use it only internally? I've seen some sites with a nice historical DQ table (mostly for completeness).
Regards,
Stefan Urbanek
freelance consultant, data analyst
knowerce
http://www.knowerce.sk