[open-government] Data Quality Management (Stefan Urbanek)

Uhlir, Paul PUhlir at nas.edu
Mon Oct 4 13:27:55 UTC 2010


Dwight, you point out an important piece of federal legislation in this area. While I don't want to get into the relative merits of the law, I will point out that it is seen by many as a double-edged sword, since it has been used by well-funded opponents of various regulations (health, environmental, safety, etc.) to question their validity and harass the underlying data providers. At the same time, the additional transparency it provides in the system is a positive value overall. States seeking to emulate the federal model should factor in the various lessons learned in its implementation.

For many freely available reports on various data quality issues published by my institution, the National Academies (Academies of Sciences, Engineering, and Medicine), see the National Academies Press (www.nap.edu). Type "data quality" into the search box and about 600 reports will come up, some more relevant than others...

Cheers,

Paul


Paul F. Uhlir, J.D.
Director, NRC Board on Research Data and Information
The National Academies, Keck-511
500 Fifth Street NW
Washington, DC 20001
USA
Tel. + 1 202 334 1531
Fax + 1 202 334 2231
Email: puhlir at nas.edu
Web: http://www.nationalacademies.org/brdi

________________________________
From: open-government-bounces at lists.okfn.org [mailto:open-government-bounces at lists.okfn.org] On Behalf Of Dwight Hines
Sent: Monday, October 04, 2010 8:54 AM
To: open-government at lists.okfn.org
Subject: Re: [open-government] Data Quality Management (Stefan Urbanek)




Stefan, excellent post.  For Federal Government agencies and departments, the U.S. has the Data Quality Act.  Each agency or department must specify the best practices that exist and what it does to obtain the best data.  Obama has updated an Executive Order to require peer review of research.  Decisions by agencies can be challenged, and often are, if the data quality is suspect.

Unfortunately, most states have no equivalent data quality act, but we are working on that.


All the documents on Executive Orders on Data Quality and the legal decisions so far are on the web.  Please let us know how you progress.  The question you ask is fundamental, and it is non-trivial because agencies working with physical measures have an easier time of it than those working with soft measures.
Dwight Hines
IndyMedia
Maine, USA





Hi,

Many of us here are working on, or at least with, open data projects. Besides processing published datasets in (more or less) raw format, there are open data/open government projects where data are scraped from unstructured sources or typed in manually. Data quality can vary, as can the requirements for what level of quality is considered acceptable.

I would like to know whether you are considering data quality management in your projects. If you do, how do you approach it, and in what situations?

DATA QUALITY DIMENSIONS

What data quality dimensions do you measure? Here are a couple that are particularly relevant to the open government domain:

1. Completeness - the extent to which the expected attributes of the data are provided. Data do not have to be 100% complete; the dimension is measured against the user's expectations and data availability. It can be measured in an automated way.
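As an illustration, completeness of a single attribute can be computed as the fraction of records in which the field is filled. A minimal Python sketch (the record structure and field name are just assumptions, not from any particular project):

```python
def completeness(records, field):
    """Fraction of records in which `field` is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

# hypothetical scraped contract records
contracts = [
    {"supplier": "ACME s.r.o."},
    {"supplier": ""},
    {"supplier": "Beta a.s."},
    {"supplier": None},
]
print(completeness(contracts, "supplier"))  # 0.5
```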

2. Accuracy - the extent to which the data reflect the real-world state. For example: a company name is a real company name, a company identifier exists in the official register of companies. It can be measured in an automated way using various lists and mappings.

Data can be complete but not accurate: for example, in Slovak (SK) public procurements we have 99.5% completeness and 95% accuracy for suppliers. This means that almost all records have the field filled, but 5% of the supplier identifiers are invalid - not matching any organization in the organizations registry - and require further cleansing, special treatment, or known/marked removal.
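The supplier-identifier check described above could be sketched like this in Python, with a small hypothetical set of company identifiers standing in for the official organizations registry (the identifiers and field name are invented):

```python
def accuracy(records, field, registry):
    """Among records with `field` filled, the fraction whose value
    appears in the reference registry."""
    filled = [r[field] for r in records if r.get(field)]
    if not filled:
        return 0.0
    return sum(1 for v in filled if v in registry) / len(filled)

# hypothetical registry of company identifiers
registry = {"31333532", "35743565"}
suppliers = [{"ico": "31333532"}, {"ico": "99999999"}, {"ico": ""}]
print(accuracy(suppliers, "ico", registry))  # 0.5 - one of two filled IDs is valid
```

Note that the denominator counts only filled records, which is what lets completeness be high while accuracy is lower.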

3. Credibility - the extent to which the data are regarded as true and credible. It can vary from source to source, and even one source can contain both automated and manually entered data. This is not easily measurable in an automated way.

4. Timeliness - the extent to which the data are sufficiently up-to-date for the task at hand. For example, data scraped from an unstructured PDF that was published today but contains contracts from three months ago would not be timely. This can be measured by comparing the publishing date (or scraping date) with the dates inside the data source.
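One simple way to quantify this, sketched in Python, is the lag in days between the publishing (or scraping) date and the newest date found inside the source (the dates below are made up to match the "three months old" example):

```python
from datetime import date

def timeliness_lag(published, record_dates):
    """Days between the newest record in the source and its publication date."""
    return (published - max(record_dates)).days

# PDF published today, but its newest contract is roughly three months old
lag = timeliness_lag(date(2010, 10, 4), [date(2010, 6, 15), date(2010, 7, 1)])
print(lag)  # 95 - a large lag suggests the source is not timely
```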

Other dimensions that can be measured, mostly when you have multiple datasets describing the same objects:
5. Consistency - do the facts in multiple datasets match? (partly measurable)
6. Integrity - can multiple datasets be correctly joined together? Are all references valid? (measurable in an automated way)
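Integrity, for instance, can be checked automatically as the fraction of references in one dataset that resolve in another. A sketch in Python (the dataset shapes and key names are illustrative only):

```python
def referential_integrity(child_records, key, parent_keys):
    """Fraction of child records whose `key` resolves to a record
    in the parent dataset (i.e. is a valid reference)."""
    if not child_records:
        return 1.0
    ok = sum(1 for r in child_records if r.get(key) in parent_keys)
    return ok / len(child_records)

# hypothetical join: contracts reference organizations by identifier
contracts = [{"org_id": "A1"}, {"org_id": "B2"}, {"org_id": "ZZ"}]
organizations = {"A1", "B2"}
print(referential_integrity(contracts, "org_id", organizations))  # 2 of 3 resolve
```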

Of course there are more data quality dimensions that can be measured.


ACCEPTABLE DATA QUALITY (THRESHOLDS)

If you are measuring data quality, how do you set acceptable thresholds?

Example 1: in Slovak public procurements we were scraping the contract title, whose completeness went from an insufficient 20%, through 50% and 66%, up to >99%, which was more than sufficient (we originally required 85%).

Example 2: in the same project we had 66% completeness for the procurement process type, which was considered insufficient (we required >85%) and indicated that there were data quality issues somewhere in the process. However, after further analysis we found that the process type is indeed not available in the source, so the threshold had to be lowered with the explanation "1/3 is not provided by the source".

Do you use any automated notification for automated scraping? For example, when an attribute from a weekly scraping job is not sufficiently complete.
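Such a notification could be as simple as comparing measured values against per-attribute thresholds after each scraping run. A sketch in Python, with invented attribute names and threshold values; in practice the alert list would be mailed or logged by the weekly job:

```python
# acceptable completeness per attribute (values are assumptions)
THRESHOLDS = {"supplier": 0.85, "title": 0.85}

def check_thresholds(measured, thresholds):
    """Return a human-readable alert for each attribute below its threshold."""
    return [
        f"{attr}: {value:.1%} < required {thresholds[attr]:.0%}"
        for attr, value in measured.items()
        if attr in thresholds and value < thresholds[attr]
    ]

alerts = check_thresholds({"supplier": 0.995, "title": 0.66}, THRESHOLDS)
print(alerts)  # only the attribute below its threshold is reported
```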

To sum it up: do you perform data quality measurement/management? If yes: how? If no: why not?

Also: do you display data quality information to the public, or use it only internally? I've seen some sites with a nice historical DQ table (mostly for completeness).

Regards,

Stefan Urbanek
freelance consultant, data analyst

knowerce
http://www.knowerce.sk


