[open-government] Data Quality Management (Stefan Urbanek)

Mon Oct 4 12:54:00 UTC 2010

Stefan, excellent post.  For the Federal Government agencies and
departments, the U.S. has the Data Quality Act.  Each agency or department
must specify the best practices that exist and what they do to obtain the
best data.  Obama has updated an Executive Order to require peer review of
research.  Decisions by agencies can be challenged, and often are, if the
data quality is suspect.

Unfortunately, we have no equivalent data quality act in most states, but we
are working on that.

All the documents on Executive Orders on Data Quality and legal decisions so
far are on the web.  Please let us know how you progress.   The question you
ask is fundamental and it is non-trivial, because agencies working with
physical measures have an easier time of it than those working with soft
measures.
Dwight Hines
IndyMedia
Maine, USA

> Hi,
>
> Many of us here are working on or at least working with open data projects.
> Besides processing published datasets in a (kind of) raw format, there are
> open data/open government projects where data are scraped from
> unstructured sources or typed in manually. The data quality might vary, as
> might the level of data quality that is considered acceptable.
>
> I would like to know whether you are considering data quality management
> in your projects. If you do, how do you approach it and in what situations?
>
> DATA QUALITY DIMENSIONS
>
> What data quality dimensions do you measure? Here are a couple of them that
> are more relevant to the domain of open government:
>
> 1. Completeness - extent to which the expected attributes of data are
> provided. Data do not have to be 100% complete; the dimension is measured as
> the degree to which the data match users' expectations and data
> availability. Can be measured in an automated way.
>
> 2. Accuracy - extent to which data reflect the real-world state. For
> example: a company name is a real company name, a company identifier exists
> in the official register of companies. Can be measured in an automated way
> using various lists and mappings.
>
> Data can be complete but not accurate: for example, in Slovak public
> procurement data we have 99.5% completeness and 95% accuracy for suppliers.
> This means that almost all records have the field filled in, but 5% of
> supplier identifiers are invalid: they do not match any organization in the
> organizations registry, and they require further cleansing, special
> treatment or known/marked removal.
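
The difference between the two measures can be sketched in a few lines of Python. This is only an illustration: the records, the register extract and the function names are made up, and it assumes accuracy is computed over all records, which the post does not specify.

```python
# Hypothetical sample: supplier identifiers scraped from procurement notices.
records = [
    {"supplier_id": "1001"},
    {"supplier_id": "1002"},
    {"supplier_id": "9999"},  # filled in, but not present in the register
    {"supplier_id": ""},      # missing value
]

# Hypothetical extract of the official register of organizations.
register = {"1001", "1002", "1003"}


def completeness(rows, field):
    """Share of rows where the field is present and non-empty."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(field))
    return filled / len(rows)


def accuracy(rows, field, valid_values):
    """Share of all rows whose value is found in a reference list.

    Assumption: empty values count as inaccurate. Measuring only over
    filled-in rows would be an equally defensible definition.
    """
    if not rows:
        return 0.0
    matched = sum(1 for row in rows if row.get(field) in valid_values)
    return matched / len(rows)


print(f"completeness: {completeness(records, 'supplier_id'):.1%}")
print(f"accuracy:     {accuracy(records, 'supplier_id', register):.1%}")
```

A field can score high on the first measure and lower on the second, which is the situation described for suppliers above.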
>
> 3. Credibility - extent to which the data is regarded as true and credible.
> It can vary from source to source, and even a single source can contain both
> automated and manually entered data. This is not quite measurable in an
> automated way.
>
> 4. Timeliness - extent to which the data is sufficiently up-to-date for the
> task at hand. For example, data would not be timely if scraped from an
> unstructured PDF that was published today but contains contracts from three
> months ago. This can be measured by comparing the publishing date (or
> scraping date) with the dates within the data source.
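
The date comparison described for timeliness might look roughly like this. The dates, the 30-day limit and the function names are assumptions chosen for illustration, not values from the project.

```python
from datetime import date


def lag_days(published, record_dates):
    """Days between the publishing (or scraping) date and each record date."""
    return [(published - d).days for d in record_dates]


def share_timely(published, record_dates, max_lag_days):
    """Share of records that are no older than max_lag_days at publication.

    A negative lag (record dated after publication) is counted as not
    timely, since it more likely signals a data error.
    """
    lags = lag_days(published, record_dates)
    if not lags:
        return 0.0
    timely = sum(1 for lag in lags if 0 <= lag <= max_lag_days)
    return timely / len(lags)


# Hypothetical document published on 4 October 2010 with three contracts.
published = date(2010, 10, 4)
contract_dates = [date(2010, 7, 1), date(2010, 9, 20), date(2010, 10, 1)]

print(lag_days(published, contract_dates))
print(share_timely(published, contract_dates, max_lag_days=30))
```

What counts as "sufficiently up-to-date" depends on the task at hand, so max_lag_days would need to be chosen per dataset.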
>
> Other dimensions that can be measured, mostly if you have multiple datasets
> describing the same objects:
> 5. Consistency - do the facts in multiple datasets match? (partly measurable)
> 6. Integrity - can multiple datasets be correctly joined together? Are all
> references valid? (measurable in an automated way)
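
An automated integrity check of the kind mentioned in item 6 can be as simple as looking for references that do not resolve. The sketch below uses invented contract and organization data and a hypothetical helper name.

```python
def dangling_references(rows, field, known_keys):
    """Return rows whose reference does not resolve to a known key."""
    return [row for row in rows if row.get(field) not in known_keys]


# Hypothetical datasets: contracts referring to organizations by identifier.
contracts = [
    {"contract": "C-1", "supplier_id": "1001"},
    {"contract": "C-2", "supplier_id": "1002"},
    {"contract": "C-3", "supplier_id": "7777"},  # no such organization
]
organizations = {"1001": "Alfa", "1002": "Beta"}

broken = dangling_references(contracts, "supplier_id", organizations)
integrity = 1 - len(broken) / len(contracts)

print([row["contract"] for row in broken])
print(f"integrity: {integrity:.1%}")
```

Keeping the list of broken rows, not just the ratio, makes it easier to decide between cleansing, special treatment and marked removal.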
>
> Of course there are more data quality dimensions that can be measured.
>
>
> ACCEPTABLE DATA QUALITY (THRESHOLDS)
>
> If you are measuring data quality, how do you set acceptable thresholds?
>
> Example 1: in Slovak public procurements we were scraping the contract
> title, which went from an insufficient 20%, through 50% and 66%, to >99%,
> which was more than sufficient (originally required: 85%).
>
> Example 2: in the same project we had 66% completeness of procurement
> process type, which was considered insufficient (required >85%) and
> indicated that there were data quality issues somewhere in the process.
> However, after further analysis we found that the process type is indeed not
> available in the source, so the threshold had to be lowered, with the
> explanation "1/3 is not provided by the source".
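
One way to keep thresholds and their justifications together is a small table of per-attribute minimums with a note. This is a sketch only: the Threshold class and attribute names are hypothetical, 0.99 stands in for the ">99%" figure, and the lowered value of 0.65 is an assumption, since the post does not say what the threshold was lowered to.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    minimum: float  # required share, between 0.0 and 1.0
    note: str = ""  # why the threshold has this value


thresholds = {
    "contract_title": Threshold(0.85),
    # Assumed value; only the explanation text comes from the post.
    "process_type": Threshold(0.65, "1/3 is not provided by the source"),
}


def passes(attribute, measured_value):
    """True if the measured value meets the attribute's threshold."""
    return measured_value >= thresholds[attribute].minimum


# Contract title figures from Example 1, checked against the 85% requirement.
title_history = [0.20, 0.50, 0.66, 0.99]
print([passes("contract_title", value) for value in title_history])

# Process type from Example 2, checked against the lowered threshold.
print(passes("process_type", 0.66), thresholds["process_type"].note)
```

Storing the note next to the number means a lowered threshold always carries its reason with it.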
>
> Do you use any automated notification for automated scraping? For example,
> when an attribute from a weekly scraping job is not sufficiently complete.
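
Such a notification could be assembled with Python's standard email library. The job name, addresses, measured values and the build_alert helper below are all hypothetical, and the sketch only builds the message; sending it is left as a commented-out step.

```python
from email.message import EmailMessage


def build_alert(job_name, measured, minimums, sender, recipient):
    """Build a message listing attributes below their minimum, or None."""
    failing = {
        name: value
        for name, value in measured.items()
        if name in minimums and value < minimums[name]
    }
    if not failing:
        return None

    msg = EmailMessage()
    msg["Subject"] = f"DQ alert: {job_name}, {len(failing)} attribute(s) below threshold"
    msg["From"] = sender
    msg["To"] = recipient
    lines = [
        f"{name}: {value:.0%} (required {minimums[name]:.0%})"
        for name, value in sorted(failing.items())
    ]
    msg.set_content("\n".join(lines))
    return msg


# Hypothetical result of a weekly scraping job.
measured = {"contract_title": 0.99, "process_type": 0.66}
minimums = {"contract_title": 0.85, "process_type": 0.85}

alert = build_alert(
    "weekly procurement scrape", measured, minimums,
    sender="dq@example.org", recipient="team@example.org",
)
if alert is not None:
    print(alert["Subject"])
    print(alert.get_content())
    # To actually send it, something like the following could be used:
    # import smtplib
    # with smtplib.SMTP("localhost") as smtp:
    #     smtp.send_message(alert)
```

Returning None when nothing fails keeps the weekly job quiet unless there is something to report.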
>
> To sum it up: Do you perform data quality measurement/management? If yes:
>  How? If no:  Why?
>
> Also: do you display data quality information to the public or use it only
> internally? I've seen some sites with a nice historical DQ table (mostly for
> completeness).
>
> Regards,
>
> Stefan Urbanek
> freelance consultant, data analyst
>
> knowerce
> http://www.knowerce.sk
>