[School-of-data] Help Defining Dataset Definition and Quality Parameters

Tony.Hirst Tony.Hirst at open.ac.uk
Tue Apr 30 09:46:59 UTC 2013

To add to Michael's list, a few detailed things I've found from experience that you may be able to generalise back up from...:

(2) downloadable data - as well as being downloadable, does the file open and contain what it claims to? (I appreciate this adds even more burden to the checking!) Sometimes files can be empty too (which could be checked by looking at file sizes).
(3) Machine readable format/doc formats - often documents aren't in the format they claim; for example .csv files that are actually Excel (.xls) files.
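
Points (2) and (3) can be partly automated. What follows is a minimal sketch, not a definitive checker: it assumes the files have already been downloaded to disk, and the names inspect_file and is_plausible_csv are made up for illustration. It reports the file size (to catch empty files) and looks at the leading bytes to spot a ".csv" that is really a legacy Excel/Word container, a zip-based office file, or a PDF.

```python
import os

# Leading "magic" bytes of a few common binary formats.
MAGIC_SIGNATURES = {
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "ole2",  # legacy .xls / .doc container
    b"PK\x03\x04": "zip",                          # .xlsx, .docx, .ods or a plain zip
    b"%PDF": "pdf",
}


def inspect_file(path):
    """Return (size_in_bytes, kind) for a downloaded file."""
    size = os.path.getsize(path)
    if size == 0:
        return size, "empty"
    with open(path, "rb") as handle:
        head = handle.read(8)
    for signature, kind in MAGIC_SIGNATURES.items():
        if head.startswith(signature):
            return size, kind
    # No known binary signature: may be text such as CSV, but this is not proof.
    return size, "text-or-unknown"


def is_plausible_csv(path):
    """True if the file is non-empty and does not start with a known binary signature."""
    return inspect_file(path)[1] == "text-or-unknown"
```

A file that passes is only "not obviously wrong"; it still has to be opened and parsed to confirm it contains what it claims to.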

How frequently is data uploaded? How current or timely does it appear to be? Are there publication schedules for data releases, and are they stuck to?
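
If release dates can be collected for a dataset, timeliness can be measured roughly as follows. This is a sketch under assumptions: the dates are gathered by hand or from portal metadata, the promised interval is known, and the function names are hypothetical.

```python
from datetime import date


def late_releases(release_dates, expected_interval_days, grace_days=0):
    """Return (previous, current, gap_in_days) for each gap longer than the schedule allows."""
    ordered = sorted(release_dates)
    limit = expected_interval_days + grace_days
    late = []
    for previous, current in zip(ordered, ordered[1:]):
        gap = (current - previous).days
        if gap > limit:
            late.append((previous, current, gap))
    return late


def days_since_last_release(release_dates, today):
    """How stale the most recent release is, in days."""
    return (today - max(release_dates)).days


# Example with made-up dates for a dataset that is meant to be monthly.
releases = [date(2013, 1, 1), date(2013, 2, 1), date(2013, 4, 15)]
overdue = late_releases(releases, expected_interval_days=31)
staleness = days_since_last_release(releases, today=date(2013, 4, 30))
```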

Another thing to look for is the extent to which shared data is understandable (metadata requirements):
- is there a description somewhere of what column headings relate to, for example?
- are common vocabularies/standard terms/unique universal identifiers used to identify things (for example, in spending data, is there a company identifier as well as a company name)? Use of standard terms/identifiers makes it easier to cross-reference datasets.
- where you have numerical columns/data, is it clear what the units are? Are the numbers actually numbers, or are they tainted with units? For example, do columns of spend amounts include non-numerical symbols (as in $121,343) rather than plain numbers (121343) with the units specified in the column heading/metadata?
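
The "tainted numbers" check is easy to sketch. The code below assumes a comma is a thousands separator and a dot is the decimal mark, which will not hold for every country's data; the function names are illustrative only.

```python
import re

# A "clean" number: optional minus sign, digits, optional decimal part.
CLEAN_NUMBER = re.compile(r"-?\d+(\.\d+)?")


def is_clean_number(cell):
    """True if the cell is a plain number with no symbols or separators."""
    return CLEAN_NUMBER.fullmatch(cell.strip()) is not None


def strip_units(cell):
    """Drop currency symbols, commas and spaces; return a float, or None if nothing numeric is left."""
    cleaned = re.sub(r"[^\d.\-]", "", cell)
    return float(cleaned) if CLEAN_NUMBER.fullmatch(cleaned) else None


def tainted_fraction(cells):
    """Share of non-empty cells in a column that are not plain numbers."""
    non_empty = [c for c in cells if c.strip()]
    if not non_empty:
        return 0.0
    tainted = sum(1 for c in non_empty if not is_clean_number(c))
    return tainted / len(non_empty)
```

A column with a high tainted fraction needs cleaning before any arithmetic can be done on it, which is a cost the publisher has pushed onto the user.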

Where datasets are supposed to be released in a standard form, are they?
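
One simple way to test conformance, assuming the standard form amounts to an agreed list of column names (the names in the example are invented, not taken from any real standard), is to compare a file's header row against that list:

```python
import csv
import io


def check_header(csv_text, expected_columns):
    """Return (missing, unexpected) column names, comparing the first row to the expected list."""
    reader = csv.reader(io.StringIO(csv_text))
    header = [name.strip() for name in next(reader, [])]
    missing = [c for c in expected_columns if c not in header]
    unexpected = [c for c in header if c not in expected_columns]
    return missing, unexpected


# Hypothetical spending file checked against a hypothetical required column list.
sample = "date,supplier,amount\n2013-04-01,ACME,100\n"
missing, unexpected = check_header(sample, ["date", "supplier", "supplier_id", "amount"])
```

The comparison is deliberately case-sensitive, so "Date" and "date" count as different columns.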

Another important aspect of data quality is the extent to which the data is actually "correct" and the extent to which it meets the needs of whatever reason there was for generating/using the data.

Another place to look would be reports on open data initiatives eg the UK National Audit Office report on UK open data http://www.nao.org.uk/wp-content/uploads/2012/04/10121833.pdf


From: Michael Bauer [michael.bauer at okfn.org]
Sent: Tuesday, April 30, 2013 8:06 AM
To: Mailing list for the School of Data, a joint initiative of the OKFN and P2PU
Cc: Alon Peled
Subject: Re: [School-of-data] Help Defining Dataset Definition and Quality Parameters


There are several things that are crucial if I think of the quality of open

1) What kind of datasets are released. Is it hot and spicy stuff like
company registers, election data, budget data, transport data etc. or is it
something like the dog register, positions of public toilets etc. - this
can easily be done manually - if you do this: please submit to the OpenData
Census:  http://census.okfn.org/country/

2) Is the data actually downloadable? There is someone in Austria who does
this automatically for Austria - some datasets are linked but the link is
old and just goes somewhere

3) Is it in machine readable format (or just .pdf or .doc files) (yes some
open data portals consider this data)

4) How finely grained is the data: e.g. the city of Vienna publishes their
budget as open data - but only the highly aggregated version: each dataset
has about 5 datapoints. So while the budget is there, it's pretty useless
for analysis.

5) Licenses: is it actually a license that is open according to the open
definition: http://opendefinition.org/ (German data portals, for example,
like to use non-open licenses, as does the EU)
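
Criterion 2 (is the data actually downloadable?) lends itself to automation. The following is only a sketch of how such a checker might look, not the tool used in Austria: check_link is a hypothetical helper, and the opener parameter exists so the function can be tried without network access. It treats HTTP errors and unreachable hosts as broken, flags links that end up at a different URL, and flags responses with no content.

```python
import urllib.error
import urllib.request


def check_link(url, opener=urllib.request.urlopen, timeout=10):
    """Classify a dataset link as 'ok', 'redirected', 'empty' or 'broken'.

    Returns (label, detail), where detail is the final URL or an error message.
    """
    try:
        with opener(url, timeout=timeout) as response:
            final_url = response.geturl()      # URL after any redirects
            first_bytes = response.read(512)   # enough to see whether there is a body
    except urllib.error.HTTPError as err:
        return "broken", f"HTTP {err.code}"
    except (urllib.error.URLError, OSError, ValueError) as err:
        return "broken", str(err)
    if not first_bytes:
        return "empty", final_url
    if final_url != url:
        # Old links often redirect to a portal home page instead of the file.
        return "redirected", final_url
    return "ok", final_url
```

A "redirected" result is not necessarily a failure, but it is worth a manual look to see whether the link still leads to the dataset.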

I'd try to work down these five criteria, explain why each of them is
important and base my analysis on it.


On Mon, Apr 29, 2013 at 01:15:17PM +0100, Tarek Amr wrote:
> As far as I understood, there should be two approaches to do so.
> (A) You can do it manually, sort of. Let's say you set some rules to
> measure collaboration. Number of edits for each file, number of people who
> edit it, maybe quantity of discussions, process log files, you name it.
> (B) Train a computer to do this for you. In that case, you need some files
> or records that you know represent collaboration, and some that don't.
> Then you train a classifier on the attributes of those records or resources
> and use it to tell whether the data as a whole represents
> collaboration or not.
> Regarding the quality of the data, I guess you should check the following:
> - Is it easy to transform the data to an open format a computer can read
> and process
> - Is it easy to extract some features from the data (for example: number of
> edits on each file, their data, who edited them, etc)
> - Is any data missing?
> Anyway, those are my (not even) $ 0.002, so will wait for more experienced
> ones to add their input here
> On Mon, Apr 29, 2013 at 12:49 PM, Matan Rotman <matan.rotman at gmail.com> wrote:
> > Hello all,
> >
> > My name is Matan Rotman, and I'm a student at Hebrew University majoring
> > in Political Science. As part of my studies, I'm writing a paper that tries
> > to understand whether the Israel open data program is efficient
> > (collaboration-wise), and if not, why not (that is, why administrative
> > departments won't cooperate with the program). The first thing I
> > need to do for that, though, is to understand whether the datasets on
> > the website are of good quality or not. As I'm not a technical guy, I could use
> > some help understanding what would be considered the definition of a
> > dataset (hopefully, as specific as possible), and more importantly, I could
> > really use some help with defining quality parameters so I could
> > measure the quality of the different files and sets uploaded.
> >
> > The website is at http://data.gov.il (all in Hebrew though), and I'd love
> > any help possible on the subject.
> >
> > P.S
> > I hope this is the right place to ask, and I'm going to ask also at Open
> > government mailing list, so I apologize if you get my message twice.
> >
> > Best Regards,
> > Matan
> >
> > _______________________________________________
> > School-of-data mailing list
> > School-of-data at lists.okfn.org
> > http://lists.okfn.org/mailman/listinfo/school-of-data
> > Unsubscribe: http://lists.okfn.org/mailman/options/school-of-data
> >
> >
> --
> Best Regards
> Tarek Amr
> http://about.me/tarekamr


Data Wrangler with the Open Knowledge Foundation (OKFN.org)
GPG/PGP key: http://tentacleriot.eu/mihi.asc
Twitter: @mihi_tr Skype: mihi_tr


The Open University is incorporated by Royal Charter (RC 000391), an exempt charity in England & Wales and a charity registered in Scotland (SC 038302).
