[ok-scotland] The Illusion Of Perfection - do it once right, or do it multiple times not quite right - which is better?
Jane.Morgan at scotland.gsi.gov.uk
Fri Nov 21 17:45:46 UTC 2014
Various exchanges have been sparked off by the draft open data strategy. We are grateful for all of them and will look at them carefully for relevance to the strategy, but I should make one point now.
The strategy will be deliberately high level and short. We need to drive a culture change here, and as such we need the attention of public sector leaders, who will not want more than the main messages, particularly around benefits.
As you may have noticed from the text, we also plan a tool-kit, and some of the issues you address may better form part of that.
That said, I had thought we made Dave's point highlighted below, but it may be that I have only made it in presentations! If so, we will get it in!
From: ok-scotland [mailto:ok-scotland-bounces at lists.okfn.org] On Behalf Of David Stafford
Sent: 21 November 2014 15:44
To: ok-scotland at lists.okfn.org
Cc: Fiona Wilbraham; Henry Sullivan
Subject: [ok-scotland] The Illusion Of Perfection - do it once right, or do it multiple times not quite right - which is better?
Hello again -
I'd like to address the tiny elephant in the room, the one I saw lurking at the very bottom of Kate's message.
"beware that the best becomes the enemy of the good".
I really, really want to say something now, that is based on 15 years of work on a lot of very large, and often complex, data sets, many with complex, multi-layered categorisation schemes, and other wonderful attributes - and that is this:
I believe there is an illusion, mostly among middle managers and some senior managers, that some data analysts (such as myself) only work in one mode: the "best" mode - that they only want perfect data, that they are unwilling to settle for less, and, worst of all, that they work really slowly, that they "waste time" cleaning up data more than it needs to be cleaned, and a host of other "data crimes" which I've been accused of so many times that I've lost track and, somewhat, lost interest.
I want to shatter this illusion once and for all, because it's simply mostly untrue. First of all, I am willing to accept data that is less than perfect; even far less than perfect; as long as it is identified as such. As long as the data owner or creator has stated the exceptions, perhaps stating that this data set should not be used for a certain kind of calculation, because some of the numbers in the data set have not been validated or verified. If I know that, I know I can trust part of the data, but, not certain parts. I can then work around the untrustworthy parts, or the missing parts, or whatever the issue with that data set is.
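To make the idea of "identified as such" concrete, here is a minimal Python sketch. It assumes the data owner supplies a per-row flag; the field names (site, tonnes, validated) and the values are invented purely for illustration and do not come from any real data set.

```python
# Hypothetical records: each row carries a flag set by the data owner
# saying whether the figure has been validated.
records = [
    {"site": "A", "tonnes": 12.5, "validated": True},
    {"site": "B", "tonnes": 9.0, "validated": False},
    {"site": "C", "tonnes": 14.0, "validated": True},
]

def split_by_trust(rows):
    """Separate rows the owner has validated from rows they have not."""
    trusted = [r for r in rows if r["validated"]]
    untrusted = [r for r in rows if not r["validated"]]
    return trusted, untrusted

trusted, untrusted = split_by_trust(records)

# Calculate only from the trusted part; the rest is set aside, not discarded.
total = sum(r["tonnes"] for r in trusted)
print(f"Total from validated rows: {total}; rows set aside: {len(untrusted)}")
```

The point of the sketch is only that a stated exception lets the analyst choose which rows feed a calculation, rather than guessing.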
Now - with regard to this perfection lark. Yes, I'd like all data to be perfect, sure, who wouldn't. But I am a human being who has seen the horrors of what most junior data entry clerks enter into databases, who has looked and looked in total shock and horror at the terrifically poor quality of most business data, which often looks as if it were authored by a five year old child, it's that full of errors and problems. So for me to expect "perfect data", in a world like the one we live in - is ridiculous! How could I - when I am faced with the reality of the poor, poor quality of huge segments of business data - across the WORLD.
Here's what I would say of perfection - yes, for me, it's a personal GOAL. If I COULD - I would work on data sets until they ARE perfect. However, in business, this is rarely allowed, because the amount of time it would take to create "perfect data" is prohibitively expensive. I might be able to get data quality up to about 80 percent using data tools, inside a database, in maybe four hours, or eight hours. But that last 20 percent, where there is so much missing, or so much requiring external validation, will take me much, MUCH longer to do - probably about triple the time spent on the 80 percent. So if it took me 8 hours to query-cleanse the data to 80 percent data quality, then it would take me 24 additional hours to do about 17 of the remaining 20 percentage points. The 3 percent after that will almost certainly be REALLY prohibitively expensive for me to work on; I would suggest it would be another 3 working days (24 hours) just to clean up those last foul 3 percent of really messed up data. Not always, but often enough. So - it's something like this, then:
First 80 percent - computer queries - 8 hours
Next 17 percent - computer queries coupled with manual and external validation - 24 additional hours
Next 3 percent - manual work only, to clean up and re-type missing information - 24 additional hours
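The running totals behind these three stages can be tallied in a few lines. This is a minimal Python sketch that uses only the hour figures quoted in this message; the names STAGES and cumulative_hours are made up for illustration.

```python
# Each stage: (quality level reached in percent, additional hours needed),
# taken from the three stages listed above.
STAGES = [(80, 8), (97, 24), (100, 24)]

def cumulative_hours(stages):
    """Return (quality, total hours spent so far) for each stage, in order."""
    total = 0
    result = []
    for quality, extra_hours in stages:
        total += extra_hours
        result.append((quality, total))
    return result

for quality, hours in cumulative_hours(STAGES):
    print(f"{quality} percent: {hours} hours in total")
```

On these figures, reaching 80 percent costs 8 hours, 97 percent costs 32 hours, and 100 percent costs 56 hours.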
Now, if you ask a manager which option they want, they will, without a doubt, pick option 1 - and, they will want it in three hours instead of 8.
A truly forward thinking manager, MIGHT, on a good day, allow me to work on something to the 97 percent level. This has happened ONCE in my recent memory, it's an absolute rarity, so, most of the data I've worked on, will be sitting at a pathetic 80 percent, because my boss would never, ever let me do the job right.
At no point, have I ever been allowed to go the whole way - except privately, if I was making a data collection to use as a reference table - then I would absolutely take it to 100 percent, and the amount of time it took - be damned.
The thing is, to do a really good job, you have to at least go to option two, you really do. 80 percent is possibly useful for some applications, and it's certainly better than the percentage that the incoming raw data would have been at - but in many instances the kind of precision needed is much higher (think rocket fuels or other explosives, organic chemicals, or other volatile chemical data) - for these types of data, it could be physically dangerous to trust anything LESS than 100 percent, so-called "perfect" data.
Because if you used 80 percent data to cook up explosives, you might get it wrong, and it might literally blow up in your face.
But - I would have my say - on those rare occasions where I have been ALLOWED to work through a database completely, adding a long, painful, difficult, manual editing session onto the end of a long, wearisome query based computer scan and repair operation, the resulting data sets have been so powerful, so useful, so reliable, so trustworthy, that in some cases, I was able to use them for YEARS before they required an update.
So the closer to perfect your data is, the longer it's useful; and, the more trustworthy and reliable it is, so...it's important not to "skimp" just for the sake of saving a few pounds or dollars of labour or labor. It's just almost stupid to stop when you are at 80 percent, because the next increment will almost always take you to about 97 percent, or at least, somewhere in the mid to high 90s, and that may be good enough.
In some fields though, it won't be, and for my money, I would just take the time to fix the data to 100 percent ONE TIME - and then, MAINTAIN it, update it EACH WEEK, so that it never goes out of date, it never falters, and it stays effectively at 99 - 100 percent for a long, long time.
If you do it to 80 percent, in a few weeks, you are going to have to do the whole thing again, because so much will be wrong, and it will be creating problems for you.
If you do it to 97 percent, this effect is lessened, but problems can and do emerge, which can mean needing to rebuild the database.
If you do it to 100 percent, you can trust it for weeks, maybe months, without spending ANY TIME or money on it.
One way or another, you are going to invest the 8 hours, the 24 hours, and the other 24 hours, into this dataset.
You could do that by repeatedly, often, rebuilding the data set to 80 percent, but - having to do that quite often, and taking another 8 hours to do so.
You could do it to 97, and that would mean, you could trust it longer, and leave rebuilding it longer - but, you will still need to do so, and that will mean another 32 hours to do so.
Or - you could do it ONCE, to 100 percent, bite the bullet, spend the 56 hours ONE TIME - and not have to touch or worry about the data for a year or two.
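The comparison between these three options can be sketched in Python. The build hours (8, 32, 56) come from this message; the rebuild intervals are NOT from the message and are purely illustrative assumptions, as is the 78-week horizon, so the output shows the shape of the argument rather than a measured result. Regular light maintenance is left out of the sketch.

```python
import math

# Hours for one full cleanse to each quality level (figures from this message).
BUILD_HOURS = {80: 8, 97: 32, 100: 56}

# ASSUMPTION: how many weeks each level stays usable before a rebuild.
# These intervals are illustrative guesses, not measurements.
WEEKS_BETWEEN_REBUILDS = {80: 4, 97: 26, 100: 78}

def total_hours(quality, horizon_weeks):
    """Hours spent on cleanses over the horizon, counting the initial one."""
    builds = math.ceil(horizon_weeks / WEEKS_BETWEEN_REBUILDS[quality])
    return builds * BUILD_HOURS[quality]

for quality in (80, 97, 100):
    hours = total_hours(quality, horizon_weeks=78)
    print(f"{quality} percent: {hours} hours over 78 weeks")
```

Under these assumed intervals the 80 percent option ends up the most expensive over a year and a half. With different intervals the ranking could change, which is why the intervals are worth measuring for a real data set.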
It's up to you. You don't SAVE money by spending less time sorting it out, the more of a mess you leave it in, the MORE OFTEN you will have to cleanse it. And you will end up spending more, because you were UNWILLING to do the job right, once, one time, to obtain a pristine data set that is so trustworthy, that you needn't worry about spending a moment on it for months or years - you can run queries to confirm this, but they will NOT return problems - over time, it will need to be occasionally updated, but if you give it regular maintenance, which does not take that much time - you will have a super high quality, renewable database or collection, that can support not just your ordinary work, but also, some of these incredibly creative data re-use projects that we've been talking about.
I've been both a staff member, and a manager, and I've spent the actual time taking data to 80 percent, 97 percent or 100 percent. And I just know, that for those rare, rare collections where I've taken it close to 100 percent, that those are the best, and the most useful, and the least costly to MAINTAIN, whereas the 97s and the 80s, are much more high maintenance, and you will spend more time keeping them up to their relatively poor level of quality, than you will spend on a pristine, perfect database.
Also - perfection is just an internal goal for a data analyst. It's an internal challenge - can I take this sow's ear, and turn it into a silk purse? How quickly can I do that? Can I make an 80 percent silk purse? A 97 percent silk purse? Or - a 100 percent silk purse that requires NO maintenance for a long, long time?
It's just a self challenge, it does often come across as if the data analyst "expects it to be perfect" therefore, the analyst is often disregarded and considered not to be a realist, or called an "academic" or not considered to be a practical member of staff. This is simply unfair, I've reached the place I am at by understanding the DETAILS of databases, by doing countless, hours-long manual validations, and learning how to make the best quality data sets imaginable.
So I would say, based on the maintenance angle alone, that it actually is not smart, to produce poor quality or even "80 percent" quality data, because the number of remaining errors in it, could cost you thousands - they could be deadly to your business. Spend a little more upfront - SPEND TO SAVE - if you will - and you will find that you have a really clean, really powerful reference collection of data that is maintenance-free for a year or two, and can be used equally easily to inform current business projects (because we can trust it to such a high level) or indeed, creative and powerful data re-use projects (because we can trust it to such a high level). By going to 100 percent, you pretty much have eliminated the possibility of error, so, theoretically, any projects informed by this perfect data set, should succeed WILDLY, with risk of problems caused by data effectively at zero.
I would not say the same for either the 80 percent or the 97 percent models; three percent of data can still hold a massive amount of cruelly, horrifically WRONG information, which, if accidentally picked up as "truth", could be financially or even physically devastating to your organisation.
So please, I am now asking you, on behalf of all data analysts everywhere, please don't pre-judge us, please consider that by allowing us to build a few 100 percent high quality data collections, you are dramatically reducing maintenance costs whilst simultaneously increasing data trustworthiness and reliability to a massive, never before achieved high level.
I know which one I prefer, but that's because I like not having to do a lot of data maintenance. In practice, however, I am often asked to work with data that can be as bad as 20 or 30 percent, or maybe I am lucky one day, and I get some 50 percent data - data like this ALWAYS eats up far more time than clean data does; and getting this kind of data "ship shape" can be significantly more difficult depending on your own situation - and the maintenance of lower grade information is a nightmare, it's always needing work, just to keep it up to the crappy 60 percent it started out life at.
This is why, in my opinion, it would be better to take that 60 percent data, and fix it, ONE TIME, just ONCE, properly - then - the problems are ALL gone, the maintenance is almost all gone, it's dramatically reduced, and the trustworthiness is through the roof. The data is RELIABLE.
So yes - with this explanation, yes, I do want data to be perfect. Because it actually SAVES money, even though it means a large hit in time up front - the more ugly data sets that you allow me to beautify, the more your maintenance is reduced, and the more your reliability goes up and up and up.
I know how I would choose now.
Anyway, this is my opinion, I do feel very...annoyed when I see something that intimates that the search for perfection is somehow damaging or a bad idea - in this case, it's actually the BEST idea, based on the reasons I've expressed above. It's better to fix it ONCE, properly, and then maintain it well, with minimum effort I might add, from then on.
Otherwise - you will eat up far more time, working on poorer quality data, which takes many more hours to work on because it's so, so broken. I don't see how anyone can use data like that - unless there is no one or no process available to fix it, and you have no choice - but to me, having the cleanest, best data makes the best sense, because you are beating all of the bad experiences, and avoiding a lot of extra maintenance, by fixing your data sensibly, up front, spending to save - which is often the best way in many parts of the business world - is it not?
I hope this viewpoint is a new one to you - I have observed this happening so many times at so many different organisations, and I have never, until this moment, come out and said what I feel about it - so I hope this is helpful - please feel free to challenge back, but for me, seeking "perfection" is the only thing that makes sense, unless I want to forever be doing major rebuilds of the same crappy databases over and over and over and over again every few months.
My .03 pence.
Thank you oh patient ones,
Data And Technical Standards Officer
Back O'Hill Industrial Estate
Back O'Hill Road
Stirling FK8 1SH
staffordd at stirling.gov.uk<mailto:staffordd at stirling.gov.uk>