[ok-scotland] Dave Stafford, Stirling Council - Scottish Government Open Data Strategy: Additional considerations - rating system to identify what processes data has been through

David Stafford staffordd at stirling.gov.uk
Thu Nov 20 08:45:05 UTC 2014

Absolutely, Claudia, you have stated this so much more elegantly than I did - thank you for this.
"Accuracy, quality, up to date, reliable, trustworthy data sets" - that is what we need, but, it's easy to say, not easy to actually do.
Yes, of course, as Claudia said, there will need to be a healthy dose of realism about what really can be done, and some data sets will resist all attempts to improve their quality - it is therefore absolutely imperative that our metadata does "rate" each data set in terms of content, quantity and quality - we need to know clearly how reliable, accurate, trustworthy and up to date our information is.
Metadata is key - the more we collect, the more we attach, the more ways we can query our metadata and the more diverse search criteria we can apply.  We need to be able to see, at a glance, whether the data is of the highest quality - 100 percent checked, every row and column - or whether it's a data set checked only to a lower percentage or level.  Is it a "ten", or a "ten with enhancements", or is it a "4", or maybe an "8"?  We need a reliable way to KNOW this when we go to use a data set for ANY purpose.  And the public need to know this too, since they may be basing work on the published data sets - so accuracy and up-to-date status in particular are crucial to developers and other folk trying to work with the data sets.
I like the idea of including "how much processing has been done" - and also, specifically, WHAT processing has been done.
If, for example, we could establish some sort of "standard" process steps, i.e. something like
1) Removed superfluous header rows, empty rows, and unnecessary characters - initial file clean up 
2) Data imported from its native state (which was -format name here-) into a database table for further cleansing
3) Data has now been formatted into clean, precise rows and columns
4) Other additional cleansing or changes to the original data format
5) Other additional parsing, formatting, and data element inspections - query to determine initial state of each column of information, note normalisation problems, etc.
6) Basic normalisation has been performed on the information, missing numeric values (zeros) have been restored, etc.
7) Extended normalisation has been performed on data values, normalising terms that are misspelt, abbreviated, truncated, etc.***
8) Basic validation queries have been run to check for accuracy, to check that the information is up to date, and to check data reliability and trustworthiness
9) Extensive specialised validation queries have been run to check for as many data anomalies as possible, and all outstanding anomalies have been corrected
10) Additional external validations have been performed to verify beyond all doubt that the information is accurate, up to date, reliable and trustworthy - phone calls, research, independent re-checking of all maths, calculations, etc. possible, queries to look for non-normalised values - anything and everything to bring it up to 80%
11) Every row and column has been validated, re-checked, and checked again - 90 percent of all data has been verified - full external validation and query validation
12) Every row and column has been validated, re-checked, and checked again - 100 percent of all data has been verified - full external validation, query validation, AND full manual validation - full manual validation means every cell, every value in every row and column, has been visually checked by an experienced data analyst, on top of the series of powerful select queries that search out and find every possible anomaly in the data set.
13) Data has been enriched by updating an empty column from a related data set (-data set ID here-) to increase its value, usability, or facility
14) Other additional value-adds or enhancements made to the data, either from charts of standard measurements, or other "lists" or standards, or any identified, reliable data source - enhancements absolutely need to be noted if they are present in a "finished" data set.  We may need to make such data sets available in "two flavours" - with and without the enhancements.  Some folk may need just the original dataset, not the additional enhanced data elements.
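Just to make the idea concrete, here is a minimal sketch (in Python, with purely illustrative names and an abridged step list - none of this is an agreed standard) of how a data set's passage through the steps above might be recorded, including the crucial "omitted steps must be noted" requirement:

```python
from dataclasses import dataclass, field
from datetime import date

# Illustrative labels for (a subset of) the fourteen steps listed above.
STANDARD_STEPS = {
    1: "Initial file clean-up",
    2: "Import from native format into a database table",
    3: "Format into clean rows and columns",
    6: "Basic normalisation",
    7: "Extended normalisation",
    8: "Basic validation queries",
    12: "Full manual validation (100 percent)",
    13: "Enrichment from related data set",
}

@dataclass
class ProcessingLog:
    """Records which standard steps a data set has been through."""
    dataset_id: str
    steps_done: list = field(default_factory=list)

    def record(self, step: int, when: date, note: str = "") -> None:
        if step not in STANDARD_STEPS:
            raise ValueError(f"Unknown step {step}")
        self.steps_done.append((step, when, note))

    def omitted(self, up_to: int) -> list:
        """Steps up to `up_to` that were skipped - these MUST be noted."""
        done = {s for s, _, _ in self.steps_done}
        return [s for s in STANDARD_STEPS if s <= up_to and s not in done]

# Hypothetical data set, part-way through processing:
log = ProcessingLog("hr-positions-2014")
log.record(1, date(2014, 11, 20))
log.record(2, date(2014, 11, 20), "from CSV")
log.record(3, date(2014, 11, 20))
print(log.omitted(up_to=7))  # normalisation steps 6 and 7 not yet done
```

The point is simply that the "what processing has been done" record becomes queryable metadata, rather than tribal knowledge.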
***  Regarding Step 7, Extended Normalisation, above: often in data we see many, many different representations of the same concept.  Here is a very common example taken from the world of Human Resources - a position name that has an extraordinary number of different "representations" or "versions":
Real term: Support For Learning Assistant
Data values found in a typical file:
Support For Learning Assistant
Support For Learning Asst.
Support For Learning Asst
Support For Learn. Assistant
Support For Learn Assistant
Supp. For Learning Assistant
Supp For Learning Assistant
Supp. For Learn. Asst.
Supp For Learn Asst
Supp For Learn Assistant
S. L. A.
...the possible permutations are nearly endless - that's just a few of them from memory.  So in this step, we take all of those variants and run an update query that changes them all to "Support For Learning Assistant".  That's "normalisation" at its most basic.
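That update step could be sketched like this (Python; the pattern and helper function are purely illustrative - in practice this would be an UPDATE query built from the variants actually found in the file):

```python
import re

# Canonical term, and one illustrative pattern covering the common
# abbreviation/truncation variants shown above.
CANONICAL = "Support For Learning Assistant"
PATTERN = re.compile(
    r"^(supp\.?|support)\s+for\s+(learn\.?|learning)\s+(asst\.?|assistant)$",
    re.IGNORECASE,
)

def normalise(value: str) -> str:
    """Collapse known variants of the position name to the canonical term."""
    v = value.strip()
    stripped = v.replace(" ", "").replace(".", "").upper()
    if PATTERN.match(v) or stripped == "SLA":
        return CANONICAL
    return v  # leave unrecognised values untouched for manual review

for variant in ["Supp. For Learn. Asst.", "S. L. A.", "Support For Learning Asst"]:
    print(normalise(variant))
```

Anything the pattern does not recognise is deliberately left alone, so it surfaces in the later validation queries instead of being silently mangled.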
The above "steps" are totally off the top of my head, and as it's very early in the morning, I have absolutely missed out several here, especially in the initial cleansing stages, but you get the idea.  We could say that a data set has been prepared "45 percent", which would mean that it's been checked from steps one through seven above.  A "100 percent" data set would be one that is essentially perfect, where we have done steps 1 through 12 (omitting number 11, possibly - we would have to have exceptions, as some steps might well be skipped - to go to a higher level of quality - you jump up to 11 get a 90 percent rating, or jump up to 12 to get a 100 percent rating.)   Omitted steps must ABSOLUTELY be noted, so if we did NOT externally validate the information against any external sources, then that needs to be noted.
"100 Percent Enhanced" - From Data Set -data set name here- would be items that have gone through steps 13 and / or 14 - enhancing the data by ADDING other data elements to it, from reliable data sources of many, many kinds.
We need some kind of overall number that means "accuracy, quality, up to date, reliable, trustworthy" - it could be a number rating from 1 to 10, or a percentage, or a descriptive rating - but we need to fully understand just what those values are for each data set, so we can use it with, or without, confidence in its "accuracy, quality, up to date, reliable, trustworthy" state.
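As a sketch of how such an overall number might be derived from the steps completed (the step-to-percentage thresholds here are illustrative, taken loosely from the figures above, and would of course need to be agreed):

```python
# Illustrative mapping from "highest qualifying step reached" to an
# overall percentage rating - NOT an agreed scale, just the idea.
RATING_BY_STEP = {7: 45, 8: 55, 9: 65, 10: 80, 11: 90, 12: 100}

def rating(steps_done: set) -> int:
    """Overall percentage rating for a data set, given the set of
    standard process steps it has been through."""
    best = 0
    for step, pct in sorted(RATING_BY_STEP.items()):
        if step in steps_done:
            best = max(best, pct)
    return best

print(rating({1, 2, 3, 4, 5, 6, 7}))   # checked through step 7
print(rating(set(range(1, 13))))       # everything through step 12
```

A single number like this, stored in the metadata alongside the list of omitted steps, is what lets anyone see "at a glance" how far a data set has been taken.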
Claudia - thank you, this is excellent.  Kate's questions were most excellent, and I felt they deserved a proper answer - but my responses should only be the spark for a larger debate - and your contribution here is most helpful - thanks.  Please - everyone, please join in, we need to get as many comments to Ewan as possible, so he has time (before November 30th - a mere ten days away!!!) to get them collated into a sensible form, and submitted to the Group.
Apologies for the extensive data analyst talk. :-)

Dave Stafford
Data And Technical Standards Officer
ICT Implementation
Unit 12
Back O'Hill Industrial Estate
Back O'Hill Road
Stirling FK8 1SH
staffordd at stirling.gov.uk
01786 233986

>>> Claudia Pagliari <Claudia.Pagliari at ed.ac.uk> 19/11/2014 16:18 >>>
Hello Dave,
Thanks for helpfully raising these important issues, and to Kate for her questions.  We absolutely do need to encourage the consistent application of agreed data processing methods if we are to generate valid and usable information sources for secondary analysis.  This is also a major challenge facing the new Administrative Data Research Centres, which are trying to harvest and link various pseudo-identified (not necessarily open) data for academic and policy research in a very mixed economy of structured and unstructured data types of hugely variable quality.  Documenting the process of data transformation is vital (your point 4) and it would be good to see your version-tracking methodology.  The Information Interoperability Standards are obviously going to be essential for lining up our national data ducks.  Clearly there are additional challenges when considering the value of richer linked datasets.   I think one of our key challenges in the data science community is to be realistic about what can or cannot be made available for secondary uses, for different levels of investment in data cleaning/processing. So any national Information Assets Registers will need to classify available assets according to their quality, usability and accessibility and provide relevant meta-data indicating how much processing has already been done. Pretty much what you have said.
Best wishes, 

On 19 Nov 2014, at 12:11, David Stafford <staffordd at stirling.gov.uk> wrote:

Hello Kate,
I am Dave Stafford, the Data & Technical Standards Officer at Stirling Council, and I very much want to have a go at answering your question number 2, and possibly, take a quick look at some of your other questions - these are all very thought-provoking, and we really do need to have a serious dialogue about all of these issues.
Kate - thank you for your posting, you raise some very good questions, and I think these are the beginnings of a number of questions and concerns that we will need to look at, decide, quite sharpish, if we are to meet the deadline for submitting our suggestions and comments.
I think no. 1 is easily taken care of - Ewan's reply covers it quite well, but it should definitely be spelled out somewhere.  We need to be quite precise with terminology, which takes me to...
2 - an excellent question, Kate: "what does 'data' cover?" - a question I've been answering (or mostly, trying to answer, to be more accurate) for a decade and a half.  I think it's actually a broader question; we need to define clearly, in the spec, the following:
   a) data
		    i) raw data
		    ii) "ordinary" data
   b) metadata
   c) interoperability standards
   d) information
   e) how images are to be handled
   f) how raw data that refuses to be converted is to be handled
...and probably many, many others.  One explanation for the question "what does 'data' cover" I have heard makes reasonable sense to me:
We start with raw data, which is "as received" - this can be anything from a very precise database table, to a very badly formatted Word or Powerpoint document - to an image - to just about anything - in any format, with any number of data problems, so that means, that the first step with raw data is this:
1) Convert Raw Data into Useable Data
This would be one of the things I've specialised in over the years: taking a huge, widely disparate range of incoming raw data and transforming it, using tools that help to standardise, normalise and improve data quality, until it becomes ordinary data - data that is in strict columns and rows, that can and should be stored as a database table, that is organised and preferably classified or categorised somehow, and that, when queried, produces answers that are concise, accurate, and very useful.
2) Then - we take the Useable Data, and finally, convert it into "Information".  Some people don't think this is a valid step, and I see their point - really, to my mind, "useable data" and "information" could be one and the same, but for the sake of argument, let's keep rolling with the "Information" version of this exercise:
3) Information, which is derived from your Useable Data sources (which in turn, were derived from unprocessed, raw data source) is now configured, perhaps selectively, maybe we only wish to use part of the Useable Data that is about to become Information, and we will need to format that output, preferably to a 5, but to start, to at least a 3.
4) Part of the process of converting Useable or "ordinary" data (basically, data that has been "cleaned up" or "cleansed", using data cleansing tools) has to involve, somehow, attaching metadata to your newly created and formatted "Information".  What form this takes needs to be agreed.   One mandatory element should absolutely be a Version Change Log: every Information Asset should have a properly formatted and constantly updated "Version Change Log", physically attached to the item if possible, or stored in a global "Metadata Database For Information Assets" at the very least, which should become an integral part of your Information Asset Register.
I personally have a set of Metadata fields that I have constructed over time, by cherry-picking the very best of various standards, along with my own wishes for very detailed metadata that will make finding and understanding what the content of our information is, very, very easy. I've also designed my own Version Change Log, and I have built it directly into the Information Asset Register, so that if an incoming asset has no existing change information, at least, going forward from the moment it is entered into the IAR, it will have Version Change Information - for the future.
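As a rough sketch of the kind of thing I mean (field names purely illustrative - this is not the agreed metadata standard, which, as I say, does not yet exist):

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ChangeEntry:
    """One entry in an asset's Version Change Log."""
    version: str
    when: datetime
    who: str
    description: str

@dataclass
class InformationAsset:
    """A hypothetical Information Asset Register entry that carries
    its own Version Change Log from the moment it is entered."""
    asset_id: str
    title: str
    change_log: list = field(default_factory=list)

    def log_change(self, version: str, who: str, description: str) -> None:
        self.change_log.append(
            ChangeEntry(version, datetime.now(), who, description))

# An incoming asset with no prior history at least gets a log going forward:
asset = InformationAsset("IAR-0001", "HR position names")
asset.log_change("1.0", "dstafford", "Entered into IAR; no prior history")
asset.log_change("1.1", "dstafford", "Extended normalisation of position names")
print([e.version for e in asset.change_log])
```

Built directly into the register like this, even an asset that arrives with no change history accumulates one from day one.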
But there is no one "Metadata" standard that we have agreed (or is there?) and if we haven't - we should.  Along with "ordinary" metadata...we now have this additional burden placed on us, our information needs to contain the agreed Interoperability Standards - please see below:
5) Special Interoperability Standards have recently been published by the Scottish Government, and I am strongly recommending that these three new standards be embedded in the finished "Information" that we are publishing into the National Grid.  Indeed, you have to use these new standards if you expect your data to be query-able alongside the data of the other 31 Councils, so that Information on a "National" level can be queried by the greater Scottish public, along with your own local Council information that makes up part of the 32 National data sets.
For the record, those standards are (per Peter Winstanley, Scottish Government):

So each piece of Information you create will need ordinary metadata (still undefined, I am afraid), plus these three standard additional metadata operators that allow your data to be queried "nationally".  Without these, your Information will be considered "ad hoc", which is unacceptable.  Your Information - which, by the way, you should have fully categorised and standardised, with Interoperability Standards already attached, and contained in a properly formatted "Information Asset Register" - that register (your beautiful new Information Asset Register) should be the data container from which you pull your "data sets".  They would then be fully converted, formatted Information, including all requisite metadata and interoperability information, ready to be integrated into the National data sets and fully query-able by anyone, in conjunction with the data sets published by the other 31 Councils.
And that's just the start.
I know that Tabitha Stringer is promising us a resource pack or toolkit, which of course I welcome very, very much; however, we need to be hyper-aware of everything I have noted above.  I've been having to prepare for this myself, and what I have written above is actually just the bare bones, most minimal version of the requirement.  Basically, the data we hold now is all ad hoc, and it's not yet stored in an Information Asset Register (I have built it, but we have not yet begun populating it) - so we have a LOT of work to do, as Councils, to get our ad hoc data - some of which is in a pretty "raw" state, some of which will be partially or completely cleansed and can now be called "ordinary" or usable data - transformed, literally transformed, into the exact data set, following the right criteria, and including all of the metadata and standards that will be mandatory sooner than you might imagine.  All of our data is going to have to become cleansed, powerful, high quality information, and the old days of ad hoc, just-throw-it-into-a-spreadsheet, are long, long gone - we have so, so much work to do.
However - maybe I have painted a dark picture here; it's just work, and it's not that hard to do - but it takes work, it takes time, and it takes a discriminating eye to be able to "see" that a data set has been created from pure, clean, properly attributed data - data that comes from an Information Asset Register and has all the accoutrements needed, all the metadata attributes needed, to qualify for the National Grid.  To be able to "see" a five, and to be able to determine when any data set is missing information or is otherwise not meeting the spec - and must then be downgraded to a 4 or even a 3 until its problems are sorted out.
Please discuss - these points are all critical to the success of this.  We need the resource pack / toolkit, yes, but we also need to understand the new Standards to which our Information is going to be held - and in my case, I realise that I very probably do not have a single data set that comes even CLOSE to meeting the new Standards, and if I don't light a fire both under myself and under my colleagues, we are going to get caught short, with a big pile of basically disorganised raw data, and no methods or processes in place to transform it into beautiful, high quality information, which is now expected as the norm.
3. I must confess, licensing is not my area - I hope someone else can answer Kate's query here?
4. Aiming for 3* does seem pragmatic, but I utterly agree with Kate.  In my opinion, we just about have the Strategy ready, and it will be published soon, along with other information about how to transform your data into useful, non-ad-hoc information that qualifies as a "5" - the toolkit, and advice from each other, too.  I say: why not configure your Information Asset Register to ONLY prepare data sets as "5s", and only accept something lower when there is no easy way to bring a certain data set up to a five?  I would make the DEFAULT setting for data, on input, 'make it a "5", store it as a "5", export it as part of a data set as a "5"' - then you have NOTHING to ever worry about.  I don't reckon it's THAT MUCH extra effort to bring a data set up from a 3 to a 5.  We have tools, we have knowledge, we are developing processes - so let's not shoot for the lower target, the easy way; let's do it right, and have the cleanest, highest quality data imaginable in our Information Asset Registers.  Start with the best, and only produce data sets as 4s or 3s if we absolutely must, if there is no way to make the data set a 5 without a massive undertaking.

What do you think?
And yes - it does look like "good stuff" - we just need to add what value we can, while we have the opportunity - and that moment is right now!
All the very, very best,
KATE'S POST from 20141118 22:12

Dave Stafford
Data And Technical Standards Officer
ICT Implementation
Unit 12
Back O'Hill Industrial Estate
Back O'Hill Road
Stirling FK8 1SH
staffordd at stirling.gov.uk
01786 233986
>>> Kate Byrne <k.byrne at ed.ac.uk> 18/11/2014 22:12 >>>
Thanks for posting this Ewan - very interesting and welcome. I have a few queries or points for discussion:

1. Who is covered by "all organisations", para 6? Is this the whole of Scottish public sector, eg including universities, Health Boards, museums and archives...?
" It is proposed that by the end of 2015 all organisations should have completed and published an open data publication plan, setting out their commitment to make data open and identifying what data they will make open. "

2. What does "data" cover? Will organisations have a free choice in deciding what to make open? Where does data end and metadata begin? Eg, if I have a high resolution photo (in an archive, say), is that data in the sense used here? If I describe it with a database record is that data or metadata? Presumably if I describe the format of the database record that's metadata. I note that some of these questions are left open in the document and there are some suggestions about exceptions (with Registers of Scotland being explicitly mentioned, for some reason). Are these areas likely to be firmed up a bit in a later draft?

3. What's the logic behind choosing the Open Government Licence instead of the suite of CC licences?

4. Aiming for 3* data seems a pragmatic plan, but in a way it seems a pity that it won't be "living" data, ie directly accessible (eg via an API) from the working dataset, rather than a snapshot that gets refreshed every so often. Is having a single "Scottish Data Discovery Site" the best way forward, or is there a risk of it containing fossilised data, abstracted from its context?

Overall, it looks like good stuff to me.


On 17/11/14 14:39, Ewan Klein wrote:

Dear all,

the Scottish Government's Data Management Board has commissioned an open data working group to develop an Open Data Strategy for Scotland by December 2014. Information about the working group is available at:
The working group has produced a draft Open Data Strategy document:
The intention is to publish the final version of this document in December, so any thoughts or comments on the draft would be appreciated by the end of November. You can send these directly to Tabitha.stringer at scotland.gsi.gov.uk. Alternatively, we could have some discussion on this list, and I will compile and forward the discussion to Tabitha.

Best regards.


Ewan Klein
Open Knowledge Ambassador for Scotland
Skype:  ewan.h.klein |  @ewanhklein
Open Knowledge  http://scot.okfn.org/  |  @okfnscot

ok-scotland mailing list
ok-scotland at lists.okfn.org
https://lists.okfn.org/mailman/listinfo/ok-scotland
Unsubscribe: https://lists.okfn.org/mailman/options/ok-scotland
Kate Byrne
School of Informatics, University of Edinburgh
http://homepages.inf.ed.ac.uk/kbyrne3/
location: http://geohash.org/gcvwr2rkb5hd
twitter: @katefbyrne

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

****************************************************************** This email and any attachments are intended solely for the individual or organisation to which they are addressed and may be confidential and/or legally privileged. If you have received this email in error please forward it to servicedesk at stirling.gsx.gov.uk and then delete it. Please check this email and any attachments for the presence of viruses as Stirling Council accepts no liability for any harm caused to the addressees' systems or data. Stirling Council may monitor its email system. Stirling Council accepts no liability for personal emails.


