[ok-scotland] Dave Stafford, Stirling Council - Scottish Government Open Data Strategy: Draft document open for comment - raw data; data; information; metadata; interoperability

Claudia Pagliari Claudia.Pagliari at ed.ac.uk
Wed Nov 19 16:18:53 UTC 2014


Hello Dave,
Thank you for raising these important issues, and thanks to Kate for her questions.  We absolutely do need to encourage the consistent application of agreed data processing methods if we are to generate valid and usable information sources for secondary analysis.  This is also a major challenge facing the new Administrative Data Research Centres, which are trying to harvest and link various pseudo-identified (not necessarily open) data for academic and policy research in a very mixed economy of structured and unstructured data types of hugely variable quality.  Documenting the process of data transformation is vital (your point 4), and it would be good to see your version-tracking methodology.  The Information Interoperability Standards are obviously going to be essential for lining up our national data ducks.  Clearly there are additional challenges when considering the value of richer linked datasets.  I think one of our key challenges in the data science community is to be realistic about what can or cannot be made available for secondary uses at different levels of investment in data cleaning and processing.  So any national Information Asset Registers will need to classify available assets according to their quality, usability and accessibility, and provide relevant metadata indicating how much processing has already been done.  Pretty much what you have said.
Best wishes, 
Claudia
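
As a rough illustration of the kind of register entry described above (assets classified by quality, usability and accessibility, with metadata recording how much processing has been done and a version change log), here is a minimal Python sketch. The field names, the 1-5 scoring scale, the three processing levels and the `ready_for_secondary_use` threshold are all assumptions made for illustration; none of them comes from an agreed standard or from Dave's actual register.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import List, Optional


class ProcessingLevel(Enum):
    """Hypothetical stages, loosely following the raw -> useable -> information steps."""
    RAW = "raw"                  # as received, no cleaning applied
    USEABLE = "useable"          # cleansed, tabular, queryable
    INFORMATION = "information"  # formatted for publication, metadata attached


@dataclass
class ChangeLogEntry:
    version: str
    changed_on: date
    description: str


@dataclass
class RegisterEntry:
    title: str
    processing_level: ProcessingLevel
    quality: int         # assumed 1-5 score, not an agreed scale
    usability: int       # assumed 1-5 score
    accessibility: int   # assumed 1-5 score
    change_log: List[ChangeLogEntry] = field(default_factory=list)

    def record_change(self, version: str, description: str,
                      changed_on: Optional[date] = None) -> None:
        """Append an entry to this asset's version change log."""
        self.change_log.append(
            ChangeLogEntry(version, changed_on or date.today(), description)
        )


def ready_for_secondary_use(entry: RegisterEntry, minimum: int = 3) -> bool:
    """Illustrative rule: not raw, and every score at or above the minimum."""
    lowest = min(entry.quality, entry.usability, entry.accessibility)
    return entry.processing_level is not ProcessingLevel.RAW and lowest >= minimum


if __name__ == "__main__":
    # The asset title below is made up for the example.
    asset = RegisterEntry("Example asset", ProcessingLevel.RAW, 2, 2, 4)
    asset.record_change("0.1", "Entered into register as received")
    print(ready_for_secondary_use(asset))

    asset.processing_level = ProcessingLevel.USEABLE
    asset.quality, asset.usability = 4, 4
    asset.record_change("0.2", "Cleansed and normalised into a table")
    print(ready_for_secondary_use(asset))
```

Keeping the change log on the entry itself means an asset that arrives with no history still accumulates one from the moment it is registered, which is the behaviour Dave describes for his own register.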

On 19 Nov 2014, at 12:11, David Stafford <staffordd at stirling.gov.uk> wrote:

> Hello Kate,
>  
> I am Dave Stafford, the Data & Technical Standards Officer at Stirling Council, and I very much want to have a go at answering your question number 2, and possibly, take a quick look at some of your other questions - these are all very thought-provoking, and we really do need to have a serious dialogue about all of these issues.
>  
>  
>  
> Kate - thank you for your posting. You raise some very good questions, and I think these are just the beginning of a number of questions and concerns that we will need to look at and decide on, quite sharpish, if we are to meet the deadline for submitting our suggestions and comments.
>  
> I think no. 1 is easily taken care of - Ewan's reply covers it quite well, but it should definitely be spelled out somewhere.  We need to be quite precise with terminology, which takes me to...
>  
> 2 - an excellent question, Kate: what does "data" cover? It is a question I've been answering (or, more accurately, trying to answer) for a decade and a half. I think it's actually a broader question; we need to define clearly, in the spec, the following:
>  
>    a) data
>             i) raw data
>             ii) "ordinary" data
>    b) metadata
>    c) interoperability standards
>    d) information
>    e) how images are to be handled
>    f) how raw data that refuses to be converted is to be handled
>  
> ...and probably many, many others.  One answer I have heard to the question "what does 'data' cover" makes reasonable sense to me:
>  
> We start with raw data, which is "as received". This can be anything from a very precise database table, to a very badly formatted Word or PowerPoint document, to an image, to just about anything, in any format, with any number of data problems. That means that the first step with raw data is this:
>  
> 1) Convert Raw Data into Useable Data
>  
> This would be one of the things I've specialised in over the years: taking a huge, widely disparate range of incoming raw data and transforming it, using tools that help to standardise, normalise and improve data quality, until it becomes ordinary data. That is data that is in strict columns and rows, that can and should be stored as a database table, that is organised and preferably classified or categorised somehow, and that can be queried and, when queried, produces answers that are concise, accurate and very useful.
>  
> 2) Then - we take the Useable Data, and finally, convert it into "Information".  Some people don't think this is a valid step, and I see their point - really, to my mind, "useable data" and "information" could be one and the same, but for the sake of argument, let's keep rolling with the "Information" version of this exercise:
>  
> 3) Information, which is derived from your Useable Data sources (which, in turn, were derived from unprocessed raw data sources), is now configured, perhaps selectively: we may only wish to use part of the Useable Data that is about to become Information. We will also need to format that output, preferably to a 5, but to start, to at least a 3.
>  
> 4) Part of the process of converting Useable or "ordinary" data (basically, data that has been "cleaned up" or "cleansed" using data cleansing tools) into Information has to be attaching metadata to your newly created and formatted "Information".  What form this takes needs to be agreed.  One mandatory element should absolutely be a Version Change Log: every Information Asset should have a properly formatted and constantly updated Version Change Log that is physically attached to the item if possible or, at the very least, stored in a global "Metadata Database For Information Assets", which should become an integral part of your Information Asset Register.
>  
> I personally have a set of Metadata fields that I have constructed over time, by cherry-picking the very best of various standards, along with my own wishes for very detailed metadata that will make finding and understanding what the content of our information is, very, very easy. I've also designed my own Version Change Log, and I have built it directly into the Information Asset Register, so that if an incoming asset has no existing change information, at least, going forward from the moment it is entered into the IAR, it will have Version Change Information - for the future.
>  
> But there is no one "Metadata" standard that we have agreed (or is there?), and if we haven't agreed one, we should.  Along with "ordinary" metadata, we now have an additional burden placed on us: our information needs to incorporate the agreed Interoperability Standards. Please see below:
>  
> 5) Special Interoperability Standards have recently been published by the Scottish Government, and I strongly recommend that these three new standards be embedded in the finished "Information" that we are publishing into the National Grid.  Indeed, you have to use these new standards if you expect your data to be query-able alongside the data of the other 31 Councils, so that Information on a "National" level can be queried by the greater Scottish public, along with your own local Council information that makes up part of the 32 National data sets.
>  
> For the record, those standards are (per Peter Winstanley, Scottish Government):
>  
> 1) DCAT-AP - DCAT APPLICATION PROFILE (FOR DATA PORTALS IN EUROPE)
> 2) ADMS - ASSET DESCRIPTION METADATA SCHEMA
> 3) VoID - VOCABULARY OF INTERLINKED DATASETS
>  
>  
> So each piece of Information you create will need ordinary metadata (still undefined, I am afraid) and these three standard additional metadata operators as well, which allow your data to be queried "nationally". Without these, your Information will be considered to be "ad hoc", which is unacceptable.
>  
> Your Information should, by the way, be fully categorised and standardised, with Interoperability Standards already attached, and contained in a properly formatted "Information Asset Register". That register (your beautiful new Information Asset Register) should be the data container from which you pull your "data sets". They would basically be fully converted, formatted Information, including all requisite metadata and interoperability information, ready to be integrated into the National data sets and then fully query-able by anyone, in conjunction with the data sets published by the other 31 Councils.
>  
>  
> And that's just the start.
>  
> I know that Tabitha Stringer is promising us a resource pack or toolkit, which of course I welcome very, very much. However, we need to be hyper-aware of everything I have noted above. I've been having to prepare for this myself, and what I have written above is actually just the bare-bones, most minimal version of the requirement.
>  
> Basically, the data we hold now is all ad hoc, and it's not yet stored in an Information Asset Register (I have built it, but we have not yet begun populating it). So we have a LOT of work to do, as Councils, to get our ad hoc data transformed. Some of it is in a pretty "raw" state; some of it will be partially or completely cleansed and can now be called "ordinary" or usable data. All of it has to be transformed, literally transformed, into the exact data set, following the right criteria and including all of the metadata and standards that will be mandatory sooner than you might imagine. All of our data is going to have to become cleansed, powerful, high quality information. The old days of ad hoc, "just throw it into a spreadsheet" are long, long gone. We have so, so much work to do.
>  
>  
> However, maybe I have painted a dark picture here. It's just work, and it's not that hard to do, but it takes time, and it takes a discriminating eye to be able to "see" that a data set has been created from pure, clean, properly attributed data: data that comes from an Information Asset Register and has all the accoutrements, all the metadata attributes, needed to qualify for the National Grid.  It also takes a discriminating eye to "see" a five, and to determine when a data set is missing information or is otherwise not meeting the spec, and must then be downgraded to a 4 or even a 3 until its problems are sorted out.
>  
>  
> Questions?
>  
> Please discuss. These points are all critical to the success of this. We need the resource pack / toolkit, yes, but we also need to understand the new Standards to which our Information is going to be held. In my case, I realise that I very probably do not have a single data set that comes even CLOSE to meeting the new Standards. If I don't light a fire both under myself and under my colleagues, we are going to get caught short with a big pile of basically disorganised raw data, and have no methods or processes in place to transform it into beautiful, high quality information, which is now expected as the norm.
>  
> 3. I must confess, licensing is not my area. I hope someone else can answer Kate's query here?
>  
> 4. Aiming for 3 does seem pragmatic, but I utterly agree with Kate. In my opinion, we just about have the Strategy ready; it will be published soon, along with other information about how to transform your data into useful, non-ad-hoc information that qualifies to become a "5", plus the toolkit, and advice from each other, too. So I say: why not just configure your Information Asset Register to ONLY prepare data sets as "5s", and only accept something lower when there is no easy way to bring a certain data set up to a five?  I would make the DEFAULT setting for data, on input, 'make it a "5", store it as a "5", export it as part of a data set as a "5"'. Then you have NOTHING to ever worry about.  I don't reckon it's THAT MUCH extra effort to bring a data set up from a 3 to a 5.  We have tools, we have knowledge, we are developing processes. So let's not shoot for the lower target, the easy way. Let's do it right, and have the cleanest, highest quality data imaginable in our Information Asset Registers.  Start with the best, and only produce data sets at 4 or 3 if we absolutely must, if there is no way to make the dataset a 5 without a massive undertaking.
> 
> What do you think?
>  
>  
>  
> And yes - it does look like "good stuff" - we just need to add what value we can, while we have the opportunity - and that moment is right now!
>  
>  
>  
> All the very, very best,
>  
>  
> Dave
>  
>  
>  
> KATE'S POST from 20141118 22:12
> 
> 
>  
>  
>  
>  
>  
> Dave Stafford
> Data And Technical Standards Officer
> ICT Implementation
> Unit 12
> Back O'Hill Industrial Estate
> Back O'Hill Road
> Stirling FK8 1SH
>  
> staffordd at stirling.gov.uk
> 01786 233986
> >>> Kate Byrne <k.byrne at ed.ac.uk> 18/11/2014 22:12 >>>
> > Thanks for posting this Ewan - very interesting and welcome. I have a few queries or points for discussion:
> 
> 1. Who is covered by "all organisations", para 6? Is this the whole of Scottish public sector, eg including universities, Health Boards, museums and archives...?
> " It is proposed that by the end of 2015 all organisations should have completed and published an open data publication plan, setting out their commitment to make data open and identifying what data they will make open. "
> 
> 2. What does "data" cover? Will organisations have a free choice in deciding what to make open? Where does data end and metadata begin? Eg, if I have a high resolution photo (in an archive, say), is that data in the sense used here? If I describe it with a database record is that data or metadata? Presumably if I describe the format of the database record that's metadata. I note that some of these questions are left open in the document and there are some suggestions about exceptions (with Registers of Scotland being explicitly mentioned, for some reason). Are these areas likely to be firmed up a bit in a later draft?
> 
> 3. What's the logic behind choosing the Open Government Licence instead of the suite of CC licences?
> 
> 4. Aiming for 3* data seems a pragmatic plan, but in a way it seems a pity that it won't be "living" data, ie directly accessible (eg via an API) from the working dataset, rather than a snapshot that gets refreshed every so often. Is having a single "Scottish Data Discovery Site" the best way forward, or is there a risk of it containing fossilised data, abstracted from its context?
> 
> Overall, it looks like good stuff to me.
> 
> Kate
> 
> 
> On 17/11/14 14:39, Ewan Klein wrote:
>> Dear all,
>> 
>> the Scottish Government's Data Management Board has commissioned an open data working group to develop an Open Data Strategy for Scotland by December 2014. Information about the working group is available at:
>> 
>> http://www.scotland.gov.uk/Topics/Economy/digital/digitalservices/datamanagement/datainnovation/OpenDataStrategy
>> 
>> The working group has produced a draft Open Data Strategy document:
>> 
>> http://www.scotland.gov.uk/Topics/Economy/digital/digitalservices/datamanagement/MeetingsandPublications/DMBMeetingFive/DMBMeeting5Paper3
>> 
>> The intention is to publish the final version of this document in December, so any thoughts or comments on the draft would be appreciated by the end of November. You can send these directly to Tabitha.stringer at scotland.gsi.gov.uk. Alternatively, we could have some discussion on this list, and I will compile and forward the discussion to Tabitha.
>> 
>> 
>> Best regards.
>> 
>> Ewan
>> 
>> -------------
>> Ewan Klein
>> Open Knowledge Ambassador for Scotland
>> Skype:  ewan.h.klein |  @ewanhklein
>> Open Knowledge
>> http://scot.okfn.org/  |  @okfnscot
>> 
>> _______________________________________________
>> ok-scotland mailing list
>> ok-scotland at lists.okfn.org
>> https://lists.okfn.org/mailman/listinfo/ok-scotland
>> Unsubscribe: https://lists.okfn.org/mailman/options/ok-scotland
>> 
> 
> -- 
> Kate Byrne
> School of Informatics, University of Edinburgh
> http://homepages.inf.ed.ac.uk/kbyrne3/
> location: http://geohash.org/gcvwr2rkb5hd
> twitter: @katefbyrne
> 
> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
