[okfn-help] Fwd: what's a dataset? -- Re: Public Data Catalog Priorities and Demand

Jonathan Gray jonathan.gray at okfn.org
Tue Dec 22 17:44:43 GMT 2009


Noticed this thread on the UK Gov Developers list and wondered whether
it would be worth pitching in with our thinking on this from the
perspective of our CKAN work?

J.

---------- Forwarded message ----------
From: David Pullinger <David.Pullinger at coi.gsi.gov.uk>
Date: Tue, Dec 22, 2009 at 3:31 PM
Subject: Re: what's a dataset? -- Re: Public Data Catalog Priorities and Demand
To: Niemann.Brand at epamail.epa.gov, Jose Manuel Alonso
<josema.alonso at fundacionctic.org>
Cc: Joe Carmel <joe.carmel at comcast.net>, Steven Clift
<clift at e-democracy.org>, Antti Poikola <antti.poikola at gmail.com>,
chris-beer at grapevine.net.au, sunlightlabs at groups.google.com,
'Acar, Suzanne' <suzanne.acar at ic.fbi.gov>, Jonathan Gray
<jonathan.gray at okfn.org>, public-egov-ig at w3.org


Brand,

I found that helpful.  From my experience of seeing statistical data
being gathered together into coherent datasets, I was musing on a
description that goes along these lines, expanding up from a single
datapoint through larger units:

- data  (e.g. 24)
- metadata about that data (e.g. number of additional deaths in
England due to sub-zero temperatures in December 2009)
- bibliographic data (e.g. published January 2010, Office for Health
Statistics, subject)
- contextual metadata (author, contact, etc.)
- dataset (set of items like those above, e.g. deaths due to
environmental causes)
- bibliographic data on dataset (dataset published, organisation, etc.)
- contextual metadata on dataset (author, contact, etc. for the dataset)

...with that data and metadata, of course, being structured in RDF(a)
or some equivalent.
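
One way to picture the layering above is the minimal sketch below. It
uses plain Python dataclasses rather than RDF(a), and every class and
field name in it (DataPoint, Dataset, Bibliographic, Contextual) is an
assumption made up for illustration, not part of any agreed vocabulary.

```python
from dataclasses import dataclass, field


@dataclass
class Bibliographic:
    """Publication details (hypothetical field names)."""
    published: str
    organisation: str
    subject: str = ""


@dataclass
class Contextual:
    """Who to ask about the data (hypothetical field names)."""
    author: str = ""
    contact: str = ""


@dataclass
class DataPoint:
    value: float                      # the data itself, e.g. 24
    description: str                  # metadata about that data
    bibliographic: Bibliographic
    context: Contextual = field(default_factory=Contextual)


@dataclass
class Dataset:
    title: str
    bibliographic: Bibliographic      # bibliographic data on the dataset
    context: Contextual = field(default_factory=Contextual)
    items: list = field(default_factory=list)


# Values taken from the examples in the list above.
source = Bibliographic(published="January 2010",
                       organisation="Office for Health Statistics")
point = DataPoint(
    value=24,
    description="number of additional deaths in England due to "
                "sub-zero temperatures in December 2009",
    bibliographic=source,
)
dataset = Dataset(title="deaths due to environmental causes",
                  bibliographic=source, items=[point])
```

The same bibliographic and contextual records appear at both the
datapoint and the dataset level, which is the repetition the list
describes.
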

Building on this, the following is a prompt list I drew up to help
people across government identify data that could be usefully put into
re-usable form for third parties:


Types of data that might be helpfully sought out (bearing in mind
information and data might fit into a number of these types):



A  Lists – especially where these are reference lists (i.e. that are
used by others as source lists)

Examples:  Ministers, Government Departments, Public Bodies, Regions,
Local government bodies, hospitals, schools, courts, dogs classed as
dangerous



B  Point data regularly issued (time series)

Examples:  Average class size, hospital waiting lists, Gross Domestic
Product, violent crime, public service performance, environment
quality



C Policy information

Examples:  tax bands, benefit determination criteria



D  Datasets collected at one point in time

Examples:  Population census, surveys, research



E Information containing data of interest that is regularly published

Examples:  Statutory notices, job vacancies, consultations,
legislation, contractual opportunities, press releases



F Data associated with location (i.e. any information with a
geographical location, whether or not it also has other dimensions
such as time)

Examples:  traffic information, roadworks, Ministerial visits,
planning applications, locations of public transport (trains, buses,
trams, ferries, etc), address files (non-personal).


Some of these would have a 'dataset' that is a time series, others a
list etc. I agree the key is having an ontology that relates the
different parts of a dataset in the way that Brand describes.

Best seasonal greetings,

David

David Pullinger
david.pullinger at coi.gsi.gov.uk
Head of Digital Policy
Central Office of Information
Hercules House
7 Hercules Road
London SE1 7DU
020 7261 8513
07788 872321

Twitter #digigov and blogs:  www.coi.gov.uk/blogs/digigov

>>> <Niemann.Brand at epamail.epa.gov> 22/12/2009 13:56 >>>
Jose, This is the way I look at this - well-constructed data tables
consist of a combination of data elements that subject matter experts
/ statisticians agree make sense together (not apples and oranges, as
we say), and databases consist of multiple data tables that make sense
together. Even better is to have an ontology that relates them and all
their data elements.
This is what I have been recommending for Data.gov for some time now.
Best wishes for the holiday season. Brand

-----public-egov-ig-request at w3.org wrote: -----

To: chris-beer at grapevine.net.au
From: Jose Manuel Alonso <josema.alonso at fundacionctic.org>
Sent by: public-egov-ig-request at w3.org
Date: 12/21/2009 01:33PM
cc: "Antti Poikola" <antti.poikola at gmail.com>, "Joe Carmel"
<joe.carmel at comcast.net>, "'Jonathan Gray'" <jonathan.gray at okfn.org>,
"'Steven Clift'" <clift at e-democracy.org>, public-egov-ig at w3.org,
sunlightlabs at groups.google.com, "'Acar, Suzanne'"
<suzanne.acar at ic.fbi.gov>
Subject: what's a dataset? -- Re: Public Data Catalog Priorities and Demand

>> * If there is, let's say some thousand, datasets in data.gov, is there
>> any analysis or wild guesses of how many is missing 10 000, 50 000,
>> 100 000, 500 000?
>
> I'd say most, but that there is probably not as many as you'd think. And
> I'm including data.gov.* in that. (We have to as a group remain
> international in focus :) ) Many seem to include "views" of data in the
> term "missing datasets" - I think that if one could identify what datasets
> are primary (something I'll expand on once I find what you meant above),
> then we could generate a lot of other datasets from these. I guess what
> I'm saying is it's probably just as important to ask how many datasets out
> there are dependent on other datasets for their data.

Ok, so I told you this one deserved its own separate message. This is
something we've been discussing at CTIC for quite a while: what is a
dataset, how would you define it? How would you count how many you've
published?

Is the "2005 Toxics Release Inventory data for the state of Alaska"
one dataset?
Is "Toxics Release Inventory data for the state of Alaska" one dataset?
Is "Toxics Release Inventory data for all the states" one dataset?

If all of the above are datasets (even if not), how many is data.gov
publishing?
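
To make the question concrete, here is a minimal sketch in which the
three candidates above are all "views" filtered out of one primary
table. The rows are placeholder values, not real Toxics Release
Inventory figures, and the view() helper is a hypothetical name used
only for this illustration.

```python
# Placeholder rows, NOT real Toxics Release Inventory figures.
primary = [
    {"year": 2004, "state": "Alaska", "amount": 1},
    {"year": 2005, "state": "Alaska", "amount": 2},
    {"year": 2005, "state": "Texas", "amount": 3},
]


def view(rows, **criteria):
    """Return the rows whose fields match every given criterion."""
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]


alaska_2005 = view(primary, year=2005, state="Alaska")  # first candidate
alaska_all = view(primary, state="Alaska")              # second candidate
all_states = view(primary)                              # third candidate

print(len(alaska_2005), len(alaska_all), len(all_states))  # 1 2 3
```

Under this reading only the primary table is published once, and the
other candidates are derived from it on demand.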

In one of the projects I'm currently involved in, the government is
about to publish information about all the public buildings. Is this
one dataset?

What if the government publishes just the information of the public
schools? One dataset?
Then, the one about hospitals... one dataset?
But these two types of buildings (and several other types) are part of
the big dataset, so is this really a dataset or a subset of the big
one? How many should I count? One? Three?

Unfortunately, I believe I don't have a good answer. I tried for a
long while, telling myself a dataset should be anything that is
meaningful as a separate entity and that datasets can be combined into
super-datasets. Example: public schools is one dataset, hospitals is
another one, public buildings is another one, but are those three
datasets? hmm... maybe we should only count the smaller ones?
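
A minimal sketch of why the count depends on the rule chosen, assuming
a catalog that records, for each dataset, the super-dataset it belongs
to. The structure and the count_datasets() name are hypothetical and
only meant to illustrate the point.

```python
# Hypothetical catalog: dataset name -> parent super-dataset (None = top level).
parents = {
    "public buildings": None,
    "public schools": "public buildings",
    "hospitals": "public buildings",
}


def count_datasets(parents):
    """Count the same catalog under three different counting rules."""
    has_children = {p for p in parents.values() if p is not None}
    return {
        "all": len(parents),
        "top_level": sum(1 for p in parents.values() if p is None),
        "leaves": sum(1 for name in parents if name not in has_children),
    }


counts = count_datasets(parents)
print(counts)  # {'all': 3, 'top_level': 1, 'leaves': 2}
```

The same three entries give three, one or two datasets depending on
whether everything, only the super-dataset, or only the smaller ones
are counted.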

What if instead of hospitals, we talk about "healthcare related
centers" such as: hospitals, ER, GPs, Dentists, Pharmacies, Opticians
(taken from NHS.UK). Hey, do we now have six datasets? Or just a big one,
with the six smaller ones each just "a class of" the big one...

Btw, does the number really matter? Or would it be better just to
catalog in terms of knowledge areas?

Unless we (at large) can agree on what a dataset is and how datasets
should be counted, I believe talking about numbers makes little sense.

Let the discussion (go on) begin... :)

-- Jose







-- 
Jonathan Gray

Community Coordinator
The Open Knowledge Foundation
http://www.okfn.org


