[ckan-discuss] Fwd: [uk-government-data-developers] Data Dumps at source.data.gov.uk
jonathan.gray at okfn.org
Tue Aug 24 14:55:30 BST 2010
---------- Forwarded message ----------
From: Leigh Dodds <leigh.dodds at talis.com>
Date: Tuesday, August 24, 2010
Subject: [uk-government-data-developers] Data Dumps at source.data.gov.uk
<uk-government-data-developers at googlegroups.com>
I've just put together an initial set of data dumps for the majority
of the Linked Data currently being published by data.gov.uk. More
information on what's not included and why in a moment.
(Disclaimer: what follows is my understanding of the current state of
play, so blame me for any errors/omissions :)
There is a server at http://source.data.gov.uk which has been set up
to provide access to both data dumps and (eventually) the code used to
generate/convert the data. The data dumps can be found at:
The intention is to create a repository of versioned datasets that
will allow anyone to mirror the data for their own use/purposes, e.g.
to perform local analysis or to host in your own triple store. Over
time this repository should become a complete archival copy of all of
the Linked Data that is published through data.gov.uk, complete with
information on the provenance of individual datasets.
The team behind data.gov.uk are still working through a number of the
best practices, so right now I've simply put up copies of all the
currently live datasets.
HOW THE DATA IS ORGANISED
The web archive is organised into a series of sub-directories:
* Sector - top-level sector, e.g. as used in *.data.gov.uk
* Dataset - dataset directory, a short identifier for the dataset.
I've made some of these up at present
* Date-stamped directory - in the format yyyy-mm-dd.
* Data files - This may be a number of data files in different
formats. E.g. the data may span a number of small files, some files may
be ntriples for loading into the default graph and some files may be
For example, the RDF version of Edubase currently available from
http://education.data.gov.uk can be found here:
with the general pattern being:
Currently only the latest versions of each dataset are being loaded
into the live SPARQL endpoints, but over time there will be a move
towards using named graphs for versioning (as described at ).
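As a rough illustration of the layout described above, here is a minimal Python sketch for picking the most recent date-stamped directory for a dataset and building its relative path. It assumes the directories nest in the order listed (sector, then dataset, then yyyy-mm-dd); the function names and the "edubase" identifier in the usage note are hypothetical and not taken from the server itself.

```python
from datetime import date, datetime
from typing import Iterable, Optional


def parse_snapshot(name: str) -> Optional[date]:
    """Return the date for a yyyy-mm-dd directory name, or None if it is not one."""
    try:
        return datetime.strptime(name.strip("/"), "%Y-%m-%d").date()
    except ValueError:
        return None


def latest_snapshot(names: Iterable[str]) -> Optional[str]:
    """Pick the most recent date-stamped entry from a directory listing."""
    dated = []
    for name in names:
        parsed = parse_snapshot(name)
        if parsed is not None:  # skip anything that is not a yyyy-mm-dd directory
            dated.append((parsed, name))
    return max(dated)[1] if dated else None


def snapshot_path(sector: str, dataset: str, snapshot: str) -> str:
    """Build a relative path, ASSUMING the nesting sector/dataset/yyyy-mm-dd."""
    return "/".join([sector, dataset, snapshot.strip("/")])
```

For instance, given a listing such as ["2010-08-01/", "2010-08-24/", "README"], latest_snapshot selects "2010-08-24/", and snapshot_path("education", "edubase", "2010-08-24/") gives a path relative to wherever the dumps are rooted.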
LINKED DATA, DATA DUMPS & SERVICES
The sector identifier ties together the Linked Data, the data dumps,
and the SPARQL endpoints and other services. For example, if you're
looking at some Linked Data, e.g.:
Then this data will be included in the SPARQL endpoint at:
The search interface at:
And the raw data can be found in one (or more) of the datasets accessible from:
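To make the role of the sector identifier concrete, here is a small Python sketch that pulls the sector out of a Linked Data URI by looking at the *.data.gov.uk hostname. The function name and the example URI path in the usage note are hypothetical; the sketch only assumes the hostname convention mentioned above, and it cannot tell whether a given prefix is actually a sector with dumps behind it.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hostname suffix shared by the sector sites, e.g. education.data.gov.uk
SUFFIX = ".data.gov.uk"


def sector_of(uri: str) -> Optional[str]:
    """Return the sector label from a *.data.gov.uk URI, or None if it has none."""
    host = (urlsplit(uri).hostname or "").lower()
    if not host.endswith(SUFFIX):
        return None
    sector = host[: -len(SUFFIX)]
    # Accept only a single label in front of the suffix
    return sector if sector and "." not in sector else None
```

Calling sector_of("http://education.data.gov.uk/id/school/100000") returns "education", which can then be used to look up the matching dump directory or service for that sector.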
WHAT IS NOT INCLUDED?
As I explained at the start of this email, not all of the Linked Data
being published by data.gov.uk, or by the UK government, is currently
represented in these data dumps.
The RDF available from legislation.gov.uk is currently only
available as Linked Data because it's surfaced directly from the
website. The same applies to the RDFa published on the London Gazette
website. It would be possible to regularly crawl and dump those sources,
but I'm not sure if there are plans to do that yet. Other departments
and projects may also surface their own data and data dumps.
The other dataset that is not represented in the dumps is the set of
date-time URIs available from reference.data.gov.uk, e.g. . as
these are all algorithmically generated. I don't recommend anyone
crawls those :)
If you have any questions, please ask.
Programme Manager, Talis Platform
leigh.dodds at talis.com
The Open Knowledge Foundation