[okfn-discuss] Open Data Openness and Licensing
rufus.pollock at okfn.org
Mon Feb 2 13:12:57 UTC 2009
I've been meaning to put something down on 'open data' for some time.
The motivation for actually doing something now was recent
discussions with John Wilbanks and then the post by Michael Nielsen
and the related thread on okfn-discuss [nielsen-thread].
I'd really appreciate any comments people have. For those who prefer
reading websites rather than lists I've also posted the text up at:
Lastly there are two appendices for this essay which I've left out in
the interests of length. One is on 'Facts and Databases' the other is
'Comments on the Science Commons Protocol'. I'd be happy to send them
along if anyone wants them.
# Open Data Openness and Licensing #
## Why does this matter?
Why bother about openness and licensing for data? After all they don't
matter in themselves: what we really care about are things like the
progress of human knowledge or the freedom to understand and share.
However, open data is crucial to progress on these more fundamental
items. It's crucial because open data is so much easier to break up
and recombine, to use and reuse. We therefore want people to have
incentives to make their data open and for open data to be easily
usable and reusable -- i.e. for open data to form a 'commons'.
A good definition of openness acts as a standard that ensures
different open datasets are 'interoperable' and therefore do form a
commons. Licensing is important because it reduces uncertainty.
Without a license you don't know where you, as a user, stand: when are
you allowed to use this data? Are you allowed to give it to others? To
distribute your own changes, etc?
Together, a definition of openness, plus a set of conformant licenses
deliver clarity and simplicity. Not only is interoperability ensured
but people can know at a glance, and without having to go through a
whole lot of legalese, what they are free to do. (For more see [this
article][why] and [this post][explicit]).
**Thus, licensing and definitions are important even though they are
only a small part of the overall picture. If we get them wrong they
will keep on getting in the way of everything else. If we get them
right we can stop worrying about them and focus our full energies on
Over the last couple of years there has been substantial discussion
about the licensing (or not) of (open) data and what 'open' should
mean. In this debate there are two distinct, but related, strands:
1. Some people have argued that licensing is inappropriate (or
unnecessary) for data.
2. Disagreement about what 'open' should mean. Specifically: does
openness allow for attribution and share-alike 'requirements' or
should 'open' data mean 'public domain' data?
These points are related because arguments for the inappropriateness
of licensing data usually go along the lines: data equates to facts
over which no monopoly IP rights can or should be granted; as such all
data is automatically in the public domain and hence there is nothing
to license (and worse 'licensing' amounts to an attempt to 'enclose'
the public domain).
However, even those who think that open data can/should only be public
domain data still agree that it is reasonable and/or necessary to have
some set of community 'rules' or 'norms' governing usage of data.
Therefore, the question of what requirements should be allowed for
'open' data is a common one, whatever one's stance on the PD question.
Of course, even with agreement on requirements, there is still the
question of whether these should be 'enforced' through a license or
via community norms. To summarize, the three main questions are:
**Qu 1: Is it important to license?**
**Qu 2: What 'restrictive' requirements are compatible with openness?
In particular does 'open' equate to PD only or are attribution and
share-alike 'requirements' permitted?**
**Qu 3: Community norms or licenses? Should 'community norms' or
license terms be used in order to encode requirements such as
attribution and share-alike?**
Below I look at each of these in turn, laying out, as I see it, the
current consensus and expressing my own view.
## Question 1: Is it Important to License?
The simple answer here is yes. Whether one likes it or not there are a
whole bunch of jurisdictions where there are IP rights in data(bases).
Note that this does **not** imply any monopoly rights in any facts
that data represents.
Thus, even if you just want your data to be in the 'public domain',
you need to apply a license -- or something very closely resembling a
license. (A suitable example is the Open Data Commons [Public Domain
Dedication and License][pddl]).
## Question 2: What Should Openness Allow?
Despite the sometimes heated discussion, there is, in fact, broad
agreement: openness means freedom to use and reuse data in any way you
wish. The only debate is over what, if any, conditions can be imposed
when allowing use and reuse. In particular, following the example of
the software and content domains, the following two items have been
proposed as permissible exceptions to the basic rule of 'allow
1. Requirement of attribution (in a non-burdensome manner)
2. Requirement to share-alike (a reuser of share-alike material
must, when making publicly available their own material, make it
openly available under a similar share-alike license)
Everyone agrees that requiring attribution is OK. Furthermore, it is
also now generally accepted that having this requirement in a license
is not a problem.
(In the original [Protocol for Implementing Open Access
Data][protocol] attribution was alleged to be problematic due to a
potential for 'attribution stacking'. However, these concerns appear
to have been allayed. To my mind, it was never clear why data needed
to be different: code and content both have plenty of examples of
projects with many contributors, much reuse *and* an attribution
Share-alike provisions are more controversial. It has been argued that
share-alike conditions are problematic because of the potential for
incompatibility between two share-alike licenses (or community norms).
At the same time share-alike may provide an important incentive for
individuals and communities to make their data openly available since
it provides some assurance that this data will remain open. Thus, any
evaluation comes down to the balance between:
1. The costs, if any, of allowing share-alike in terms of e.g.
complexity and compatibility.
2. The benefits, if any, that share-alike provides by encouraging the
creation of open data in the first place and in ensuring subsequent
'sharing back' by those who build upon that data.
In my view the benefits are substantial while the costs are not.
Incompatibility can largely be avoided by only 'approving' share-alike
licenses that are compatible. At the same time, share-alike enshrines
a principle that is important to many communities in the code and
content spheres and the same seems true of data (consider e.g. Open Street
(Aside: it is important to emphasize that permitting share-alike does
not mean it must be used. In fact, a particular community could
recommend against using share-alike as, for example, the Python
community does for code hoping to make it into its standard library.)
## Question 3: Licenses versus Community Norms
Even if a basic license is used it can be argued that any
'requirements' for attribution or share-alike should not be in a
license but in 'community norms'. So which is best?
In my view, when making data available, licenses are much better than
community norms. Why?
1. A license is always needed even if you are taking a PD approach.
So 'norms' don't obviate the need to license.
2. A license is able to encode 'norms' both formally and informally
(for example, in a preamble -- cf. the GPL).
3. A license is likely to elicit at least as much, and almost
certainly more, conformity with its provisions than community norms.
This is especially true outside of the community. The future is likely
to see a much more mixed data landscape, whether in science or
elsewhere, with many 'non-community' (non-academic) users in business
and among ordinary citizens. (Note also that for these groups the
simplicity and formality of a license makes it superior to 'norms' in
almost every respect -- transparency, certainty etc.)
* If there are concerns that, in some jurisdictions, the absence
of 'data' rights makes e.g. share-alike provisions unenforceable,
nothing is lost by using a license: the license de facto reverts to
the status of a community norm and any concerns regarding "false
expectations" can easily be dealt with by a simple warning.
**Flexibility:** some have argued that 'norms' are more 'flexible'
than licenses. I'm not clear what this really means:
* Flexible = not enforceable. Perhaps true but I am unclear why this
is an advantage (even to a user it is easy to comply with the open
* Flexible = leeway around the edges. For example I won't get in
trouble if I don't attribute quite right. But this is true of licenses
too: it is very unlikely anyone gets sued for a minor error in
attribution and even with share-alike no court is likely to award
damages for a mistake made in good faith -- especially if it can be
* Flexible = fuzzy. Fuzziness does not seem an attractive property
when sharing data -- both sharer and sharee want clarity.
* Flexible = easily changed. Allowing major changes is a serious
problem both for licensors and licensees (certainty and clarity would
disappear). For minor changes licenses are just as good.
Thus, in every respect I can think of, licenses are superior to
community norms when making open data available.
Summarizing the conclusions from the above discussion we have:
Qu 0: Does this matter?
**Yes.** A good definition of openness and the use of some form of
licensing is crucial to a healthy future for the open data community
(and that will include pretty much everyone ...).
Qu 1: Is it important to license?
Ans: **A 'license' is always necessary** -- even if you advocate a
PD-only approach. There is too much variation (and uncertainty) about
what the IP situation is across the world to just go with the default.
All providers of data should apply some kind of license or PD
Qu 2: What 'restrictive' requirements are compatible with openness? In
particular does 'open' equate to PD only or are attribution and
share-alike 'requirements' permitted?
Ans: **Both attribution and share-alike should be permitted.**
Attribution is widely agreed to be acceptable. The second,
'share-alike', is more controversial, but in my view should be allowed:
there is no reason to break with the precedent set in code and content
domains and its benefits seem substantial while costs are minimal if
licenses are correctly managed.
Qu 3: Community norms or licenses?
Ans: **Use licenses when making data available.** Licenses provide all
the benefits of community norms in terms of explicitly encoding the
preferences of a community. At the same time they deliver greater
clarity and transparency and, in many jurisdictions, provide a legal
enforceability which norms do not with regard to requirements of
attribution or share-alike.
This essay comes out of ongoing discussions over the last few years
with a large assortment of communities and individuals. The primary
motivation for sitting down and pulling the threads together came out
of reading [Michael Nielsen's post on The role of open licensing in
open science][nielsen] (+ [thread][nielsen-thread]) and recent emails
with [John Wilbanks of Science Commons][sc] on the [Open
Definition][od] coord list.
Related work and earlier discussion on this matter include:
* The [Open Definition][od]
* The [Protocol for Implementing Open Access Data][protocol]
* The [Guide to Open Data Licensing][guide]
* [Open Data Discussion on SPARC Open Data List (2006)]
* [Copyright Not Applicable to Geodata Post (2007)][geodata] (+
* The [Open Data Commons][odc]
* [CCZero license][cczero]