[open-science] Fwd: Open data and Panton Principles

Mon Jun 28 15:33:07 UTC 2010

Elizabeth,

I'm at your disposal if you'd ever like to chat in real time, and I commend you on your work to release data. It's an unfortunate side effect of a robust movement and debate that it's far too easy to jump on those who don't "comply" with our requirements without first saying "huzzah." I'm often guilty of this :-)

I think there are at least three issues tangled up in here, which I will try to untangle. I will be cross-posting this to my blog in some version, but this is a bit of a stream of consciousness. Apologies for the length, but this is vital to get right.

The first is, "how successful are the principles?" - which is implicit in your question of who is "adopting" them. The second is the interaction of legal tools around intellectual property with other rights related to human subjects data. And the third is sustainability of data archives, which creates an incentive to contemplate "monetization" of data and/or use of tools that reserve commercial rights.

I'll try to speak to each, but I'm speaking for myself here, not presuming to speak for the list or the other authors of the Panton Principles. Indeed it's fair to say that Rufus and I have very different positions on some of these topics, as most folks around here know :-)

But that actually is related to the first issue. The principles are already a success to me, because the audience for them in many ways is the group of people who were trying to understand how some of the leaders in open data could reconcile the philosophical differences that naturally emerge in a decentralized movement. The differences of opinion inside open data were strong enough to be creating waves, and those waves could have been exploited by forces *outside* the open data world against us. Although I have a lively debate with Rufus and others on these issues, they're issues to us of how best to share, and the true threat to sharing regimes is from non-commons models. Panton to me is a spirited proof that we can put differences aside and agree on how to work together in a commons. 

"Adoption" to me is often a red herring. The organizations you cite in your email don't actually make data - so getting them to sign on is not my priority. And getting organizations to "sign on" in general isn't my priority - I'm trying to get organizations to "release data"! Thus I would point to GSK's recent use of CC0 and the Polar Information Commons' formation and commitment to the public domain, rather than to signatories to a web page. There are more than a dozen OA declarations and thousands of signatories to them, but the relevant number is 16%, which is the share of the literature that is actually OA. My personal focus is on getting foundations that invest in science to implement public domain terms as they relate to IPRs in their funding agreements, which is PP compliant, but isn't about encouraging orgs to sign the principles as a declaration. In my experience getting a signature can actually release the pressure to *do something* about the data - one can say one has signed, and do nothing.

The second issue is related to the rationale we've long held at Creative Commons about data in science (whether social or "hard" science), which is that the public domain creates the fewest interoperability problems. Issues of confidentiality, privacy, and human subjects protection are subject to a large, complex, and often contradictory set of legal regimes that are not internationally harmonized like copyright, and that are rapidly being overwhelmed by technological progress. The success of the folks who broke the Netflix data by re-identifying people through cross-reference to other publicly available data is a good example of this - the Netflix data itself was de-identified, but it turns out that a few smart folks with lots of processor power can cross-reference to other movie review databases and bang, identify people uniquely.
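
As a rough illustration of how that kind of cross-referencing works, here is a toy sketch in Python. Everything in it is invented for the example (the records, the names, the `link` helper, and the `min_overlap` threshold), and it uses exact matching on (title, date) pairs, which is a deliberate simplification and not a description of what the people who re-identified the Netflix data actually did.

```python
# Toy illustration only: every record and name below is invented.

# A "de-identified" release: ratings keyed by an opaque ID.
deidentified = {
    "user_0413": {("Brazil", "2005-03-01"), ("Alien", "2005-03-04"), ("Heat", "2005-04-11")},
    "user_0977": {("Alien", "2005-03-04"), ("Fargo", "2005-05-20")},
}

# Reviews posted under real names on some other, hypothetical public site.
public_reviews = {
    "Alice Example": {("Brazil", "2005-03-01"), ("Heat", "2005-04-11")},
    "Bob Example": {("Fargo", "2005-05-20")},
}


def link(deidentified, public_reviews, min_overlap=2):
    """Return {opaque_id: name} for opaque records where exactly one public
    profile shares at least min_overlap (title, date) pairs with the record."""
    matches = {}
    for opaque_id, ratings in deidentified.items():
        candidates = [
            name
            for name, reviews in public_reviews.items()
            if len(ratings & reviews) >= min_overlap
        ]
        # Only claim a match when the overlap singles out one person.
        if len(candidates) == 1:
            matches[opaque_id] = candidates[0]
    return matches


reidentified = link(deidentified, public_reviews)
```

The point of the sketch is only that no name ever appears in the released data; the identification comes entirely from the overlap with a second, public source.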

There is a very real risk of conflict between "licensing" and the sorts of good citizen regimes that need to be implemented to share this kind of sensitive data. We're grinding away on these issues at CC, between our work with the Sage Commons, some emerging clean energy commons efforts, and climate change projects in which citizens gather data on air quality via their cell phones. Copyright is tough enough to deal with, but when one adds in sui generis database rights and contracts, it gets nearly untenable, which is why three years ago we recommended the public domain as a solution - it leaves one in a situation where the only issues to be dealt with are privacy and technical formats... "only" being a relative term of course. Otherwise, it's easy to see a world where there are multiple contradictory requirements imposed on users and re-users of data, where there is a contract to protect privacy, a contract to mandate share alike, a database right trigger, and a copyright trigger *even if all four were simply in support of attribution*.

There is also the very real issue of technology as regards any licensing based on copying. As the semantic web becomes a reality, and data gets bigger, the need and the desire to make copies decreases. Thus, anything based on copying is likely to fade in power over time - because we'll simply all run federated queries. But that's another email. 

My instinct is that the right way to share human subjects data (until we reach a cultural shift where there is non-discrimination legislation and where people "own" their own data, rather than Facebook owning their data) is through a tiered stack of services and data provision, where the IP status is public domain, but where there is some encoding of norms and liability to prevent abuse. That encoding might be through aggressive statistical "jittering" of data to de-identify it. It might be through some of the "differential privacy" techniques that allow service-based access to data, or homomorphic encryption that allows algorithmic access to sensitive data without decrypting it. Or it might be through more classical approaches of request-based access to human subjects data, but with lower transaction costs through the standardization of the various legal tools around requesting and granting access. We're looking at all of these here in real world situations and would be happy to provide you with some information if that's of interest. There are no easy answers.
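
To make the "differential privacy" option a little more concrete, here is a minimal sketch in Python of one textbook technique, the Laplace mechanism applied to a counting query. It is an illustration under stated assumptions, not a description of any system we run: the function names `laplace_noise` and `noisy_count`, the example ages, and the choice of epsilon are all made up for the example.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws with mean `scale`
    # is Laplace-distributed with location 0 and that same scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, rng=None):
    """Answer 'how many records satisfy predicate?' with Laplace noise added.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so the noise scale used here is 1 / epsilon.
    """
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Invented example data: ages of hypothetical study participants.
ages = [34, 41, 29, 57, 62, 45, 38, 51]

# The service returns only the noisy answer, never the underlying rows.
answer = noisy_count(ages, lambda a: a >= 40, epsilon=0.5, rng=random.Random(0))
```

A smaller epsilon means more noise and stronger protection for any one individual, at the cost of less accurate answers. That trade-off is part of why this is a service-based approach rather than a data release.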

The third issue is one of sustainability. It's nontrivial. But we should look at the realities of data. Selling data that isn't marketing data isn't big business. And every transaction cost imposed on the re-use of data is likely to decrease its net value over time, because it limits the conversion of data into a platform for services and applications. The opening of data is not like the building of e-commerce storefronts. It's like laying the foundations for the internet in the 1970s and 1980s. Open data is a long way away from the world of the web, where small transactions can add up to real money. It's in a far more protean state, one that needs standards and boring technical efforts, one that awaits a sea change in science culture, one that needs entirely new technologies (as new to us today as the web browser was in the mid-1970s, and as difficult to conceive - but hopefully not as far away!).

Sustainability is a challenge that will require new business models, new cultures, good ideas in curation, and more than anything, funder commitment. It's not going to be solved via legal tools. I'd beware of any magic pixie dust license that promises sustainability via non-commercial terms. To the extent that NC terms work in my own experience, it's for individuals who want their songs and photos online without exploitation - and we don't, for the most part, turn songs and photos into the basis of new research. We don't federate songs and photos (again, for the most part!) for automated database query. Thus we shouldn't apply the same legal tools. 

The public domain is not an easy decision. But it is a *simple* decision. And that, in the end, is the point. It's the only way to make a murderously complicated issue - intellectual property rights around data - simple enough that we can begin to address the other murderously complicated issues that *don't have a simple solution available*. 

jtw

On Jun 28, 2010, at 7:06 AM, Chris Rusbridge wrote:

> Speaking as someone who agreed such a restriction in the past, as the then Director of the Digital Curation Centre: we had an obligation to pursue "sustainability" as an explicit part of our funding conditions. On that basis, we decided that our default licence would be CC-BY-NC-SA. Our reasoning was that if there was money to be made from our resources, our sustainability obligation meant we had to try to be part of it. Putting a NC restriction doesn't mean "no commercial use, ever". It means "no commercial use unless you discuss with us the terms and conditions".
> 
> In the 5 years or so I was part of the DCC I think this only caused us a problem a couple of times. From memory once we agreed with no conditions (the use was pretty much equivalent to academic use), and once we agreed terms that would allow the author to earn significant royalties before a cut was due to us. I need hardly say we have as yet received nothing, as far as I know.
> 
> This is of course mostly in a text-based context. I'm not clear in my own mind in relation to data, on whether the barrier to re-use of any kind of licence (including BY and NC) can fully outweigh the need to strive for sustainability. However, I do remember investigations into Crown Copyright etc, well before the current openness regime, that suggested that only Ordnance Survey made any significant (5 figures or higher) annual returns on IP exploitation, as well as experiences from the eLib programme that misguided sustainability approaches led to potentially valuable resources being locked away. Which is all tending to make me think that commercially exploitable IPR in data will be comparatively rare, and therefore the default view SHOULD be for greater openness. 
> 
> It would help if there were a more automated approach to the data citation problem since then one could at least demonstrate impact more easily, even if not sustainability!
> 
> --
> Chris Rusbridge
> Consultant
> Mobile: +44 791 7423828
> Email: c.rusbridge at gmail.com
> 
> 
> 
> 
> On 28 Jun 2010, at 14:49, Mr. Puneet Kishor wrote:
> 
>> 
>> Libby Bishop wrote:
>> 
>>> Here at UKDA, we advocate free access to data for non-commercial use.
>> 
>> 
>> Could you please elaborate your reasoning behind your decision to restrict use of data to non-commercial use (assuming the original depositor has not put any such condition, and assuming there are no privacy issues).
>> 
>> I continue to be puzzled and befuddled by open access advocates continuing to push and peddle non-commercial restrictions, as if non-commercial is virtuous. Why? Why the distrust and dislike of things commercial?
>> 
>> 
>> -- 
>> Puneet Kishor http://punkish.org
>> Carbon Model http://carbonmodel.org
>> Charter Member, Open Source Geospatial Foundation http://www.osgeo.org
>> Science Commons Fellow, http://sciencecommons.org/about/whoweare/kishor
>> Nelson Institute, UW-Madison http://www.nelson.wisc.edu
>> -----------------------------------------------------------------------
>> Assertions are politics; backing up assertions with evidence is science
>> =======================================================================
>> 
>> 
>> _______________________________________________
>> open-science mailing list
>> open-science at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/open-science