[open-science] Making use of data: tools, process and openness
wilbanks at creativecommons.org
Mon May 9 01:25:08 UTC 2011
All good points, and well taken.
I'd add on a few bits...
first, data != knowledge, at least, not most of the time, and especially
not in the sciences. Data that's related to things where we have basic
agreement on terminology, like latitude and longitude, is a lot closer
to knowledge than data that's related to terminology that is constantly
changing, which is the case in most interesting science areas. If the
ontology isn't shifting, the science is boring. That's why most science
data management plans don't focus on whether the data are open or
closed, but on whether or not the data are annotated, whether or not the
various processes that have operated on the data have been tracked,
whether or not the software models that turned that data into something
closer to knowledge were as valid as they can be, etc - it's certainly
where the US NSF is going. IP takes about a day, the rest of this takes
second, science != science. there is not a lot of annotation that is
useful in the life sciences that is useful to the physicist, whereas the
annotation on city streets created by open street map folks can be
re-used and mashed up against city bus data that tracks against other
street data. science isn't science; it's dozens if not hundreds of
disciplines that to one level or another follow some variation of the
scientific method, but don't share ontology or ideology or practice.
note egon and peter's exchange about open bioinformatics earlier today -
from the outside, peter finds it open, from the inside, egon further
subdivides bioinformatics into sequencing and everything else, with
sequencing forced open not by ideology but by journal practice. that
little exchange lets us see an ocean in a grain of sand.
(paul david's work in this space is essential reading, if one is
interested - his papers are available at
third, data doesn't necessarily get "better" if it's open, unlike open
source software or open content like wikipedia. with enough eyes, all
data may be useful, but all data bugs aren't necessarily fixable by
enough eyes and hands. this is something that i took a long time to
understand, that i didn't get until i spent a lot of time around the
folks doing open bio data at sage and open energy data at openei. bad
data, open or closed, is actively harmful and unfixable in a way that
bad code isn't. poorly annotated data is a close second in terms of
harmfulness, but annotations - unlike the underlying measurements - can
indeed get better via incremental edits. i'd love to see a heavier focus
on open, incremental annotation systems as part of the open science
movement, not the IPR status of the underlying dataset (which often
cannot be made "open" as per OKD or other definitions for privacy
reasons, but can be made available under a very low transaction cost
contract regime if one agrees to standard terms and conditions).
obviously i think open is essential. i'm infamous for my public domain
but i am worried that we are too often focused on open in the
service of open, not open in the service of making data useful. we ask
if it's closed, not if it's annotated, not if there's a plan to preserve
it over time, not if there's a way to know where it came from so we can
know if we trust it, not if there's a model that can turn it into
it's not like the open access movement, where we had to push back hard
against an industry that wanted - wants - to keep everything closed.
there isn't a data publishing industry to fight against. there's a group
of people all trying to figure out how to live in a data rich world. we
as the open science community are part of that group. our job is to
convince them that being part of our group is key to their success, too...
thanks rufus for provoking, looking forward to seeing you virtually on
the webinar tomorrow.
On 5/8/2011 5:09 AM, Rufus Pollock wrote:
> [Changing subject to reflect change of thread direction]
> Great comment John. I'd like to expand on it!
> In almost all the talks (e.g. ) I give I aim, at least once, to
> make the statement along the lines:
> "Openness is *not* an end in itself, it's a means to an end"
> The *real end* being the creation, processing and use of
> information/knowledge more effectively for the purpose of bettering,
> in some way or other, our own lives and the world around us -- be that
> finding a better way to travel to work, improved understanding and
> predictions of climate change, a better way to select a stock
> portfolio, working out who to vote for ...
> Now there is clearly a very large set of things that can contribute
> to us getting better at the "creation, processing and use of
> information" but I'd argue that the following are particularly
> important (clearly these all interlink ...):
> 1. Scalability -- i.e. to deal with large amounts of material
> 2. (Improved) Tools, techniques and process
> 3. Wide access to the raw data and content
> [I'd also add a fourth item but strictly it isn't a requirement but a
> personal desire!: 4. To do this in a collaborative, distributed and
> decentralized manner -- avoiding the centralization of information
> 'power' (be that in the actual control of information or in 'refining'
> Now I'd argue that openness -- both of data/content and of tools -- is
> really important to each of these:
> 1. Scalability - central, in my view, to successful 'data scaling'
> will be componentization: breaking up material into maintainable
> chunks (components) that can be recombined. However, without openness
> recombination will rapidly become extremely hard -- if not impossible
> -- as one has to clear rights with all of the different providers of
> 2. Tools, technique and process. Open data makes it much easier to
> develop and share tools, techniques and processes for working with
> 3. Wider access to the material: given the vast amount of material
> becoming available we're going to want as many people as possible (and
> not just 'professionals') to be able to access, experiment with and
> redistribute that data as easily as possible (cf. the many minds
> principle: the best thing to do with your data will be thought of by
> someone else).
> To sum up then: I completely agree with you John that "doing useful
> stuff" with the data is the central thing (in some sense the only
> thing). We should always make clear that openness is important because
> of what it helps us do not because it is an end in itself.
> Furthermore, tooling, annotation, feedback loops etc are
> absolutely central (which is why I'd estimate that much more than
> 50% of the Open Knowledge Foundation's time and resource is spent on
> developing tools, platforms, techniques like CKAN for working with
> (open) data and content).
> At the same time, as I've outlined above, I think openness is pretty
> central to making significant progress "doing useful stuff". Given
> this, I think it is important that we do "go on" about *open* data --
> not, of course, in an obsessive ideological are-you-in/are-you-out way
> but in an exhortatory, this-matters-to-what-we-can-build way. This is
> especially true at the present time, when the default in most areas
> still seems to be non-open.
> Example talk: http://m.okfn.org/files/talks/ccc_20091228/
> CKAN: http://ckan.net/ and http://ckan.org/ (software)
> On 6 May 2011 17:15, john wilbanks <wilbanks at creativecommons.org> wrote:
>> this is a trending part of the data conversation that i am seeing worldwide
>> - licenses on data are considered a tiny part of overall data management.
>> the UK is the most progressive, and the US default position of public domain
>> is nice (though we are continuing to press for its being in the PD globally,
>> and not just domestically). but when the scientists who are not on this list
>> gather to talk data with their funders, IP is rarely at the top of the
>> it is of course essential, but far from sufficient, for data to be "open" -
>> if it's not annotated, doesn't have provenance, doesn't have tracks of how
>> it came from its original raw forms to the intermediate processed forms that
>> are so much more useful, doesn't have tracks of the feedback loops that
>> processed it, etc.
>> i now spend much of my time these days *outside* the open science world, in
>> what they call out here the "big data" world. most of the data people i talk
>> to are obsessed with all of the ways data is made a) useful via tooling and
>> annotation and b) social (in the sense of the feedback loops and
>> conversations among big data users, not in the facebook sense) and are not
>> very concerned with the openness of data in the absence of a) and b).
>> we've got to get serious as a community about addressing these things
>> ourselves, or we risk becoming a one-note community obsessed with whether or
>> not data is "open" rather than joining the debate about how to make data
>> *useful* - and making the open argument a key to that broader one.
>> My .02
>> On 5/6/2011 4:55 AM, Jonathan Gray wrote:
>>> Surprised to see only one mention of "open data" on page 30. A shame that
>>> there isn't a brief step by step guide to openly licensing data in the
>>> Does anyone know any of the authors of this that we could contact - to see
>>> if they'd consider putting this in next time?
>>> ---------- Forwarded message ----------
>>> The UK Data Archive has just published the 3rd edition of its 'Managing
>>> and Sharing Data - best practice for researchers' guide.
>>> It is available online at: *
>>> Hard copies can be requested from *Communications
>>> enquiries*<comms at data-archive.ac.uk>
>>> This edition contains much new content, illustrated with numerous case
>>> studies. Guidance is aimed at researchers across the natural and social
>>> sciences and humanities, and covers:
>>> - why and how to share research data
>>> - data management planning and costing
>>> - documenting data
>>> - formatting data
>>> - storing data
>>> - ethics and consent in data sharing
>>> - data copyright
>>> - data management strategies for large investments
>>> New and updated guidance results from working closely with researchers,
>>> centres and programmes within the JISC-funded Data Management Planning for
>>> ESRC Research Data-rich Investments (DMP-ESRC) project.
>>> The guide is published thanks to funding from the Joint Information
>>> Systems Committee (JISC), the Rural Economy and Land Use (Relu) Programme and the
>>> UK Data Archive.
>>> VAN DEN EYNDEN
>>> RESEARCH DATA MANAGEMENT SUPPORT SERVICES & RELU DATA SUPPORT SERVICE
>>> UK *DATA ARCHIVE*
>>> UNIVERSITY OF ESSEX
>>> WIVENHOE PARK
>>> ESSEX, CO4 3SQ
>>> *T* +44(0)1206 872234; 07768432422
>>> *E* *veerle at essex.ac.uk*
>>> *W **www.data-archive.ac.uk*<http://www.data-archive.ac.uk/>
>>> ENSURING CONTINUOUS ACCESS TO HIGH QUALITY RESEARCH DATA
>>> Legal Disclaimer: Any views expressed by the sender of this message are not
>>> necessarily those of the UK Data Archive or the ESRC.
>>> This email and any files transmitted with it are confidential and intended
>>> solely for the use of the individual(s) or entity to whom they are
>>> open-science mailing list
>>> open-science at lists.okfn.org
>> John Wilbanks
>> VP for Science
>> Creative Commons
>> web: http://creativecommons.org/science
>> blog: http://scienceblogs.com/commonknowledge
>> twitter: @wilbanks