[open-science] Making use of data: tools, process and openness

Sun May 8 23:58:46 UTC 2011

All good points, and well taken.

I'd add on a few bits...

first, data != knowledge, at least, not most of the time, and especially 
not in the sciences. Data that's related to things where we have basic 
agreement on terminology, like latitude and longitude, is a lot closer 
to knowledge than data that's related to terminology that is constantly 
changing, which is the case in most interesting science areas. If the 
ontology isn't shifting, the science is boring. That's why most science 
data management plans don't focus on whether the data are open or 
closed, but on whether or not the data are annotated, whether or not the 
various processes that have operated on the data have been tracked, 
whether or not the software models that turned that data into something 
closer to knowledge were as valid as they can be, etc - it's certainly 
where the US NSF is going. IP takes about a day, the rest of this takes 
months.

second, science != science. not much of the annotation that is useful 
in the life sciences is also useful to the physicist, whereas the 
annotation on city streets created by open street map folks can be 
re-used and mashed up against city bus data that tracks against other 
street data. science isn't science; it's dozens if not hundreds of 
disciplines that to one level or another follow some variation of the 
scientific method, but don't share ontology or ideology or practice. 
note egon and peter's exchange about open bioinformatics earlier today - 
from the outside, peter finds it open, from the inside, egon further 
subdivides bioinformatics into sequencing and everything else, with 
sequencing forced open not by ideology but by journal practice. that 
little exchange lets us see an ocean in a grain of sand.

(paul david's work in this space is essential reading, if one is 
interested - his papers are available at 
http://ideas.repec.org/e/pda76.html)

third, data doesn't necessarily get "better" if it's open, unlike open 
source software or open content like wikipedia. with enough eyes, all 
data may be useful, but all data bugs aren't necessarily fixable by 
enough eyes and hands. this is something that i took a long time to 
understand, that i didn't get until i spent a lot of time around the 
folks doing open bio data at sage and open energy data at openei. bad 
data, open or closed, is actively harmful and unfixable in a way that 
bad code isn't. poorly annotated data is a close second in terms of 
harmfulness, but annotations - unlike the underlying measurements - can 
indeed get better via incremental edits. i'd love to see a heavier focus 
on open, incremental annotation systems as part of the open science 
movement, not the IPR status of the underlying dataset (which often 
cannot be made "open" as per OKD or other definitions for privacy 
reasons, but can be made available under a very low transaction cost 
contract regime if one agrees to standard terms and conditions).

obviously i think open is essential. i'm infamous for my public domain 
preferences.

but i am worried that we are too often focused on open in the 
service of open, not open in the service of making data useful. we ask 
if it's closed, not if it's annotated, not if there's a plan to preserve 
it over time, not if there's a way to know where it came from so we can 
know if we trust it, not if there's a model that can turn it into 
something useful.

it's not like the open access movement, where we had to push back hard 
against an industry that wanted - wants - to keep everything closed. 
there isn't a data publishing industry to fight against. there's a group 
of people all trying to figure out how to live in a data rich world. we 
as the open science community are part of that group. our job is to 
convince them that being part of our group is key to their success, too...

thanks rufus for provoking, looking forward to seeing you virtually on 
the webinar tomorrow.

jtw

On 5/8/2011 5:09 AM, Rufus Pollock wrote:
> [Changing subject to reflect change of thread direction]
>
> Great comment John. I'd like to expand on it!
>
> In almost all the talks (e.g. [1]) I give I aim, at least once, to
> make the statement along the lines:
>
> "Openness is *not* an end in itself, it's a means to an end"
>
> The *real end* being the creation, processing and use of
> information/knowledge more effectively for the purpose of bettering,
> in some way or other our own lives and the world around us -- be that
> finding a better way to travel to work, improved understanding and
> predictions of climate change, a better way to select a stock
> portfolio, working out who to vote for ...
>
> Now there are clearly a very large set of things that can contribute
> to us getting better at the "creation, processing and use of
> information" but I'd argue that the following are particularly
> important (clearly these all interlink ...):
>
> 1. Scalability -- i.e. to deal with large amounts of material
> 2. (Improved) Tools, techniques and process
> 3. Wide access to the raw data and content
>
> [I'd also add a fourth item, though strictly it isn't a requirement but a
> personal desire: 4. To do this in a collaborative, distributed and
> decentralized manner -- avoiding the centralization of information
> 'power' (be that in the actual control of information or in 'refining'
> (processing))]
>
> Now I'd argue that openness -- both of data/content and of tools -- is
> really important to each of these:
>
> 1. Scalability - central, in my view, to successful 'data scaling'
> will be componentization: breaking up material into maintainable
> chunks (components) that can be recombined. However, without openness
> recombination will rapidly become extremely hard -- if not impossible
> -- as one has to clear rights with all of the different providers of
> data.
>
> 2. Tools, technique and process. Open data makes it much easier to
> develop and share tools, techniques and processes for working with
> data.
>
> 3. Wider access to the material: given the vast amount of material
> becoming available we're going to want as many people as possible (and
> not just 'professionals') to be able to access, experiment with and
> redistribute that data as easily as possible (cf. the many minds
> principle: the best thing to do with your data will be thought of by
> someone else).
>
> To sum up then: I completely agree with you John that "doing useful
> stuff" with the data is the central thing (in some sense the only
> thing). We should always make clear that openness is important because
> of what it helps us do not because it is an end in itself.
>
> Furthermore, tooling, annotation, feedback loops [2] etc are
> absolutely central (which is why I'd estimate that much more than
> 50% of the Open Knowledge Foundation's time and resource is spent on
> developing tools, platforms, techniques like CKAN [2] for working with
> (open) data and content).
>
> At the same time, as I've outlined above, I think openness is pretty
> central to making significant progress "doing useful stuff". Given
> this, I think it is important that we do "go on" about *open* data --
> not, of course, in an obsessive ideological are-you-in/are-you-out way
> but in an exhortatory, this-matters-to-what-we-can-build way. This is
> especially true at the present time, when the default in most areas
> still seems to be non-open.
>
> Rufus
>
> [1]: http://m.okfn.org/files/talks/ccc_20091228/
> [2]: http://ckan.net/ and http://ckan.org/ (software)
>
> On 6 May 2011 17:15, john wilbanks<wilbanks at creativecommons.org>  wrote:
>> this is a trending part of the data conversation that i am seeing worldwide
>> - licenses on data are considered a tiny part of overall data management.
>> the UK is the most progressive, and the US default position of public domain
>> is nice (though we are continuing to press for its being in the PD globally,
>> and not just domestically). but when the scientists who are not on this list
>> gather to talk data with their funders, IP is rarely at the top of the
>> agenda.
>>
>> it is of course essential, but far from sufficient, for data to be "open" -
>> it's of little use if it's not annotated, doesn't have provenance, doesn't
>> have tracks of how it came from its original raw forms to the intermediate
>> processed forms that are so much more useful, doesn't have tracks of the
>> feedback loops that processed it, etc.
>>
>> i now spend much of my time *outside* the open science world, in
>> what they call out here the "big data" world. most of the data people i talk
>> to are obsessed with all of the ways data is made a) useful via tooling and
>> annotation and b) social (in the sense of the feedback loops and
>> conversations among big data users, not in the facebook sense) and are not
>> very concerned with the openness of data in the absence of a) and b).
>>
>> we've got to get serious as a community about addressing these things
>> ourselves, or we risk becoming a one-note community obsessed with whether or
>> not data is "open" rather than joining the debate about how to make data
>> *useful* - and making the open argument a key to that broader one.
>>
>> My .02
>>
>> jtw
>>
>> On 5/6/2011 4:55 AM, Jonathan Gray wrote:
>>>
>>> Surprised to see only one mention of "open data" on page 30. A shame that
>>> there isn't a brief step by step guide to openly licensing data in the
>>> guide!
>>>
>>> Does anyone know any of the authors of this that we could contact - to see
>>> if they'd consider putting this in next time?
>>>
>>> J.
>>>
>>> ---------- Forwarded message ----------
>>> The UK Data Archive has just published the 3rd edition of its 'Managing and
>>> Sharing Data - best practice for researchers' guide.
>>> It is available online at:
>>> http://www.data-archive.ac.uk/media/2894/managingsharing.pdf
>>> Hard copies can be requested from Communications enquiries
>>> <comms at data-archive.ac.uk>.
>>> This edition contains much new content, illustrated with numerous case
>>> studies. Guidance is aimed at researchers across the natural and social
>>> sciences and humanities, and covers:
>>>
>>>     - why and how to share research data
>>>     - data management planning and costing
>>>     - documenting data
>>>     - formatting data
>>>     - storing data
>>>     - ethics and consent in data sharing
>>>     - data copyright
>>>     - data management strategies for large investments
>>>
>>> New and updated guidance results from working closely with researchers,
>>> centres and programmes within the JISC-funded Data Management Planning for
>>> ESRC Research Data-rich Investments (DMP-ESRC) project.
>>> The guide is published thanks to funding from the Joint Information Systems
>>> Committee (JISC), the Rural Economy and Land Use (Relu) Programme and the UK
>>> Data Archive.
>>>
>>> VEERLE VAN DEN EYNDEN
>>>
>>> MANAGER
>>> RESEARCH DATA MANAGEMENT SUPPORT SERVICES & RELU DATA SUPPORT SERVICE
>>>
>>> UK DATA ARCHIVE
>>> UNIVERSITY OF ESSEX
>>> WIVENHOE PARK
>>> COLCHESTER
>>> ESSEX, CO4 3SQ
>>>
>>> T +44(0)1206 872234; 07768432422
>>> E veerle at essex.ac.uk
>>> W www.data-archive.ac.uk
>>> _______________________________________________________
>>> ENSURING CONTINUOUS ACCESS TO HIGH QUALITY RESEARCH DATA
>>>
>>>
>>> _______________________________________________
>>> open-science mailing list
>>> open-science at lists.okfn.org
>>> http://lists.okfn.org/mailman/listinfo/open-science
>>
>> --
>> John Wilbanks
>> VP for Science
>> Creative Commons
>> web: http://creativecommons.org/science
>> blog: http://scienceblogs.com/commonknowledge
>> twitter: @wilbanks
>>
>
>
>

-- 
John Wilbanks
VP for Science
Creative Commons
web: http://creativecommons.org/science
blog: http://scienceblogs.com/commonknowledge
twitter: @wilbanks