[open-science] the early-career guide to doing open science?

Fri Mar 16 15:02:57 UTC 2012

summary: are there guides to, e.g., archiving and enabling access to
science inputs and outputs? esp for the under-resourced, early-career
scientist-in-training.

details:

I'm a former software engineer, now a graduate student in atmospheric
modeling. My products as a computational scientist will continue to be
"soft" (whether, e.g., code, documents, graphics), and will therefore
have needs similar to those of open-source software (OSS) projects:
e.g., version control, backup, public access. (Hence I still consider
myself very much a software engineer, though my colleagues seem to see
themselves as scientists who just happen to work with software--but
that's a separate matter.) As a coder I have worked, and continue to
work, on several OSS projects, and am fairly familiar with the various
distributed version-control systems (DVCS, e.g., git) and cloud-based
platforms for OSS development.

I'd like to learn more about best practices (and, frankly, cheap
practices :-) for similarly maintaining and (for want of a better
term) "opening" one's scientific products, whether finished or under
development. Ideally I'd like to also

* keep one copy of important data on my cluster, and another in a
  cloud repository

* version important data as it's received and processed

* version analytics (e.g., plots that take more than a minute, or that
  require significant setup, to produce) as they are updated

similar to the manner in which one uses syncable local and cloud DVCS
for the code that processes and analyzes that data. I could then

* point colleagues at the cloud repository for collaboration

* reference a branch of my project as supplemental information for
  publications

* do "automated build" of publications out of the repository, in the
  manner that installable software is built from sources

* incorporate branched data from others' repositories as needed

I am currently hosting a small part of my current project on free OSS
sites. But, unlike most straight-code projects, data (whether raw or
processed) must also be managed, in volume. Unfortunately, the free
sites of which I'm aware usually

- provide what are, for me at least, small filespaces (scale ~= 1 GB).

- disallow versioning of large files and binaries (e.g., netCDF data) 
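
To make the large-file problem concrete: one cheap workaround I can
imagine is to version a small text manifest of checksums in the DVCS
while the bulk data lives elsewhere (e.g., on my cluster). The
following is only a minimal sketch under that assumption, using just
the Python standard library; the function names and the JSON layout
are my own illustration, not the API of any existing tool.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks
    so that large data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file under data_dir (by relative path) to its size in
    bytes and its SHA-256 checksum."""
    root = Path(data_dir)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            manifest[path.relative_to(root).as_posix()] = {
                "bytes": path.stat().st_size,
                "sha256": file_sha256(path),
            }
    return manifest


def write_manifest(data_dir, manifest_path):
    """Write the manifest as sorted, indented JSON so that successive
    versions diff cleanly. Keep manifest_path outside data_dir, or the
    manifest will list itself on the next run."""
    manifest = build_manifest(data_dir)
    text = json.dumps(manifest, indent=2, sort_keys=True) + "\n"
    Path(manifest_path).write_text(text)
    return manifest
```

Calling write_manifest("data", "MANIFEST.json") (both names
hypothetical) and committing only MANIFEST.json would record which
data files changed and when, without the repository ever holding the
netCDF files themselves. It does not, of course, solve the problem of
where to put the data.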

Given my status, and the state of science funding in the US, free
repositories are all I can afford for the foreseeable future. I would
hope that one or more of the institutions with which I am affiliated
would provide functionality suited to open-science workflows such as
the above, but that does not seem forthcoming. (They seem much more
interested in keeping all but officially-approved content "inside the
firewall," which is partly understandable, but greatly restricts
collaboration and openness.)

I find the open ethos compelling in both domains--science and
software--both normatively (esp for more policy-relevant science) and
positively/pragmatically ("given enough eyeballs, all bugs are
shallow"). Hence I'm
hoping that there is already support for the type of workflows sketched
above (and solutions to other open-science problems of which
I am as yet blissfully unaware :-), and that folks out there can pass
along pointers to sites, groups, or docs describing their use.

TIA, Tom Roche <Tom_Roche at pobox.com>