[open-science] toward a low-overhead fastpath for open-science "publishing"
Tom_Roche at pobox.com
Fri Aug 3 00:50:00 UTC 2012
summary: let's produce something like a fastpath (more below)
demonstrating an open-science "publishing" (more below) process
suitable for use by the very-early-career or #scholarlypoor
aspiring scientist. I sketch parts of one (below the line).
Thanks, all, for your interesting talk today.
Since there seemed to be interest (at least, from me :-), and since I
think it's a fairly general use case that might be useful for (e.g.)
tutorials/howtos/etc, and since I'd definitely appreciate suggestions
for improvement (esp. of ease-of-use :-), here's some more detail about
what I'm trying to do, and how. (Got a better way to communicate/
collaborate on this particular topic? Please lemme know.)
What I hope will emerge from this (eventually :-) is a "fastpath" for
open-science "publishing" suitable for folks like me:
"Fastpath" (a term from my years coding for a massive three-letter
acronym) refers to a document for a demo. (Or the script for a demo
video.) A fastpath is one level below a howto--even more hands on,
with even less explanation: "type this, then click that, and you'll
get this thing." Fastpaths instill confidence (to the extent they are
completely replicable), and can convince the audience to learn more.
(Notably, to RTFM--which the fastpath's intended audience is usually
loath to do.) Note that a fastpath doesn't need to be, and usually
isn't, the "absolute best way" to do something (which usually varies
by practitioner): it's just so drop-dead easy and robust (and is
therefore usually designed before, and developed with, the code it
documents) "even Marketing can run it." FWIW, one part of an
open-science publishing fastpath should definitely be
I put "publishing" in quotes because the domain I target includes many
activities (e.g., logging, collaboration) that terminate (more or
less) in generation of formally-structured content ("FSC"). The other
activities are unfortunately disrespected relative to the FSC, which
is what earns funding, tenure, etc--such is life.
And, in my own self-interest, I'm also definitely interested in
assistance and mentoring with my own effort in this space, which
follows; your comments/suggestions are appreciated.
AUDIENCE: very-early-career science students (I'm a master's student)
and workers who
1 are reasonably "neterate," i.e., can work with fairly high-level
descriptions of internet-available sites/services and the tools/
protocols available for their access
2 must publish (if only a thesis) to "get ahead" (i.e., the
second-lowest level of the precariat), but haven't yet (or not
3 want to "be open," i.e., to have their project's plans, data, and
artifacts publicly available (minimally, for collaboration). Extra
credit for exaltation of "replication in computational science"
(and thanks to Cameron Neylon for passing that pointer)
4 are #scholarlypoor in
- time, notably because they're so busy learning the domain about
which they need to publish that they lack time to invest in
optimizing publishing tools and processes
- money, to pay someone(s) to handle the more strictly publishing-
oriented tasks which they might prefer to offload.
Note my audience is more early-career, and my task (below) more
individually focused, than would suit someone like Ethan Perlstein,
whose Journal Wordpress
I see as targeting the science worker at the next career level up: a
team builder who is necessarily more focused on team funding,
management, and marketing.
TASK: with minimum investment of (e.g.) time, money, effort,
1 make project content (e.g., plans, data, visualizations) publicly
available *over time*: i.e., so as to be
* easily editable (including adding, subtracting, and linking
artifacts) by authors and collaborators
* easily citable (from, e.g., webpages, email, blogs)
* secure (from vandalism and loss)
over a timescale of years (e.g., 1-5 years, or however long a PhD
takes wherever you are).
2 ease migration/transformation of project content from state="raw"
(i.e., primarily consumed by one's team) to state="publishable"
(i.e., suitable for formal submission to formal internal or external
publishers or reviewers, in the formats demanded by the latter).
STRATEGY: I'm either doing or intending to do this now by
1 using a github wiki to maintain a group of pages and associated
media relating to the project, including a top-level page (currently
) from which one should be able to discover
* project intent
* project status
* what one seeks within the project (provided it's there :-)
Unfortunately github's wiki docs are (IMHO) relatively poor; though,
since the wiki is a repo, one benefits from the generally-excellent
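For instance, because a github wiki is itself a git repo (cloneable at
`<repo>.wiki.git`), the whole wiki can be maintained with ordinary git
commands. A minimal sketch, simulated here with a local repo --
substitute your real wiki URL (e.g.
`git clone https://github.com/USER/PROJECT.wiki.git`):

```shell
# Each wiki page is just a Markdown file named after the page;
# "project.wiki" stands in for a clone of your real wiki repo.
git init -q project.wiki
printf '# Project X\n\nIntent, status, and an index of artifacts.\n' \
  > project.wiki/Home.md
git -C project.wiki add Home.md
git -C project.wiki -c user.name=me -c user.email=me@example.org \
  commit -q -m "add top-level page"
# With a real wiki you would now `git push`, and github renders the page.
```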
2 storing and processing data, and generating related products (e.g.,
visualizations), with maximum automation and reproducibility. This
breaks down as follows:
2.1 storing both raw and processed data and products in public
datastores.
I'll need to keep local copies of processable data anyway (notably
due to latency and processing requirements), and my organization
has a capable-looking hierarchical filesystem, so I'm not so
concerned about data security when evaluating public datastores. I
also understand (better, after the talk) the costs associated with
DOI minting and maintenance, and must accommodate that. But my
datasets are fairly large (GBs, not yet TBs), so capacity is
definitely an issue. So I'm thinking I will
* publish DOI-worthy data and products at public DOI providers
(e.g., pangaea.de (earth-science-specific), figshare.com), so as
to stay within space limits for each provider.
* publish raw-er data and products, for which DOIs are less
necessary, at non-DOI providers such as thedatahub.org et al
(e.g., github), so as to stay within space limits for each provider.
(datadryad.org also seems useful, but *much* too late in the
process. Why wait to open-store your data until you have a
published article? Am I missing something?)
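Whichever mix of providers one lands on, it helps to record checksums
alongside the data, so that any copy (local hierarchical filesystem,
figshare, github) can be verified against the others. A hypothetical
layout (the directory and file names below are mine, not prescribed by
any provider):

```shell
# Split holdings into raw data and derived products, then fingerprint
# every file into a manifest that travels with the data.
mkdir -p data/raw data/products
printf 'lat,lon,o3\n35.9,-79.0,41\n' > data/raw/obs.csv        # stand-in data
printf 'mean_o3\n41\n' > data/products/summary.csv             # stand-in product
( cd data && find . -type f -name '*.csv' | sort \
    | xargs sha256sum > MANIFEST.sha256 )
# Any mirror of data/ can now be verified against the manifest:
( cd data && sha256sum -c --quiet MANIFEST.sha256 )
```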
2.2 processing data and generating products using open-source and
publicly-available code and "engines" (e.g., compilers, VMs)
I'm only partly compliant: I use mostly R for my own work, but
generate the base-level data using legacy models which currently
rely on proprietary compilers.
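Even with a proprietary compiler at the bottom of the stack, everything
above it can be scripted so that one command regenerates every derived
product from the raw data. A toy sketch of the idea (awk standing in
for the real R processing step):

```shell
set -e                                   # stop on the first failure
mkdir -p raw out
printf '1\n2\n3\n' > raw/values.txt      # stand-in for model output
# One scripted, deterministic step from raw data to product:
awk '{ s += $1 } END { print "sum=" s }' raw/values.txt > out/summary.txt
cat out/summary.txt                      # -> sum=6
```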
2.3 managing code with open-source SCMS and public datastores (aka
git+github seems fine for this; but, as with previous, there are a
lotta good tools and services in this space. (That being said,
github has some excellent documents that could be easily
incorporated into a fastpath.) I write R, but have not been
packaging it: gotta start doing that.
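On the R-packaging point: the minimum layout (per "Writing R
Extensions") is just a DESCRIPTION file plus an R/ directory, so
starting costs little. A sketch, with "myanalysis" a placeholder
package name:

```shell
mkdir -p myanalysis/R
# DESCRIPTION holds the metadata R requires of every package.
cat > myanalysis/DESCRIPTION <<'EOF'
Package: myanalysis
Title: Analysis Code for My Project
Version: 0.0.1
Author: Me
Maintainer: Me <me@example.org>
Description: Scripts promoted to a citable, installable package.
License: GPL-3
EOF
# Functions live under R/; one file per topic is conventional.
cat > myanalysis/R/summarize.R <<'EOF'
summarize_obs <- function(x) mean(x, na.rm = TRUE)
EOF
# `R CMD build myanalysis` would then produce a distributable tarball.
```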
3 managing references with open-source RMS and public datastores.
Lotta range here, from all-the-way FOSS (e.g., zotero) to mixed
public/proprietary (e.g., mendeley, IIUC).
4 generating formally-structured content ("FSC"--e.g., articles,
posters, presentations) directly (with maximum automation and
reproducibility) from more process-oriented content ("POC"--in the
above, wiki(s) and datastore(s)).
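A minimal sketch of the POC-to-FSC direction, assuming the wiki pages
are Markdown: concatenate the process-oriented pages into a single
draft, which a converter (e.g., pandoc, or Boettiger/dexy-style
tooling) can then turn into the formats publishers demand. The page
names here are hypothetical:

```shell
mkdir -p wiki
printf '# Methods\n\nModel configuration, inputs, ...\n' > wiki/Methods.md
printf '# Results\n\nSummary statistics, figures, ...\n' > wiki/Results.md
# Assemble the process-oriented pages, in order, into one draft:
cat wiki/Methods.md wiki/Results.md > draft.md
# A real pipeline would continue with, e.g.: pandoc draft.md -o draft.pdf
```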
PROBLEMS: include, more-or-less in order from least to most severe:
1 Gollum, github's wiki rendering engine--at least, the
publicly-deployed version--is butt-ugly. For aesthetics, I miss
mediawiki, but the version to which I have access is behind a
firewall, for which it's hard to enable access. (And MW does not
support interaction via git repo, which I very much like about
github's wiki. And backing up an MW is still a PITA, at least
relative to backing up a git repo.)
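That backup asymmetry is worth making concrete: backing up MediaWiki
means dumping a database plus an uploads directory, while any
git-backed wiki is mirrored in one command. Simulated locally below
(substitute your wiki's URL for the local path):

```shell
# Build a stand-in "wiki" repo to back up.
git init -q wiki
printf 'home page\n' > wiki/Home.md
git -C wiki add Home.md
git -C wiki -c user.name=me -c user.email=me@example.org commit -q -m "init"
# --mirror copies every ref; `git remote update` in the backup refreshes it.
git clone -q --mirror wiki wiki-backup.git
```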
2 Gollum does not currently support automatic table-of-contents
generation, which
* makes page navigation *much* more difficult
* requires lots of manual internal-link generation
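Until Gollum grows that feature, a table of contents can be scraped
from a page's own headings, cutting down the manual link maintenance.
A sketch (gollum-style `[[...]]` internal links assumed; adjust to
plain anchors if you prefer):

```shell
# A stand-in wiki page with the headings to index:
printf '## Intent\n...\n## Status\n...\n## Data\n...\n' > page.md
# Turn each level-2 heading into a wiki-link list item:
grep '^## ' page.md | sed 's/^## /* [[/; s/$/]]/' > toc.md
cat toc.md
# -> * [[Intent]]
#    * [[Status]]
#    * [[Data]]
```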
3 I am *quite* far from an FSC-generation solution--I've been putting
it off to generate more POC (which make the bosses happy). I need to
spend some time with Carl Boettiger's stack
and look at dexy (nominated by Cameron Neylon)
before some upcoming conferences.
Your assistance is appreciated (as is just reading this far :-)