[open-science] toward a low-overhead fastpath for open-science "publishing"
pm286 at cam.ac.uk
Fri Aug 3 07:43:21 UTC 2012
This is great, Tom,
On Fri, Aug 3, 2012 at 1:50 AM, Tom Roche <Tom_Roche at pobox.com> wrote:
> summary: let's produce something like a fastpath (more below)
> demonstrating an open-science "publishing" (more below) process
> suitable for use by the very-early-career or #scholarlypoor
> aspiring scientist. I sketch parts of one (below the line).
> Thanks all for your interesting talk today
> Since there seemed to be interest (at least, from me :-),
and from me!
> and since I
> think it's a fairly general usecase that might be useful for (e.g.)
> tutorials/howto's/etc, and since I'd definitely appreciate suggestions
> for improvement (esp of ease-of-use :-), here's some more detail about
> what I'm trying to do, and how. (Got a better way to communicate/
> collaborate on this particular topic? Please lemme know.)
I am sure Sophie Kershaw - Panton Fellow - reads this, but I've copied her
in as she is running a 3-week training course on data management for
> What I hope will emerge from this (eventually :-) is a "fastpath" for
> open-science "publishing" suitable for folks like me:
> "Fastpath" (a term from my years coding for a massive three-letter
> acronym) refers to a document for a demo. (Or the script for a demo
> video.) A fastpath is one level below a howto--even more hands on,
> with even less explanation: "type this, then click that, and you'll
> get this thing." Fastpaths instill confidence (to the extent they are
> completely replicable), and can convince the audience to learn more.
> (Notably, to RTFM--which the fastpath's intended audience is usually
> loath to do.) Note that a fastpath doesn't need to be, and usually
> isn't, the "absolute best way" to do something (which usually varies
> by practitioner): it's just so drop-dead easy
for the users! It isn't drop-dead easy to create!
> and robust (and is
> therefore usually designed before, and developed with, the code it
> documents) "even Marketing can run it." FWIW, one part of an
> open-science publishing fastpath should definitely be
> I put "publishing" in quotes because the domain I target includes many
> activities (e.g., logging, collaboration) that terminate (more or
> less) in generation of formally-structured content ("FSC"). The other
> activities are unfortunately disrespected relative to that which gets
> funding, tenure, etc--such is life.
Yes. This is a major problem but at least funders are talking about it.
> And, in my own self-interest, I'm also definitely interested in
> assistance and mentoring with my own effort in this space, which
> follows; your comments/suggestions are appreciated.
> AUDIENCE: very-early-career science students (I'm a master's student)
> and workers who
> 1 are reasonably "neterate," i.e., can work with fairly high-level
> descriptions of internet-available sites/services and the tools/
> protocols available for their access
> 2 must publish (if only a thesis) to "get ahead" (i.e., the
> second-lowest level of the precariat), but haven't yet (or not
> 3 want to "be open," i.e., to have their project's plans, data, and
> artifacts publicly available (minimally, for collaboration). Extra
> credit for exaltation of "replication in computational science"
> (and thanks to Cameron Neylon for passing that pointer)
> 4 are #scholarlypoor in
> - time, notably because they're so busy learning the domain about
> which they need to publish that they lack time to invest in
> optimizing publishing tools and processes
> - money, to pay someone(s) to handle the more strictly publishing-
> oriented tasks which they might prefer to offload.
> Note my audience is more early-career, and my task (below) more
> individually focused, than would suit someone like Ethan Perlstein,
> whose Journal Wordpress
> I see as targeting the science worker at the next career level up: a
> team builder who is necessarily more focused on team funding,
> management, and marketing.
> TASK: with minimum investment of (e.g.) time, money, effort,
> 1 make project content (e.g., plans, data, visualizations) publicly
> available *over time*: i.e., so as to be
> * easily editable (including adding, subtracting, and linking
> artifacts) by authors and collaborators
> * easily citable (from, e.g., webpages, email, blogs)
> * secure (from vandalism and loss)
> at annual scale (e.g., 1-5 years, or however long a PhD takes
> wherever you are).
> 2 ease migration/transformation of project content from state="raw"
> (i.e., primarily consumed by one's team) to state="publishable"
> (i.e., suitable for formal submission to formal internal or external
> publishers or reviewers, in the formats demanded by the latter).
> STRATEGY: I'm either doing or intending to do this now by
> 1 using a github wiki to maintain a group of pages and associated
> media relating to the project, including a top-level page (currently
> ) from which one should be able to discover
> * project intent
> * project status
> * whatever else one seeks within the project (provided it's there :-)
> Unfortunately github's wiki docs are (IMHO) relatively poor, though,
> since the wiki is a repo, one benefits from the generally-excellent
> git docs.
> 2 storing and processing data, and generating related products (e.g.,
> visualizations), with maximum automation and reproducibility. This
> 2.1 storing both raw and processed data and products in public
> I'll need to keep local copies of processable data anyway (notably
> due to latency and processing requirements), and my organization
> has a capable-looking hierarchical filesystem, so I'm not so
> concerned about data security when evaluating public datastores. I
> also understand (better, after the talk) the costs associated with
> DOI minting and maintenance, and must accommodate that. But my
> datasets are fairly large (GBs, not yet TBs), so capacity is
> definitely an issue. So I'm thinking I will
> * publish DOI-worthy data and products at public DOI providers
> (e.g., pangaea.de (earth-science-specific), figshare.com), so as
> to stay within space limits for each provider.
> * publish raw-er data and products, for which DOIs are less
> necessary, at non-DOI-providers such as thedatahub.org et al
> (e.g., github), so as to stay within space limits for each
> (datadryad.org also seems useful, but *much* too late in the
> process. Why wait to open-store your data until you have a
> published article? Am I missing something?)
> 2.2 processing data and generating products using open-source and
> publicly-available code and "engines" (e.g., compilers, VMs)
> I'm only partly compliant: I use mostly R for my own work, but
> generate the base-level data using legacy models which currently
> rely on proprietary compilers.
> 2.3 managing code with open-source SCMS and public datastores (aka
> git+github seems fine for this; but, as with previous, there are a
> lotta good tools and services in this space. (That being said,
> github has some excellent documents that could be easily
> incorporated into a fastpath.) I write R, but have not been
> packaging it: gotta start doing that.
> 3 managing references with open-source RMS and public datastores.
> Lotta range here, from all-the-way FOSS (e.g., zotero) to mixed
> public/proprietary (e.g., mendeley, IIUC).
> 4 generating formally-structured content ("FSC"--e.g., articles,
> posters, presentations) directly (with maximum automation and
> reproducibility) from more process-oriented content ("POC"--in the
> above, wiki(s) and datastore(s)).
> PROBLEMS: include, more-or-less in order from least to most severe:
> 1 Gollum, github's wiki rendering engine--at least, the
> publicly-deployed version--is butt-ugly. For aesthetics, I miss
> mediawiki, but the version to which I have access is behind a
> firewall, for which it's hard to enable access. (And MW does not
> support interaction via git repo, which I very much like about
> github's wiki. And backing up an MW is still a PITA, at least
> relative to backing up a git repo.)
> 2 Gollum does not currently support automatic table-of-contents
> generation, which
> * makes page navigation *much* more difficult
> * requires lots of manual internal-link generation
> 3 I am *quite* far from an FSC-generation solution--I've been putting
> it off to generate more POC (which make the bosses happy). I need to
> spend some time with Carl Boettiger's stack
> and look @ dexy (nominated by Cameron Neylon)
> before some upcoming conferences.
> your assistance is appreciated (as is just reading this far :-)
I have read it. There's an awful lot and it will take a lot of coordination
as well as implementation. Can it be done in stages?
And note also that there are several groups thinking like you who are doing
bottom-up things. Much of this may be integration - and from my own
experience that takes time.
But good luck
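On problem 2 in the quoted list (no automatic table-of-contents generation in Gollum), one stopgap that could fit a fastpath is to generate the internal links with a small script run against the wiki's local git checkout. What follows is only a minimal sketch, assuming the wiki pages are Markdown with "#"-style headings. The names slugify and build_toc are hypothetical, and the anchor scheme (lowercase, punctuation stripped, spaces to hyphens) is an assumption: the anchors Gollum actually emits may differ and should be checked against a rendered page.

```python
import re

FENCE = "`" * 3  # marker that opens/closes a fenced code block in Markdown


def slugify(title):
    """Build an anchor from a heading title.

    ASSUMPTION: lowercase, punctuation stripped, whitespace runs -> hyphens.
    The wiki's real anchor scheme may differ.
    """
    slug = re.sub(r"[^\w\s-]", "", title.strip().lower())
    return re.sub(r"[\s_]+", "-", slug)


def build_toc(markdown_text):
    """Return a nested bullet list linking to every '#'-style heading."""
    entries = []
    in_fence = False
    for line in markdown_text.splitlines():
        # Skip fenced code blocks so '#' comments are not taken for headings.
        if line.startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = re.match(r"^(#{1,6})\s+(.+?)\s*#*\s*$", line)
        if match:
            level = len(match.group(1))
            title = match.group(2)
            indent = "  " * (level - 1)
            entries.append(f"{indent}* [{title}](#{slugify(title)})")
    return "\n".join(entries)


if __name__ == "__main__":
    sample = "# Project Intent\n\nSome text.\n\n## Data & Products\n"
    print(build_toc(sample))
```

The printed list can be pasted at the top of a page (or written there by the same script) and committed like any other wiki edit, so the table of contents is versioned along with the content it indexes.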
> open-science mailing list
> open-science at lists.okfn.org
Reader in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK