[open-science] toward a low-overhead fastpath for open-science "publishing"

Fri Aug 3 00:50:00 UTC 2012

summary: let's produce something like a fastpath (more below)
demonstrating an open-science "publishing" (more below) process
suitable for use by the very-early-career or #scholarlypoor
aspiring scientist. I sketch parts of one (below the line).

details:

Thanks, all, for your interesting talk today:

http://ckan.okfnpad.org/meetup-2012-08-02

Since there seemed to be interest (at least, from me :-), and since I
think it's a fairly general usecase that might be useful for (e.g.)
tutorials/howto's/etc, and since I'd definitely appreciate suggestions
for improvement (esp of ease-of-use :-), here's some more detail about
what I'm trying to do, and how. (Got a better way to communicate/
collaborate on this particular topic? Please lemme know.)

What I hope will emerge from this (eventually :-) is a "fastpath" for
open-science "publishing" suitable for folks like me:

"Fastpath" (a term from my years coding for a massive three-letter
acronym) refers to a document for a demo. (Or the script for a demo
video.) A fastpath is one level below a howto--even more hands on,
with even less explanation: "type this, then click that, and you'll
get this thing." Fastpaths instill confidence (to the extent they are
completely replicable), and can convince the audience to learn more.
(Notably, to RTFM--which the fastpath's intended audience is usually
loath to do.) Note that a fastpath doesn't need to be, and usually
isn't, the "absolute best way" to do something (which usually varies
by practitioner): it's just so drop-dead easy and robust (and is
therefore usually designed before, and developed with, the code it
documents) that "even Marketing can run it." FWIW, one part of an
open-science publishing fastpath should definitely be

http://trac.ckan.org/ticket/2796

I put "publishing" in quotes because the domain I target includes many
activities (e.g., logging, collaboration) that terminate (more or
less) in generation of formally-structured content ("FSC"). The other
activities are unfortunately disrespected relative to FSC, which is
what gets funding, tenure, etc--such is life.

And, in my own self-interest, I'm also definitely interested in
assistance and mentoring with my own effort in this space, which
follows; your comments/suggestions are appreciated.

----------------------------------------------------------------------

AUDIENCE: very-early-career science students (I'm a master's student)
and workers who

1 are reasonably "neterate," i.e., can work with fairly high-level
  descriptions of internet-available sites/services and the tools/
  protocols available for their access

2 must publish (if only a thesis) to "get ahead" (i.e., the
  second-lowest level of the precariat), but haven't yet (or not
  enough)

3 want to "be open," i.e., to have their project's plans, data, and
  artifacts publicly available (minimally, for collaboration). Extra
  credit for exaltation of "replication in computational science"

http://ivory.idyll.org/blog/replication-i.html

  (and thanks to Cameron Neylon for passing that pointer)

4 are #scholarlypoor in

- time, notably because they're so busy learning the domain about
  which they need to publish that they lack time to invest in
  optimizing publishing tools and processes

- money, to pay someone(s) to handle the more strictly publishing-
  oriented tasks which they might prefer to offload.

Note my audience is more early-career, and my task (below) more
individually focused, than would suit someone like Ethan Perlstein,
whose Journal Wordpress

http://lists.okfn.org/pipermail/open-science/2012-July/001754.html

I see as targeting the science worker at the next career level up: a
team builder who is necessarily more focused on team funding,
management, and marketing.

TASK: with minimum investment of (e.g.) time, money, effort,

1 make project content (e.g., plans, data, visualizations) publicly
  available *over time*: i.e., so as to be

* easily editable (including adding, subtracting, and linking
  artifacts) by authors and collaborators

* easily citable (from, e.g., webpages, email, blogs)

* secure (from vandalism and loss)

  at annual scale (e.g., 1-5 years, or however long a PhD takes
  wherever you are).

2 ease migration/transformation of project content from state="raw"
  (i.e., primarily consumed by one's team) to state="publishable"
  (i.e., suitable for formal submission to formal internal or external
  publishers or reviewers, in the formats demanded by the latter).

STRATEGY: I'm either doing or intending to do this now by

1 using a github wiki to maintain a group of pages and associated
  media relating to the project, including a top-level page (currently

https://github.com/TomRoche/cornbeltN2O/wiki/Simulation-of-N2O-Production-and-Transport-in-the-US-Cornbelt-Compared-to-Tower-Measurements

  ) from which one should be able to discover

* project intent

* project status

* whatever one seeks within the project (provided it's there :-)

  Unfortunately github's wiki docs are (IMHO) relatively poor; but,
  since the wiki is a git repo, one benefits from the
  generally-excellent git docs.
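
Since the wiki is a git repo, a local clone is just a directory of
page files, so the "discover everything from the top-level page" goal
can be partly automated. The following is a minimal sketch, not
anything I have confirmed the author uses: it assumes the wiki has
been cloned locally, that pages are Markdown files named like
`Some-Page-Title.md`, and that `[[Page Title]]` links resolve to them.
The function name `wiki_index` is hypothetical.

```python
from pathlib import Path


def wiki_index(wiki_dir):
    """Return a Markdown bullet list linking every .md page in wiki_dir.

    Assumes a locally cloned wiki whose page files are named like
    'Some-Page-Title.md'; hyphens are treated as spaces in the title.
    """
    lines = []
    for page in sorted(Path(wiki_dir).glob("*.md")):
        # Files starting with '_' (e.g. a sidebar or footer) are
        # treated as layout helpers rather than content pages.
        if page.stem.startswith("_"):
            continue
        title = page.stem.replace("-", " ")
        lines.append(f"* [[{title}]]")
    return "\n".join(lines)
```

The returned text could be pasted (or written by a script) into the
top-level page. A page title that genuinely contains hyphens would be
mangled by this naive rule, so treat the output as a draft to check.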

2 storing and processing data, and generating related products (e.g.,
  visualizations), with maximum automation and reproducibility. This
  entails

2.1 storing both raw and processed data and products in public
    datastores

    I'll need to keep local copies of processable data anyway (notably
    due to latency and processing requirements), and my organization
    has a capable-looking hierarchical filesystem, so I'm not so
    concerned about data security when evaluating public datastores. I
    also understand (better, after the talk) the costs associated with
    DOI minting and maintenance, and must accommodate that. But my
    datasets are fairly large (GBs, not yet TBs), so capacity is
    definitely an issue. So I'm thinking I will

  * publish DOI-worthy data and products at public DOI providers
    (e.g., pangaea.de (earth-science-specific), figshare.com), so as
    to stay within space limits for each provider.

  * publish raw-er data and products, for which DOIs are less
    necessary, at non-DOI-providers such as thedatahub.org et al
    (e.g., github), so as to stay within space limits for each
    provider.

    (datadryad.org also seems useful, but *much* too late in the
    process. Why wait to open-store your data until you have a
    published article? Am I missing something?)
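
The split described above (DOI-worthy products to a DOI provider,
raw-er data elsewhere, each within its space limit) can be sketched as
a small planning step run before any upload. This is only an
illustration: the store names and the quota numbers below are
placeholders I made up, not the real limits of any provider, and
`plan_uploads` is a hypothetical helper.

```python
from collections import defaultdict

# Placeholder quotas in GB. These are NOT real provider limits;
# look up the actual limits before relying on a plan.
QUOTA_GB = {"doi_store": 1.0, "bulk_store": 10.0}


def plan_uploads(datasets, quota_gb=QUOTA_GB):
    """Assign each dataset to a store, respecting per-store quotas.

    datasets: iterable of (name, size_gb, needs_doi) tuples.
    Returns (plan, overflow): plan maps store name to dataset names,
    overflow lists datasets that did not fit in their store.
    """
    plan = defaultdict(list)
    used = defaultdict(float)
    overflow = []
    for name, size_gb, needs_doi in datasets:
        store = "doi_store" if needs_doi else "bulk_store"
        if used[store] + size_gb <= quota_gb[store]:
            plan[store].append(name)
            used[store] += size_gb
        else:
            overflow.append(name)
    return dict(plan), overflow
```

Anything that lands in the overflow list needs a human decision:
subset it, compress it, or find another home for it.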

2.2 processing data and generating products using open-source and
    publicly-available code and "engines" (e.g., compilers, VMs)

    I'm only partly compliant: I use mostly R for my own work, but
    generate the base-level data using legacy models which currently
    rely on proprietary compilers.

2.3 managing code with open-source SCMS and public datastores (aka
    repositories).

    git+github seems fine for this; but, as with previous, there are a
    lotta good tools and services in this space. (That being said,
    github has some excellent documents that could be easily
    incorporated into a fastpath.) I write R, but have not been
    packaging it: gotta start doing that.

3 managing references with open-source RMS and public datastores.

  Lotta range here, from all-the-way FOSS (e.g., zotero) to mixed
  public/proprietary (e.g., mendeley, IIUC).

4 generating formally-structured content ("FSC"--e.g., articles,
  posters, presentations) directly (with maximum automation and
  reproducibility) from more process-oriented content ("POC"--in the
  above, wiki(s) and datastore(s)).
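
One way the POC-to-FSC step might be automated is to run each wiki
page through a document converter. The sketch below uses pandoc
purely as an example; pandoc is my substitution, not a tool named in
this post, and the helper names `pandoc_command` and `convert` are
hypothetical. It assumes the pages are Markdown files in a local
clone and that pandoc is installed.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(page, out_dir, to_format="latex"):
    """Build the pandoc argument list for converting one wiki page."""
    page = Path(page)
    suffix = {"latex": ".tex", "html": ".html"}[to_format]
    out = Path(out_dir) / (page.stem + suffix)
    # -f/-t select input/output formats, -s asks for a standalone
    # document, -o names the output file.
    return ["pandoc", "-f", "markdown", "-t", to_format,
            "-s", "-o", str(out), str(page)]


def convert(page, out_dir, to_format="latex"):
    """Run pandoc on one page; fail loudly if pandoc is missing."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(pandoc_command(page, out_dir, to_format), check=True)
```

Keeping the command builder separate from the runner means the
conversion recipe can be inspected (or logged, for reproducibility)
without actually running anything.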

PROBLEMS: include, more-or-less in order from least to most severe:

1 Gollum, github's wiki rendering engine--at least, the
  publicly-deployed version--is butt-ugly. For aesthetics, I miss
  mediawiki, but the version to which I have access is behind a
  firewall through which it's hard to enable access. (And MW does not
  support interaction via git repo, which I very much like about
  github's wiki. And backing up an MW is still a PITA, at least
  relative to backing up a git repo.)

2 Gollum does not currently support automatic table-of-contents
  generation, which

* makes page navigation *much* more difficult

* requires lots of manual internal-link generation
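
Until the engine does this itself, the manual internal-link chore
could be scripted against the page source in a local clone. Below is
a rough sketch, assuming Markdown pages with `#`-style headings. The
`slugify` rule is a guess at how heading anchors are formed, not
Gollum's documented behavior, so the generated links need checking
against a rendered page; both function names are hypothetical.

```python
import re

FENCE = "`" * 3  # marker that opens/closes a fenced code block
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")


def slugify(title):
    """Guess an anchor name: lowercase, drop punctuation, hyphenate."""
    slug = re.sub(r"[^\w\s-]", "", title.lower())
    return re.sub(r"\s+", "-", slug.strip())


def build_toc(markdown_text):
    """Return a nested Markdown list linking to each heading."""
    toc = []
    in_fence = False
    for line in markdown_text.splitlines():
        # Ignore '#' lines inside fenced code blocks.
        if line.startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            depth = len(match.group(1)) - 1
            title = match.group(2)
            toc.append(f"{'  ' * depth}* [{title}](#{slugify(title)})")
    return "\n".join(toc)
```

The result can be pasted at the top of a page, or regenerated by a
script before each push to the wiki repo.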

3 I am *quite* far from an FSC-generation solution--I've been putting
  it off to generate more POC (which makes the bosses happy). I need to
  spend some time with Carl Boettiger's stack

http://lists.okfn.org/pipermail/open-science/2012-April/001520.html

  and look @ dexy (nominated by Cameron Neylon)

http://www.dexy.it/

  before some upcoming conferences.

Your assistance is appreciated (as is just reading this far :-)