[open-science] toward a low-overhead fastpath for open-science "publishing"

Peter Murray-Rust pm286 at cam.ac.uk
Fri Aug 3 07:43:21 UTC 2012

This is great, Tom.

On Fri, Aug 3, 2012 at 1:50 AM, Tom Roche <Tom_Roche at pobox.com> wrote:

> summary: let's produce something like a fastpath (more below)
> demonstrating an open-science "publishing" (more below) process
> suitable for use by the very-early-career or #scholarlypoor
> aspiring scientist. I sketch parts of one (below the line).
> details:

> Thanks all for your interesting talk today
> http://ckan.okfnpad.org/meetup-2012-08-02
> Since there seemed to be interest (at least, from me :-),

and from me!

> and since I
> think it's a fairly general usecase that might be useful for (e.g.)
> tutorials/howto's/etc, and since I'd definitely appreciate suggestions
> for improvement (esp of ease-of-use :-), here's some more detail about
> what I'm trying to do, and how. (Got a better way to communicate/
> collaborate on this particular topic? Please lemme know.)

I am sure Sophie Kershaw (Panton Fellow) reads this, but I've copied her
in as she is running a 3-week training course on data management for
starting PhDs.

> What I hope will emerge from this (eventually :-) is a "fastpath" for
> open-science "publishing" suitable for folks like me:
> "Fastpath" (a term from my years coding for a massive three-letter
> acronym) refers to a document for a demo. (Or the script for a demo
> video.) A fastpath is one level below a howto--even more hands on,
> with even less explanation: "type this, then click that, and you'll
> get this thing." Fastpaths instill confidence (to the extent they are
> completely replicable), and can convince the audience to learn more.
> (Notably, to RTFM--which the fastpath's intended audience is usually
> loath to do.) Note that a fastpath doesn't need to be, and usually
> isn't, the "absolute best way" to do something (which usually varies
> by practitioner): it's just so drop-dead easy

for the users! It isn't drop-dead easy to create!

> and robust (and is
> therefore usually designed before, and developed with, the code it
> documents) "even Marketing can run it." FWIW, one part of an
> open-science publishing fastpath should definitely be
> http://trac.ckan.org/ticket/2796
> I put "publishing" in quotes because the domain I target includes many
> activities (e.g., logging, collaboration) that terminate (more or
> less) in generation of formally-structured content ("FSC"). The other
> activities are unfortunately disrespected relative to that which gets
> funding, tenure, etc--such is life.

Yes. This is a major problem but at least funders are talking about it.

> And, in my own self-interest, I'm also definitely interested in
> assistance and mentoring with my own effort in this space, which
> follows; your comments/suggestions are appreciated.
> ----------------------------------------------------------------------
> AUDIENCE: very-early-career science students (I'm a master's student)
> and workers who
> 1 are reasonably "neterate," i.e., can work with fairly high-level
>   descriptions of internet-available sites/services and the tools/
>   protocols available for their access
> 2 must publish (if only a thesis) to "get ahead" (i.e., the
>   second-lowest level of the precariat), but haven't yet (or not
>   enough)
> 3 want to "be open," i.e., to have their project's plans, data, and
>   artifacts publicly available (minimally, for collaboration). Extra
>   credit for exaltation of "replication in computational science"
> http://ivory.idyll.org/blog/replication-i.html
>   (and thanks to Cameron Neylon for passing that pointer)
> 4 are #scholarlypoor in
> - time, notably because they're so busy learning the domain about
>   which they need to publish that they lack time to invest in
>   optimizing publishing tools and processes
> - money, to pay someone(s) to handle the more strictly publishing-
>   oriented tasks which they might prefer to offload.
> Note my audience is more early-career, and my task (below) more
> individually focused, than would suit someone like Ethan Perlstein,
> whose Journal Wordpress
> http://lists.okfn.org/pipermail/open-science/2012-July/001754.html
> I see as targeting the science worker at the next career level up: a
> team builder who is necessarily more focused on team funding,
> management, and marketing.
> TASK: with minimum investment of (e.g.) time, money, effort,
> 1 make project content (e.g., plans, data, visualizations) publicly
>   available *over time*: i.e., so as to be
> * easily editable (including adding, subtracting, and linking
>   artifacts) by authors and collaborators
> * easily citable (from, e.g., webpages, email, blogs)
> * secure (from vandalism and loss)
>   at annual scale (e.g., 1-5 years, or however long a PhD takes
>   wherever you are).
> 2 ease migration/transformation of project content from state="raw"
>   (i.e., primarily consumed by one's team) to state="publishable"
>   (i.e., suitable for formal submission to formal internal or external
>   publishers or reviewers, in the formats demanded by the latter).
> STRATEGY: I'm either doing or intending to do this now by
> 1 using a github wiki to maintain a group of pages and associated
>   media relating to the project, including a top-level page (currently
> https://github.com/TomRoche/cornbeltN2O/wiki/Simulation-of-N2O-Production-and-Transport-in-the-US-Cornbelt-Compared-to-Tower-Measurements
>   ) from which one should be able to discover
> * project intent
> * project status
> * whatever one seeks within the project (provided it's there :-)
>   Unfortunately github's wiki docs are (IMHO) relatively poor, though,
>   since the wiki is a repo, one benefits from the generally-excellent
>   git docs.
> 2 storing and processing data, and generating related products (e.g.,
>   visualizations), with maximum automation and reproducibility. This
>   entails
> 2.1 storing both raw and processed data and products in public
>     datastores
>     I'll need to keep local copies of processable data anyway (notably
>     due to latency and processing requirements), and my organization
>     has a capable-looking hierarchical filesystem, so I'm not so
>     concerned about data security when evaluating public datastores. I
>     also understand (better, after the talk) the costs associated with
>     DOI minting and maintenance, and must accommodate that. But my
>     datasets are fairly large (GBs, not yet TBs), so capacity is
>     definitely an issue. So I'm thinking I will
>   * publish DOI-worthy data and products at public DOI providers
>     (e.g., pangaea.de (earth-science-specific), figshare.com), so as
>     to stay within space limits for each provider.
>   * publish raw-er data and products, for which DOIs are less
>     necessary, at non-DOI-providers such as thedatahub.org et al
>     (e.g., github), so as to stay within space limits for each
>     provider.
>     (datadryad.org also seems useful, but *much* too late in the
>     process. Why wait to open-store your data until you have a
>     published article? Am I missing something?)
> 2.2 processing data and generating products using open-source and
>     publicly-available code and "engines" (e.g., compilers, VMs)
>     I'm only partly compliant: I use mostly R for my own work, but
>     generate the base-level data using legacy models which currently
>     rely on proprietary compilers.
> 2.3 managing code with open-source SCMS and public datastores (aka
>     repositories).
>     git+github seems fine for this; but, as with previous, there are a
>     lotta good tools and services in this space. (That being said,
>     github has some excellent documents that could be easily
>     incorporated into a fastpath.) I write R, but have not been
>     packaging it: gotta start doing that.
> 3 managing references with open-source RMS and public datastores.
>   Lotta range here, from all-the-way FOSS (e.g., zotero) to mixed
>   public/proprietary (e.g., mendeley, IIUC).
> 4 generating formally-structured content ("FSC"--e.g., articles,
>   posters, presentations) directly (with maximum automation and
>   reproducibility) from more process-oriented content ("POC"--in the
>   above, wiki(s) and datastore(s)).
> PROBLEMS: include, more-or-less in order from least to most severe:
> 1 Gollum, github's wiki rendering engine--at least, the
>   publicly-deployed version--is butt-ugly. For aesthetics, I miss
>   mediawiki, but the version to which I have access is behind a
>   firewall, for which it's hard to enable access. (And MW does not
>   support interaction via git repo, which I very much like about
>   github's wiki. And backing up an MW is still a PITA, at least
>   relative to backing up a git repo.)
> 2 Gollum does not currently support automatic table-of-contents
>   generation, which
> * makes page navigation *much* more difficult
> * requires lots of manual internal-link generation
> 3 I am *quite* far from an FSC-generation solution--I've been putting
>   it off to generate more POC (which make the bosses happy). I need to
>   spend some time with Carl Boettiger's stack
> http://lists.okfn.org/pipermail/open-science/2012-April/001520.html
>   and look @ dexy (nominated by Cameron Neylon)

Good choice

> http://www.dexy.it/
>   before some upcoming conferences.
> your assistance is appreciated (as is just reading this far :-)
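
On problem 2 above (no automatic table of contents in Gollum), one possible
stopgap is to generate the internal links with a small script before
committing a wiki page. What follows is only a minimal sketch in Python,
assuming the pages are Markdown with "#"-style headings. The names
slugify and build_toc are hypothetical, and the anchor rule in slugify is
a guess at GitHub-style anchors rather than Gollum's documented behaviour,
so the generated links would need checking against the rendered page.

```python
import re

# Matches Markdown ATX headings such as "## Project status".
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*#*\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def slugify(title):
    """Assumed anchor rule: lowercase, drop punctuation, spaces to hyphens."""
    slug = re.sub(r"[^\w\s-]", "", title.lower())
    return re.sub(r"\s+", "-", slug.strip())


def build_toc(markdown_text):
    """Return a nested bullet list with one link per heading."""
    entries = []
    in_fence = False
    for line in markdown_text.splitlines():
        # Skip anything inside fenced code so "# comment" lines are ignored.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            title = match.group(2)
            indent = "  " * (level - 1)
            entries.append(f"{indent}* [{title}](#{slugify(title)})")
    return "\n".join(entries)
```

Calling build_toc on the text of a page returns the list as a string, which
can be pasted (or scripted) into the top of that page before pushing the
wiki repo. Because the wiki is a git repo, this could run as a pre-commit
step, though that part is not shown here.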

I have read it. There's an awful lot and it will take a lot of coordination
as well as implementation. Can it be done in stages?

And note also that there are several groups thinking like you who are doing
bottom-up things. Much of this may be integration - and from my own
experience that takes time.

But good luck

> open-science mailing list
> open-science at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/open-science

Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge