[open-science] github/R stack for the nomadic researcher

Tom Roche Tom_Roche at pobox.com
Sun Apr 1 22:16:08 UTC 2012


[apologies for length of post, but it's a big topic]

summary: soliciting comment (and answers to questions at bottom) on
use of a stack based on github and R for recording, presenting, and
publishing of research, esp for those loosely bound to institutions.

details:

I'm a very-early-career researcher (contracting on a project which
hopefully will also "become" my master's thesis) wanting to do 
open science. As seems usual, I've been taught a lot about my domain of
inquiry (earth systems) and work (atmospheric modeling), but not so much
about the mechanics of recording and sharing data, on the continuum of
increasing formality from enabling public sharing to peer-reviewed
publication. (It seems one normally absorbs this from one's adviser or
supervisor, but in my case that's broken, and I suspect it's often
suboptimal for those attempting to open their science.) So, as
previously noted

http://lists.okfn.org/pipermail/open-science/2012-March/001427.html

I'm attempting to discover tools and workflows for research management
and sharing as fast as I can, in the time I can take away from research
(and life-support :-) I've been reading a fair amount, esp about tools
from/using github and R (and particularly about rOpenSci, though that's
beyond my current requirements). I'm thinking I could manage my workflow
using (in increasing order of importance to me, but decreasing order of
familiarity):

1 Various data/results repositories as needed/suitable (to, e.g.,
  overcome space limitations, access specialized functionality),
  whether institutional (i.e., my employer or school) or more public
  (e.g., arXiv, bitbucket, figshare, flickr, github, googledocs,
  PANGAEA). (Thanks to Daniel Mietchen for pointer to PANGAEA.)

<aside> Why not just use institutional, esp given that my employer and
school are both public? That gets to the nomadic (or "precarious")
requirement. My employer is under severe financial constraint, and I
am merely a contractor, so I could be out at fairly short notice.
(Since they pay my tuition, not to mention my food and rent, if I lose
the contract I probably leave my current school, too, and become
#scholarlypoor--thanks to Peter Murray-Rust for pointer to that.) My
workgroup's workspace at work is mostly inside a firewall whose admins
block all but the group with which I'm contracting, and getting
"outsiders" inside seems to require the proverbial act of Congress, so
public repos are needed for outside collaboration anyway. However, the
admins @ work are fairly responsive; by contrast, the systems at my
(school's) department are run as a cost center (e.g., one is expected
to pay for more than a fairly small amount of space), so the admins
tend to be as responsive as one pays them to be (which in my case is
not at all). Finally, if I hafta leave and find new {work, school,
degree program}, I'll need means by which to "show my portfolio"
anyway. </aside>

2 A top-level public github wiki, where I would

* list my research plan
* document results
* "tie together" repositories for data and results (more below)

  and from which I would hopefully 

* generate "reports" or "formal content," e.g., slides, presentations,
  articles

  I'm already using pages on a mediawiki @ work for this purpose,
  which works well for me and my group, but is inside the savage
  firewall. I could presumably forklift my pages from the firewalled
  wiki into github (using markup=mediawiki for now, for convenience
  and ease of backout), and switch to a more publishing-capable format
  (more below) later.

3 Means to generate formal content from the wiki content, or to easily
  transfer content between the TLW (top-level wiki) and documents. This
  is where I'm definitely weakest, but it seems (IIUC) one could do
  this with, e.g., knitr and markdown. The advantage for me is, I'm
  already using R for pretty much everything that's not

- actually part of the model on which I work (which is all-fortran)

- easier to do in bash or python (i.e., more OS-level coding)

<aside> Why R and not python for data analysis and assimilation? 
For quality of community, and particularly emphasis on openness, they
seem pretty equivalent. I've just been using R because

* base R has included most of the functionality I need, but to get
  that with python requires add-ons like numpy/scipy (e.g., as bundled
  in EPD, the Enthought Python Distribution), which may be why ...

* the clusters on which I work (but lack sudo) have R, but not EPD
  (nor even the freely available numpy/scipy)

* I work with more R users than python users
</aside>

Fortunately the final item is not so important to me at present,
because I'm almost entirely ignorant about it:

4 More advanced curatorial functions, e.g.

* backup: having stuff so spread-out seems scary, but the individual
  providers seem trustworthy (famous last words :-). I know how to
  back up my linux boxes, and do, but have only vague notions regarding
  the backup of a widely-distributed dataset.

* generating persistent identifiers, e.g., DOIs. URIs will do for now.
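
On the backup bullet above: one small, provider-independent piece can
be sketched in python (standard library only). This is a minimal
sketch, not a backup tool. It assumes each repository can be pulled
down to a local directory, and the function names (build_manifest,
compare_manifests, save_manifest) and the JSON manifest layout are
made up for illustration. The idea is to record a SHA-256 checksum per
file, so that copies spread across providers can at least be checked
against one reference copy.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under root (relative POSIX path) to its digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_manifests(reference, candidate):
    """Return (missing, changed) paths of candidate relative to reference."""
    missing = sorted(p for p in reference if p not in candidate)
    changed = sorted(
        p for p in reference
        if p in candidate and reference[p] != candidate[p]
    )
    return missing, changed


def save_manifest(manifest, path):
    """Write the manifest as sorted, indented JSON (hypothetical layout)."""
    Path(path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

E.g., compare_manifests(build_manifest("work_copy"),
build_manifest("mirror_copy")) (both directory names hypothetical)
returns the files that the mirror lacks and the files whose contents
differ. The saved manifest is small enough to keep in the TLW itself.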

Does the above seem feasible using the github/R stack? Or are there
superior alternatives for this use case? I'm particularly seeking
answers (or pointers to docs) for the following questions regarding
item 3:

1 What are best practices for most easily/capably generating "content"
  (e.g., slides, presentations, papers) from the code/data store(s)
  (i.e. the TLW and repositories to which it points)? E.g.

1.1 Presuming one's analytic code is mostly R, should one be using
    knitr, or Sweave, or (given that I'm not especially TeX-strong
    anyway) Something Completely Different?

1.2 What markup should one use in the TLW? Can one use .tex/.rnw
    directly? If not, how much does that reduce the utility of the
    wiki? Alternatively, are there markups which generate .tex/.rnw
    more or less easily/capably?

1.3 What are the best tools for implementing this process? Emacs,
    Eclipse, RStudio? Or one of the "lab notebooks" (for a non-wiki
    implementation)?

2 How to pull in citations/references from external stores (e.g.,
  Mendeley, Zotero)? Those are probably the main research inputs which
  I suspect I either could not, or would not want to, store in the TLW
  (but ICBW, and if so am open to correction).
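
For question 2, one low-tech route is to leave the references in the
external store and only cite keys in the text. Mendeley and Zotero can
both export BibTeX, so a small check can report which cited keys are
absent from the export. A minimal sketch, assuming pandoc-style
[@key] citations in the wiki text (an assumption, since the TLW markup
is still undecided) and using rough regexes rather than a real BibTeX
parser; the function names are hypothetical:

```python
import re

# "@key" not preceded by a word character or dot, so that email
# addresses like name@example.org are not mistaken for citations.
CITATION_KEY = re.compile(
    r"(?<![\w.])@([A-Za-z0-9_]+(?:[:.-][A-Za-z0-9_]+)*)"
)

# The key of a BibTeX entry: the token between "{" and the first comma.
BIBTEX_ENTRY = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def cited_keys(text):
    """Return the set of citation keys used in the text."""
    return set(CITATION_KEY.findall(text))


def bibtex_keys(bibtex):
    """Return the set of entry keys found in a BibTeX export."""
    return set(BIBTEX_ENTRY.findall(bibtex))


def missing_citations(text, bibtex):
    """Return, sorted, the keys cited in text but absent from the export."""
    return sorted(cited_keys(text) - bibtex_keys(bibtex))
```

Run over each wiki page plus the current export, missing_citations
gives a to-do list of references still to be added to the store, which
keeps the bibliography itself out of the TLW.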

Feel free to break out separate threads for specific topics/questions,
move to more-relevant lists/forums, etc.

TIA, Tom Roche <Tom_Roche at pobox.com>



