[open-science] github/R stack for the nomadic researcher

Carl Boettiger cboettig at gmail.com
Mon Apr 2 03:31:24 UTC 2012


Hi Tom,

Your proposal of github + knitr + markdown sounds good to me.  Here are
some thoughts based largely on my own experience.

Given you're already thinking in R code, I'd recommend structuring your
project's repository as an R package.  (Definitely see the manual for
details: <http://cran.r-project.org/doc/manuals/R-exts.html>.)

Several reasons:
1. Easily handles integration and compiling of Fortran code which can be
called from R (for visualization, etc).
2. Provides a natural way to handle dependencies (package dependencies,
config files, etc)
3. Provides a nice standard way to organize the data files, documentation
files, and metadata (license, citation, news, etc).
4. Provides utilities to automate creating documentation, running the unit
tests / error checks, etc.
5. Provides a portable object to share your project with other R users.

When writing your code, I wouldn't write everything in Sweave.  I'd
abstract the main functions and stick them in the R/ directory, with
literate-programming (i.e. roxygen) documentation.  This way they are most
functional and portable.  Some of these functions may just be calling your
Fortran code, which can sit in the src/ directory.

I'd then write knitr markdown files that call these functions with
particular choices of parameters, visualization commands, etc. as my
day-to-day code-running exploration.  (A natural place for these in an R
package is inst/examples, or the project's github wiki if you'd rather
keep them separate from the package.)
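
A minimal sketch of the layout described above, written as a shell script.
The package name mypkg, the function simulate_growth, and every file name
are hypothetical placeholders, and a real DESCRIPTION may need more fields
than shown here (the R-exts manual has the details).

```shell
#!/bin/sh
# Sketch only: "mypkg", "simulate_growth" and all file names are hypothetical.
set -eu

pkg=mypkg
mkdir -p "$pkg/R" "$pkg/src" "$pkg/inst/examples"

# Package metadata (placeholder values throughout).
cat > "$pkg/DESCRIPTION" <<'EOF'
Package: mypkg
Title: Example Research Project
Version: 0.0.1
Author: Your Name
Maintainer: Your Name <you@example.org>
Description: Functions, data, and notes for one research project.
License: GPL-3
EOF

# A reusable function with roxygen comments lives in R/.
cat > "$pkg/R/simulate_growth.R" <<'EOF'
#' Simulate discrete logistic growth (hypothetical example function)
#'
#' @param n0 initial population size
#' @param r growth rate
#' @param K carrying capacity
#' @param steps number of time steps
#' @return numeric vector of population sizes
#' @export
simulate_growth <- function(n0, r, K, steps) {
  n <- numeric(steps)
  n[1] <- n0
  for (t in seq_len(steps - 1)) {
    n[t + 1] <- n[t] + r * n[t] * (1 - n[t] / K)
  }
  n
}
EOF

# Empty placeholder where Fortran sources called from R would go.
: > "$pkg/src/model.f90"

# A day-to-day knitr markdown note that calls the function with
# one particular choice of parameters.
printf '%s\n' \
  '# Growth exploration' \
  '' \
  '```{r}' \
  'library(mypkg)' \
  'plot(simulate_growth(n0 = 10, r = 0.5, K = 100, steps = 50))' \
  '```' \
  > "$pkg/inst/examples/growth.Rmd"
```

Once the placeholders are replaced with real code, R CMD build and
R CMD check are the standard commands for bundling and checking the package.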

Of course the whole project is a git repository, so each file has full
version history.  Large data files probably need to live elsewhere, as do
images created when the code runs (knitr can upload these automatically,
as you probably know).

That's what I try to do anyway -- here's an example of such a project
structure <https://github.com/cboettig/pdg_control/>, in all its
imperfection.

Publishing to wiki / notebook / journals
I describe how this fits into my daily notebook here:
<http://www.carlboettiger.info/archives/4325>.
Using markdown as your base, instead of Sweave's .Rnw as you mentioned,
gives you more flexibility with your notes.  They can be displayed nicely,
figures embedded, mathjax and all, on github, posted to a jekyll blog or
other platform, or converted to epub/latex/pdf/doc etc. (usually with
pandoc).  Check out the knitr-book <https://github.com/yihui/knitr-book>
for a great example of this.

On Backup
I think this portable, markdown-based approach will let you move and
display the code/writing/figures on the top-level wiki most fluidly.  Use
the github repository to manage backups as well as versions -- you can
just keep copies sync'd across multiple machines, backup drives, and
servers using git, which gives you a nice redundant and distributed backup.
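
A rough sketch of that kind of redundancy, assuming nothing beyond git
itself.  The remote names "server" and "drive" are made up, and two bare
repositories in a temporary directory stand in for, say, github and a
backup drive.

```shell
#!/bin/sh
# Sketch only: all paths and remote names are hypothetical stand-ins.
set -eu

work=$(mktemp -d)

# Two bare repositories play the role of the redundant copies.
git init --quiet --bare "$work/server.git"
git init --quiet --bare "$work/drive.git"

# The working repository where day-to-day edits happen.
git init --quiet "$work/project"
cd "$work/project"
git config user.name "Example Researcher"
git config user.email "researcher@example.org"
echo "notes" > README.md
git add README.md
git commit --quiet -m "First notes"

# Register both copies as remotes and push the current branch to each.
git remote add server "$work/server.git"
git remote add drive "$work/drive.git"
branch=$(git rev-parse --abbrev-ref HEAD)
for remote in server drive; do
  git push --quiet "$remote" "$branch"
done

# Every copy now reports the same commit id.
git rev-parse HEAD
git --git-dir="$work/server.git" rev-parse "$branch"
git --git-dir="$work/drive.git" rev-parse "$branch"
```

With real remotes the URLs would point at github or another machine, and
the same loop would keep every copy current after each commit.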

On tools
You asked about choice of IDE, language, etc, and seemed to answer all your
own questions.  Use what you know best and what your colleagues know best.
 When in doubt, go for the most flexible, eh?

On citations
Not sure what's best here; I have just been exploring the options myself.
Of course .Rnw + mendeley-generated bibtex works nicely.  For a
markdown-based system, you can add pandoc-flavored markdown for citations
(which pandoc can then format for you in the markdown output), or just use
latex-style citations, which will just sit there until you use something
like pandoc to turn markdown to latex/pdf.
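
As a sketch of the pandoc-flavored option: a citation key in square
brackets, resolved against a bibtex file.  The file names and the doe2012
entry are placeholders, not a real reference.  The --citeproc option is
what recent pandoc releases use; older versions handled citations
differently, so treat the exact command as an assumption to check against
your installed version.

```shell
#!/bin/sh
# Sketch only: file names and the "doe2012" entry are placeholders.
set -eu

# A bibtex file, e.g. as exported from a reference manager.
cat > refs.bib <<'EOF'
@article{doe2012,
  author  = {Doe, Jane},
  title   = {A Placeholder Title},
  journal = {Journal of Examples},
  year    = {2012}
}
EOF

# Markdown notes using a pandoc-style citation key.
cat > notes.md <<'EOF'
Population cycles have been described before [@doe2012].

# References
EOF

# Convert only if pandoc is installed; report rather than abort if this
# pandoc version expects a different citation option.
if command -v pandoc >/dev/null 2>&1; then
  pandoc --citeproc --bibliography refs.bib -o notes.html notes.md \
    || echo "this pandoc version may need a different citation option" >&2
fi
```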

I've experimented a bit with dynamic citations (i.e. look up by doi),
making this part of knitr.  (Some notes here:
<http://www.carlboettiger.info/archives/4352>.)


I'd be curious how your experience goes, what you feel works / doesn't
work, and certainly many others on the list have further insights.

-Carl


On Sun, Apr 1, 2012 at 3:16 PM, Tom Roche <Tom_Roche at pobox.com> wrote:

>
> [apologies for length of post, but it's a big topic]
>
> summary: soliciting comment (and answers to questions at bottom) on
> use of a stack based on github and R for recording, presenting, and
> publishing of research, esp for those loosely bound to institutions.
>



>
> details:
>
> I'm a very-early-career researcher (contracting on a project which
> hopefully will also "become" my master's thesis) wanting to do
> open science. As seems usual, I've been taught a lot about my domain of
> inquiry (earth systems) and work (atmospheric modeling), but not so much
> about the mechanics of recording and sharing data, on the continuum of
> increasing formality from enabling public sharing to peer-reviewed
> publication. (It seems one normally absorbs this from one's adviser or
> supervisor, but in my case that's broken, and suspect it's often
> suboptimal for those attempting to open their science.) So, as
> previously noted
>
> http://lists.okfn.org/pipermail/open-science/2012-March/001427.html
>
> I'm attempting to discover tools and workflows for research management
> and sharing as fast as I can, in the time I can take away from research
> (and life-support :-) I've been reading a fair amount, esp about tools
> from/using github and R (and particularly about rOpenSci, though that's
> beyond my current requirements). I'm thinking I could manage my workflow
> using (in increasing level of importance to me, but decreasing level of
> familiarity to me):
>
> 1 Various data/results repositories as needed/suitable (to, e.g.,
>  overcome space limitations, access specialized functionality),
>  whether institutional (i.e., my employer or school) or more public
>  (e.g., arXiv, bitbucket, figshare, flickr, github, googledocs,
>  PANGAEA). (Thanks to Daniel Mietchen for pointer to PANGAEA.)
>
> <aside> Why not just use institutional, esp given that my employer and
> school are both public? That gets to the nomadic (or "precarious")
> requirement. My employer is under severe financial constraint, and I
> am merely a contractor, so I could be out at fairly short notice.
> (Since they pay my tuition, not to mention my food and rent, if I lose
> the contract I probably leave my current school, too, and become
> #scholarlypoor--thanks to Peter Murray-Rust for pointer to that.) My
> work workgroup's workspace is mostly inside a firewall whose admins
> block all but the group with which I'm contracting, and getting
> "outsiders" inside seems to require the proverbial act of Congress, so
> public repos are needed for outside collaboration anyway. However the
> admins @ work are fairly responsive; by contrast, the systems at my
> (school's) department are run as a cost center (e.g., one is expected
> to pay for more than a fairly small amount of space), so the admins
> tend to be as responsive as one pays them to be (which in my case is,
> not at all). Finally, if I hafta leave and find new {work, school,
> degree program}, I'll need means by which to "show my portfolio"
> anyway. </aside>
>
> 2 A top-level public github wiki, where I would
>
> * list my research plan
> * document results
> * "tie together" repositories for data and results (more below)
>
>  and from which I would hopefully
>
> * generate "reports" or "formal content," e.g., slides, presentations,
>  articles
>
>  I'm already using pages on a mediawiki @ work for this purpose,
>  which works well for me and my group, but is inside the savage
>  firewall. I could presumably forklift my pages from the firewalled
>  wiki into github (using markup=mediawiki for now, for convenience
>  and ease of backout), and switch to a more publishing-capable format
>  (more below) later.
>
> 3 Means to generate formal content from the wiki content, or easily
>  transfer content to/from the TLW (top-level wiki) to documents. This
>  is where I'm definitely weakest, but it seems (IIUC) one could do
>  this with, e.g., knitr and markdown. The advantage for me is, I'm
>  already using R for pretty much everything that's not
>
> - actually part of the model on which I work (which is all-fortran)
>
> - easier to do in bash or python (i.e., more OS-level coding)
>
> <aside> Why R and not python for data analysis and assimilation?
> For quality of community, and particularly emphasis on openness, they
> seem pretty equivalent. I've just been using R because
>
> * base R has included most of the functionality I need, but to get
>  that with python requires the Enthought tools, which may be why ...
>
> * the clusters on which I work (but lack sudo) have R, but not EPD
>  (even the free numpy/scipy currently available)
>
> * I work with more R users than python users
>
(These seem like the usual reasons to me.  Use what your colleagues use.)

> </aside>
>
> The final item is fortunately not so important to me now, because I'm
> almost entirely ignorant about it now:
>
> 4 More advanced curatorial functions, e.g.
>
> * backup: having stuff so spread-out seems scary, but the individual
>  providers seem trustworthy (famous last words :-) I know how to
>  backup my linux boxes, and do, but have only vague notions regarding
>  the backup of a widely-distributed dataset.
>
> * generating persistent identifiers, e.g., DOIs. URIs will do for now.
>
> Does the above seem feasible using the github/R stack? or are there
> superior alternatives for this usecase? I'm particularly seeking
> answers/doc to the following questions regarding item 3:
>
> 1 What are best practices for most easily/capably generating "content"
>  (e.g., slides, presentations, papers) from the code/data store(s)
>  (i.e. the TLW and repositories to which it points)? E.g.
>
> 1.1 Presuming one's analytic code is mostly R, should one be using
>    knitr, or sweave, or (given that I'm not especially TeX-strong
>    anyway) Something Completely Different?
>
> 1.2 What markup should one use in the TLW? Can one use .tex/.rnw
>    directly? If not, how much does that reduce the utility of the
>    wiki? Alternatively, are there markups which generate .tex/.rnw
>    more or less easily/capably?
>
> 1.3 What are best tools for implementing this process? Emacs, Eclipse,
>    RStudio? Or one of the "lab notebooks" (for a non-wiki
>    implementation)?
>
Why not use whichever you know best?

>
> 2 How to pull in citations/references from external stores (e.g.,
>  mendeley, zotero)? Those are probably the main research inputs which
>  I suspect I either could not, or would not want to, store in the TLW
>  (but ICBW, and if so am open to correction).
>


>
> Feel free to break out separate threads for specific topics/questions,
> move to more-relevant lists/forums, etc.
>
> TIA, Tom Roche <Tom_Roche at pobox.com>
>



-- 
Carl Boettiger
UC Davis
http://www.carlboettiger.info/