[open-science] Practical reproducible science; implications for data storage

Wed Apr 11 07:15:55 UTC 2012

Hi,

I just wanted to point out this paper in case y'all have not seen it:

http://ged.msu.edu/papers/2012-diginorm/

In particular see here for instructions on running the code yourself,
on Amazon's web services:

http://ged.msu.edu/angus/diginorm-2012/pipeline-notes.html

The short story is that the Amazon virtual machine images and public
data storage allow you to run the exact analysis used for the figures
in the paper.

StarCluster [1] and the IPython notebook and parallel machinery [2]
make it much easier to use Amazon clusters for this purpose.

According to the author, setting this up was relatively easy:

http://ivory.idyll.org/blog/apr-12/replication-i.html

My question is: what implications does this have for sites dedicated
to data storage?  Is practical data storage going to have to live in
the same network as the cloud CPUs on which we run the analysis?  And
will the data storage have to include virtual machine images?

Thanks for any thoughts,

Matthew

[1] STAR: Cluster - http://web.mit.edu/star/cluster/
[2] IPython - http://ipython.org/