[open-science] Practical reproducible science; implications for data storage

Nick Barnes nb at climatecode.org
Wed Apr 11 10:57:42 UTC 2012


On Wed, Apr 11, 2012 at 08:15, Matthew Brett <matthew.brett at gmail.com> wrote:

> My question was - what implication does this have for sites dedicated
> to data storage?  Is practical data storage going to have to live in
> the same network as the cloud CPUs on which we run the analysis?  And
> will the data storage have to include virtual machine images?

Yes.  Computations have to be taken to Big Data (a term whose
applicability shifts over time and depends on context: currently
people generally use it to mean a petabyte or more, but moving the
computations to the data is often a good trade-off for anything over
a few tens of gigabytes, or even less).  This requires providing a
compute engine of some sort at the data, and virtualising that
engine is usually the most efficient approach.  Not always:
virtualising an ASIC system such as Anton - used for molecular
dynamics - would lose you several orders of magnitude in
performance, but I'm not aware of anyone offering specialist
hardware like that at big data repositories.  There are lots of
different sorts of engine one might virtualise, and for a big
dataset one could even devise a specialised one, but very often the
winning strategy is to follow the mass computational herd and go for
a virtualised Intel-based machine (or a virtualised cluster
thereof).
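
To make that trade-off concrete, here is a rough back-of-the-envelope
sketch in Python.  The dataset size, image size, and link speed are
illustrative assumptions, not measurements:

    # Compare moving a dataset to the computation with moving a
    # virtual machine image to the data, over the same network link.
    def transfer_hours(size_bytes, link_bits_per_sec):
        """Naive transfer time, ignoring overhead and contention."""
        return size_bytes * 8 / link_bits_per_sec / 3600.0

    dataset = 50 * 10**12    # a hypothetical 50 TB dataset
    vm_image = 10 * 10**9    # a hypothetical 10 GB VM image
    link = 10**9             # a 1 Gbit/s link

    print("data to computation: %.1f hours" % transfer_hours(dataset, link))
    print("computation to data: %.2f hours" % transfer_hours(vm_image, link))

At those numbers the dataset takes over a hundred hours to move and
the VM image a couple of minutes, and the gap only widens as the
data grows.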

There are other approaches, depending on the computation's typical
data dependencies: if a computation will only use thin slices of the
data, you can send queries to the data and stream the slices back to
the computation.  But in many fields (climate science, astronomy,
genomics, doubtless others), people are building infrastructure for
taking computations to the data.
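
As a sketch of that query-and-stream pattern: the code below pulls a
thin slice of a remote dataset over OPeNDAP with the netCDF4 Python
library, so only the requested bytes cross the network.  The server
URL and variable name are made-up placeholders, and it assumes a
netCDF4 build with OPeNDAP support:

    # Open a remote dataset; no array data is fetched yet.  The
    # OPeNDAP server answers subsetting queries, so the slice below
    # is all that travels over the wire.
    from netCDF4 import Dataset

    url = "http://example.org/opendap/surface_temperature.nc"
    ds = Dataset(url)

    tas = ds.variables["tas"]        # e.g. a (time, lat, lon) array
    cell = tas[0:12, 40, 80]         # twelve time steps at one cell

    print(cell.mean())
    ds.close()
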
-- 
Nick Barnes, Climate Code Foundation, http://climatecode.org/



