[open-science] Practical reproducible science; implications for data storage

Matthew Brett matthew.brett at gmail.com
Wed Apr 11 18:20:45 UTC 2012


Hi,

On Wed, Apr 11, 2012 at 3:57 AM, Nick Barnes <nb at climatecode.org> wrote:
> On Wed, Apr 11, 2012 at 08:15, Matthew Brett <matthew.brett at gmail.com> wrote:
>
>> My question was - what implication does this have for sites dedicated
>> to data storage?  Is practical data storage going to have to live in
>> the same network as the cloud CPUs on which we run the analysis?  And
>> will the data storage have to include virtual machine images?
>
> Yes.  Computations have to be taken to Big Data (a term whose
> applicability gradually shifts over time and depends on context;
> currently people generally use it to mean petabyte or above, but
> moving the computations to the data is often a good trade-off for
> anything over a few tens of gigabytes, or even less).  This requires
> providing a compute engine of some sort at the data, and virtualising
> that engine is usually the most efficient approach (not
> always: virtualising an ASIC system such as Anton - used for molecular
> dynamics - will lose you several orders of magnitude in performance,
> but I'm not aware of anyone offering specialist hardware like that at
> big data repositories).  There are lots of different sorts of engines
> one might virtualise, and for big datasets one could even devise a
> specialised one, but very often the winning strategy is going to be to
> follow the mass computational herd and go for a virtualised
> Intel-based machine (or virtualised cluster thereof).
>
> There are other approaches, depending on the typical computation
> dependencies: if it's only going to use thin slices of the data then
> you can send queries to the data and stream the slices back to the
> computation.  But in many fields (climate science, astronomy,
> genomics, doubtless others), people are building infrastructure for
> taking computations to data.
> --
> Nick Barnes, Climate Code Foundation, http://climatecode.org/

What implications do you see for smaller organizations like the OKF
and sites like http://thedatahub.org/ ?

Best,

Matthew



