[open-science] Practical reproducible science; implications for data storage

Peter Murray-Rust pm286 at cam.ac.uk
Wed Apr 11 19:33:18 UTC 2012


On Wed, Apr 11, 2012 at 7:20 PM, Matthew Brett <matthew.brett at gmail.com> wrote:

> Hi,
>
> On Wed, Apr 11, 2012 at 3:57 AM, Nick Barnes <nb at climatecode.org> wrote:
> > On Wed, Apr 11, 2012 at 08:15, Matthew Brett <matthew.brett at gmail.com> wrote:
> >
> >> My question was - what implication does this have for sites dedicated
> >> to data storage?  Is practical data storage going to have to live in
> >> the same network as the cloud CPUs on which we run the analysis?  And
> >> will the data storage have to include virtual machine images?
>

We have had some interesting presentations today. The killer is bandwidth;
by comparison, CPU and storage cost picocents per bit or cycle.
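
To put rough numbers on the bandwidth point, here is a back-of-envelope
sketch in Python. The dataset sizes, link speeds, the 80% sustained
efficiency, and the function name are all illustrative assumptions of mine,
not figures from the presentations.

```python
# Back-of-envelope estimate: how long does it take to move a dataset
# over a network link? All numbers below are illustrative assumptions.

GB = 10**9   # bytes
TB = 10**12  # bytes
PB = 10**15  # bytes


def transfer_hours(dataset_bytes, link_bits_per_second, efficiency=0.8):
    """Return the hours needed to move dataset_bytes over a link.

    efficiency is the assumed fraction of the nominal link speed that is
    actually sustained (hypothetical default of 0.8).
    """
    usable_bits_per_second = link_bits_per_second * efficiency
    seconds = dataset_bytes * 8 / usable_bits_per_second
    return seconds / 3600


if __name__ == "__main__":
    sizes = [("50 GB", 50 * GB), ("1 TB", 1 * TB), ("1 PB", 1 * PB)]
    links = [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9)]
    for size_label, size in sizes:
        for link_label, bps in links:
            hours = transfer_hours(size, bps)
            print(f"{size_label:>6} over {link_label:>10}: {hours:10.1f} hours")
```

Under these assumptions a petabyte takes months to move even on a fast
link, while tens of gigabytes take minutes to hours, which is roughly where
the trade-off starts to matter.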

> >
> > Yes.  Computations have to be taken to Big Data (a term whose
> > applicability gradually shifts over time and depends on context;
> > currently people generally use it to mean petabyte or above, but
> > moving the computations to the data is often a good trade-off for
> > anything over a few tens of gigabytes, or even less).  This requires
> > providing a compute engine of some sort at the data, and virtualising
> > that engine is usually the most efficient approach for that (not
> > always: virtualising an ASIC system such as Anton - used for molecular
> > dynamics - will lose you several orders of magnitude in performance,
> > but I'm not aware of anyone offering specialist hardware like that at
> > big data repositories).  There are lots of different sorts of engines
> > one might virtualise, and for big datasets one could even devise a
> > specialised one, but very often the winning strategy is going to be to
> > follow the mass computational herd and go for a virtualised
> > intel-based machine (or virtualised cluster thereof).
> >
>
I tend to agree


> > There are other approaches, depending on the typical computation
> > dependencies: if it's only going to use thin slices of the data then
> > you can send queries to the data and stream the slices back to the
> > computation.  But in many fields (climate science, astronomy,
> > genomics, doubtless others), people are building infrastructure for
> > taking computations to data.
>

Yes.

And is the data static? That's much easier to manage, as a one-off cost.


> What implications do you see for smaller organizations like the OKF
> and sites like http://thedatahub.org/ ?
>

I think we have considerable appeal for "PC-sized" datasets. Same
general space as Figshare...




-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069