[open-science] Practical reproducible science; implications for data storage

Matthew Brett matthew.brett at gmail.com
Wed Apr 11 19:45:24 UTC 2012


Hi,

On Wed, Apr 11, 2012 at 12:33 PM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:
>
>
> On Wed, Apr 11, 2012 at 7:20 PM, Matthew Brett <matthew.brett at gmail.com>
> wrote:
>>
>> Hi,
>>
>> On Wed, Apr 11, 2012 at 3:57 AM, Nick Barnes <nb at climatecode.org> wrote:
>> > On Wed, Apr 11, 2012 at 08:15, Matthew Brett <matthew.brett at gmail.com>
>> > wrote:
>> >
>> >> My question was - what implication does this have for sites dedicated
>> >> to data storage?  Is practical data storage going to have to live in
>> >> the same network as the cloud CPUs on which we run the analysis?  And
>> >> will the data storage have to include virtual machine images?
>
>
> We have had some interesting presentations today - the killer is bandwidth.
> CPU and storage cost picocents per bit or per cycle.
>
>> >
>> > Yes.  Computations have to be taken to Big Data (a term whose
>> > applicability gradually shifts over time and depends on context;
>> > currently people generally use it to mean petabyte or above, but
>> > moving the computations to the data is often a good trade-off for
>> > anything over a few tens of gigabytes, or even less).  This requires
>> > providing a compute engine of some sort at the data, and virtualising
>> > that engine is usually the most efficient approach for that (not
>> > always: virtualising an ASIC system such as Anton - used for molecular
>> > dynamics - will lose you several orders of magnitude in performance,
>> > but I'm not aware of anyone offering specialist hardware like that at
>> > big data repositories).  There are lots of different sorts of engines
>> > one might virtualise, and for big datasets one could even devise a
>> > specialised one, but very often the winning strategy is going to be to
>> > follow the mass computational herd and go for a virtualised
>> > intel-based machine (or virtualised cluster thereof).
>> >
>
> I tend to agree
>
>>
>> > There are other approaches, depending on the typical computation
>> > dependencies: if it's only going to use thin slices of the data then
>> > you can send queries to the data and stream the slices back to the
>> > computation.  But in many fields (climate science, astronomy,
>> > genomics, doubtless others), people are building infrastructure for
>> > taking computations to data.
>
>
> Yes.
>
> And is the data static? That's much easier to manage, as a one-off cost.
>
>>
>> What implications do you see for smaller organizations like the OKF
>> and sites like http://thedatahub.org/ ?
>>
> I think we have considerable attraction for "PC-sized" datasets. Same
> general space as Figshare...

I noticed that the data set from the paper I was pointing to was only
a few GB in size, but it was still hosted on Amazon because that's
where the CPUs were.
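
Taking the computation to the data can then be as simple as starting
the analysis machine in the region where the bucket already lives, so
the bytes never cross the expensive wide-area link. A minimal sketch
using boto (the region, AMI and key name are hypothetical):

    # Launch a virtualised analysis machine next to the data.
    import boto.ec2

    REGION = "us-east-1"  # the region holding the data bucket
    conn = boto.ec2.connect_to_region(REGION)
    reservation = conn.run_instances(
        "ami-12345678",            # image with the analysis stack baked in
        instance_type="m1.large",
        key_name="my-analysis-key")
    instance = reservation.instances[0]
    print("Launched %s in %s" % (instance.id, REGION))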

Will we have to use Amazon for our reproducible science?  If
practical reproducibility is going to need virtual machines and
cluster computing, do you see any prospect of this kind of
infrastructure being available as an open service?  What about
OpenStack?  But then, who will pay for the infrastructure?  Or should
universities be funding labs to pay Amazon?
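
For what it's worth, the portable core of a virtual machine image is
just a precise record of the environment, which any provider - Amazon,
OpenStack, or a local cluster - can rebuild. A minimal sketch of
writing that record next to the results (the package list is a
placeholder):

    # Record the exact software versions used for the analysis.
    import platform

    packages = ["numpy", "scipy"]  # hypothetical analysis dependencies
    with open("environment.txt", "w") as f:
        f.write("python %s\n" % platform.python_version())
        for name in packages:
            module = __import__(name)
            f.write("%s %s\n" % (name, module.__version__))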

Best,

Matthew



