[open-science] Practical reproducible science; implications for data storage

Wed Apr 11 21:07:58 UTC 2012

On Wed, Apr 11, 2012 at 20:45, Matthew Brett <matthew.brett at gmail.com> wrote:
> Hi,
>
> On Wed, Apr 11, 2012 at 12:33 PM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:
>>
>>
>> On Wed, Apr 11, 2012 at 7:20 PM, Matthew Brett <matthew.brett at gmail.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> On Wed, Apr 11, 2012 at 3:57 AM, Nick Barnes <nb at climatecode.org> wrote:
>>> > On Wed, Apr 11, 2012 at 08:15, Matthew Brett <matthew.brett at gmail.com>
>>> > wrote:
>>> >
>>> >> My question was - what implication does this have for sites dedicated
>>> >> to data storage?  Is practical data storage going to have to live in
>>> >> the same network as the cloud CPUs on which we run the analysis?  And
>>> >> will the data storage have to include virtual machine images?
>>
>>
>> We have had some interesting presentations today: the killer is bandwidth.
>> CPU and storage cost picocents per bit or per cycle.
>>
>>> >
>>> > Yes.  Computations have to be taken to Big Data (a term whose
>>> > applicability gradually shifts over time and depends on context;
>>> > currently people generally use it to mean petabyte or above, but
>>> > moving the computations to the data is often a good trade-off for
>>> > anything over a few tens of gigabytes, or even less).  This requires
>>> > providing a compute engine of some sort at the data, and virtualising
>>> > that engine is usually the most efficient approach (not
>>> > always: virtualising an ASIC system such as Anton - used for molecular
>>> > dynamics - will lose you several orders of magnitude in performance,
>>> > but I'm not aware of anyone offering specialist hardware like that at
>>> > big data repositories).  There are lots of different sorts of engines
>>> > one might virtualise, and for big datasets one could even devise a
>>> > specialised one, but very often the winning strategy is going to be to
>>> > follow the mass computational herd and go for a virtualised
>>> > Intel-based machine (or virtualised cluster thereof).
>>> >
>>
>> I tend to agree
>>
>>>
>>> > There are other approaches, depending on the typical computation
>>> > dependencies: if it's only going to use thin slices of the data then
>>> > you can send queries to the data and stream the slices back to the
>>> > computation.  But in many fields (climate science, astronomy,
>>> > genomics, doubtless others), people are building infrastructure for
>>> > taking computations to data.
>>
>>
>> Yes.
>>
>> And is the data static? That's much easier to manage as a one-off cost.
>>
>>>
>>> What implications do you see for smaller organizations like the OKF
>>> and sites like http://thedatahub.org/ ?
>>>
>> I think we have considerable attraction for "PC-sized" datasets. Same
>> general space as Figshare...
>
> I noticed that the data set from the paper I was pointing to was a few
> GB in size, but still on Amazon because that's where the CPUs were.
>
> Will we have to use Amazon for our reproducible science?  If practical
> reproducibility will need virtual machines and cluster computing, do
> you see any prospect of this kind of infrastructure being available as
> an open service?  What about OpenStack?  But then, who will pay for
> the infrastructure?  Or should universities be funding labs to pay
> Amazon?
>
> Best,
>
> Matthew

Yes, probably OpenStack, or a close relation.  There seem to be some
other candidates, such as CloudStack.

As for who pays, well, if you're going to be doing some computation,
somebody somewhere is going to pay for the power, the hardware
depreciation, the infrastructure, the staffing.  As Peter says, the
cost to the provider is very low: even at Amazon the retail cost is
fairly low, and the retail prices are surely only going to go down.
If you group together with other users you can get bulk discounts or
even operate your own facility.  If you avoid closed-source
dependencies, you can cheaply migrate to another provider, so the free
market should operate efficiently.  I am sure that there are vertical
market opportunities for entrepreneurs, selling specialised
data+computation cloud services to science communities, and I presume
that those niches are already filling up.
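
The trade-off quoted earlier in this thread (moving the computation to
the data once a dataset passes a few tens of gigabytes) is at bottom
transfer-time arithmetic.  A minimal sketch follows.  The function name
transfer_hours, the link speeds, the dataset sizes and the 0.8
efficiency factor are all illustrative assumptions, not measurements
from any provider or figures from this discussion.

```python
def transfer_hours(size_gb: float, link_mbit_per_s: float,
                   efficiency: float = 0.8) -> float:
    """Estimate hours needed to move size_gb gigabytes over a network link.

    size_gb uses decimal gigabytes (1 GB = 8e9 bits).
    efficiency is an assumed fraction of the nominal link rate that is
    actually achieved; 0.8 is a placeholder, not a measured value.
    """
    if size_gb < 0 or link_mbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, link rate > 0, 0 < efficiency <= 1")
    bits = size_gb * 8e9
    effective_bits_per_s = link_mbit_per_s * 1e6 * efficiency
    return bits / effective_bits_per_s / 3600.0


if __name__ == "__main__":
    # Hypothetical dataset sizes (GB) and link speeds (Mbit/s).
    for size_gb in (5, 50, 1_000, 1_000_000):
        for link in (100, 1_000):
            hours = transfer_hours(size_gb, link)
            print(f"{size_gb:>9} GB over {link:>5} Mbit/s: {hours:10.2f} h")
```

Under these assumptions the estimate grows linearly with dataset size,
so whether to ship the data or the computation depends mostly on how
the result compares with the running time of the analysis itself.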
-- 
Nick Barnes, Climate Code Foundation, http://climatecode.org/