[okfn-discuss] [open-science] Practical reproducible science; implications for data storage

Wed Apr 11 21:13:15 UTC 2012

Hi,

On Wed, Apr 11, 2012 at 2:07 PM, Nick Barnes <nb at climatecode.org> wrote:
> On Wed, Apr 11, 2012 at 20:45, Matthew Brett <matthew.brett at gmail.com> wrote:
>> Hi,
>>
>> On Wed, Apr 11, 2012 at 12:33 PM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:
>>>
>>>
>>> On Wed, Apr 11, 2012 at 7:20 PM, Matthew Brett <matthew.brett at gmail.com>
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On Wed, Apr 11, 2012 at 3:57 AM, Nick Barnes <nb at climatecode.org> wrote:
>>>> > On Wed, Apr 11, 2012 at 08:15, Matthew Brett <matthew.brett at gmail.com>
>>>> > wrote:
>>>> >
>>>> >> My question was - what implication does this have for sites dedicated
>>>> >> to data storage?  Is practical data storage going to have to live in
>>>> >> the same network as the cloud CPUs on which we run the analysis?  And
>>>> >> will the data storage have to include virtual machine images?
>>>
>>>
>>> We have had some interesting presentations today - the killer is bandwidth.
>>> CPU and storage are picocents per bit/cycle.
>>>
>>>> >
>>>> > Yes.  Computations have to be taken to Big Data (a term whose
>>>> > applicability gradually shifts over time and depends on context;
>>>> > currently people generally use it to mean petabyte or above, but
>>>> > moving the computations to the data is often a good trade-off for
>>>> > anything over a few tens of gigabytes, or even less).  This requires
>>>> > providing a compute engine of some sort at the data, and virtualising
>>>> > that engine is usually the most efficient approach for that (not
>>>> > always: virtualising an ASIC system such as Anton - used for molecular
>>>> > dynamics - will lose you several orders of magnitude in performance,
>>>> > but I'm not aware of anyone offering specialist hardware like that at
>>>> > big data repositories).  There are lots of different sorts of engines
>>>> > one might virtualise, and for big datasets one could even devise a
>>>> > specialised one, but very often the winning strategy is going to be to
>>>> > follow the mass computational herd and go for a virtualised
>>>> > Intel-based machine (or virtualised cluster thereof).
>>>> >
>>>
>>> I tend to agree
>>>
>>>>
>>>> > There are other approaches, depending on the typical computation
>>>> > dependencies: if it's only going to use thin slices of the data then
>>>> > you can send queries to the data and stream the slices back to the
>>>> > computation.  But in many fields (climate science, astronomy,
>>>> > genomics, doubtless others), people are building infrastructure for
>>>> > taking computations to data.
>>>
>>>
>>> Yes.
>>>
>>> And is the data static? That's much easier to manage as a one-off cost.
>>>
>>>>
>>>> What implications do you see for smaller organizations like the OKF
>>>> and sites like http://thedatahub.org/ ?
>>>>
>>> I think we have considerable attraction for "PC-sized" datasets. Same
>>> general space as Figshare...
>>
>> I noticed that the data set from the paper I was pointing to was a few
>> GB in size, but still on Amazon because that's where the CPUs were.
>>
>> Will we have to use Amazon for our reproducible science?  If practical
>> reproducibility will need virtual machines and cluster computing, do
>> you see any prospect of this kind of infrastructure being available as
>> an open service?  What about OpenStack?  But then, who will pay for
>> the infrastructure?  Or should universities be funding labs to pay
>> Amazon?
>>
>> Best,
>>
>> Matthew
>
> Yes, probably OpenStack, or a close relation.  There seem to be some
> other candidates, such as CloudStack.
>
> As for who pays, well, if you're going to be doing some computation,
> somebody somewhere is going to pay for the power, the hardware
> depreciation, the infrastructure, the staffing.  As Peter says, the
> cost to the provider is very low: even at Amazon the retail cost is
> fairly low, and the retail prices are surely only going to go down.
> If you group together with other users you can get bulk discounts or
> even operate your own facility.  If you avoid closed-source
> dependencies, you can cheaply migrate to another provider, so the free
> market should operate efficiently.  I am sure that there are vertical
> market opportunities for entrepreneurs, selling specialised
> data+computation cloud services to science communities, and I presume
> that those niches are already filling up.

Is there anyone you know of who is involved in planning such a system
that would be both open and practical for open science?

Best,

Matthew