[open-science] github/R stack for the nomadic researcher

Carl Boettiger cboettig at gmail.com
Wed Apr 11 18:40:02 UTC 2012


To carry this thread a step further...

I suspect most researchers will be apt to use something if they can be
convinced that it's already accepted for use by the scientific community --
whether or not it's specifically designed for scientists.  I'd agree with
Jessy -- the internet is littered with projects that have tried to be a
"facebook for scientists", a "github for scientists", etc., which are largely
unsuccessful when the impression is that "no researcher I know & respect
uses that". Meanwhile, I've seen many skeptics go from "no
self-respecting researcher would use twitter" to "look at all these
well-respected researchers using twitter!  I better get on the boat".

I believe that having funders and journals visibly support, recommend, or
require the use of a particular repository is also essential.  This has
certainly helped repositories such as Dryad and figshare.  Meanwhile,
alternate solutions will have to distinguish themselves from other players
in this field, and the different repositories need to interact seamlessly
so researchers don't have to worry about which one to submit to.  The
DataONE project is probably a leading example of these efforts.

So, perhaps we don't need a thedatahub.org for science, perhaps we do.
What we really need, though, is established researchers using the service
and funders and publishers recommending or requiring it.

The Amazon S3 business mentioned on another recent thread is probably
important as well, at least for very large datasets.  With examples like
Titus's, or more visibly, NIH's recent announcement of freely hosting the 1000
Genomes dataset <http://aws.amazon.com/1000genomes/> there (200 terabytes),
this seems like an important player.  Perhaps thedatahub.org could simply
provide the option to post data to Amazon S3, instead of their own
servers?  Alternatively, something like globusonline
<https://www.globusonline.org/>, with its terabyte/hr transfer rates, could
help to approach big data the old-fashioned way (as used by the US DOE
supercomputing centers to share data).

I've just started exploring thedatahub.org myself.  The API (essential if I
am to be able to incorporate this into my workflow) looks great but is
giving me a few challenges.  AmazonS3 has the advantage of a much wider
developer community building tools to interact with its data storage,
though somewhat of a disadvantage in cost...  I'd be curious to hear the
experiences of others...
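
To make the API point a bit more concrete, here is a minimal sketch of
what scripting against a CKAN site could look like.  It assumes the site
exposes CKAN's action API under /api/3/action/ (as current CKAN releases
do; the version a given site runs may differ).  The base URL, the helper
names package_search_url and dataset_names, and the sample response are
illustrative assumptions only, and the parsing runs on a canned response
so nothing here touches the network.

```python
import json
from urllib.parse import urlencode

# Assumption: the site exposes CKAN's action API under /api/3/action/.
# Hypothetical base URL; substitute whichever CKAN instance you use.
BASE_URL = "http://thedatahub.org"


def package_search_url(query, rows=5, base_url=BASE_URL):
    """Build the URL for a CKAN package_search request (hypothetical helper)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url}/api/3/action/package_search?{params}"


def dataset_names(response_text):
    """Extract dataset names from a package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise RuntimeError("CKAN reported an unsuccessful request")
    return [pkg["name"] for pkg in payload["result"]["results"]]


# Canned response in the general shape CKAN returns, so the sketch runs offline.
sample = json.dumps({
    "success": True,
    "result": {"count": 1, "results": [{"name": "example-dataset"}]},
})

print(package_search_url("phylogenetics"))
print(dataset_names(sample))
```

Fetching the URL with any HTTP client and passing the response body to
dataset_names would be the obvious next step; I have left that out so
the sketch does not depend on the site being reachable.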


- Carl

On Wed, Apr 11, 2012 at 11:03 AM, Jessy Kate Schingler
<jessy at jessykate.com> wrote:

> to play devil's advocate :)
>
> i think sites like github and wordpress and all the other defacto hosted
> tools are successful specifically *because* they cross community
> boundaries, and as a result encourage cross pollination and collaboration,
> and focus efforts and (human/dollar) support. if there were data hubs
> for each possible community, then i'm worried we just end up with
> fragmentation of efforts, and confusion on the part of the user ("gee, do i
> post to the CS data hub or the web development data hub? oh whatever i'll
> just do it later.").
>
> on the other hand, as scientists posting to sites like thedatahub, we
> actually increase exposure to our data and probability of re-use/re-mixing,
> and hopefully help to dispel the notion that there is anything mysterious
> or special about "real" scientists' data. we're right in there with the
> data nerds and the software developers and the database admins and the non
> profits and the inter-governmentals sharing and refining and asking
> questions about their data. seems better for us all...
>
> to be clear, if we need more human resources to support operations or
> scale up what thedatahub.org is capable of handling, i think we should
> definitely do that and am happy to help on the sysadmin side, but IMHO we
> would reap greater rewards by creating a defacto place on the web that does
> this well for all, than by setting up a separate community.
>
> my 2c!
> jessy
>
>
> On Wed, Apr 11, 2012 at 7:48 AM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:
>
>>
>>
>> On Wed, Apr 11, 2012 at 7:45 AM, Jessy Kate Schingler <
>> jessy at jessykate.com> wrote:
>>
>>> do people think a separate instance of ckan would be useful for the open
>>> data/science community at large? or is it an issue of marketing what we
>>> have (thedatahub) better?
>>>
>>> if the former, i'm happy to help w system administration, but it's not
>>> obvious to me... curious what others think!
>>>
>>>
>> I think we should have a separate science-datahub.. I showed datahub to
>> the European Horizon2020 today - very briefly..
>>
>>
>>> jessy
>>>
>>>
>>> On Tue, Apr 10, 2012 at 1:25 AM, Mark Wainwright <
>>> mark.wainwright at okfn.org> wrote:
>>>
>>>> Yes indeed! Perhaps I could mention this submission that I threw
>>>> together for the Open Repositories conference OR12
>>>> (http://or2012.ed.ac.uk):
>>>>
>>>> http://ckan.okfnpad.org/or12
>>>>
>>>> My idea was that we could boot a new instance of ckan specialised for
>>>> research papers (slightly facetiously called thepaperhub.org), but I
>>>> don't know how easy this is, or whether there would be enthusiasm from
>>>> someone technically literate to keep it running. (Volunteers?)
>>>> Meantime thedatahub.org is a good option.
>>>>
>>>> I gather OR12 will be accepting/rejecting submissions on 16 April,
>>>> incidentally.
>>>>
>>>> Mark
>>>>
>>>>
>>>> On 2 April 2012 20:01, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:
>>>>
>>>> > On Mon, Apr 2, 2012 at 7:23 PM, Jessy Kate Schingler
>>>> > <jessy at jessykate.com> wrote:
>>>> >>
>>>> >> i agree on the dataforge front...  git doesn't handle large files well,
>>>> >> and figshare, buzzdata etc. seem to be mostly for visual or tabular data
>>>> >> sets. out of curiosity, as i'm starting to learn about thedatahub.com,
>>>> >
>>>> >
>>>> > thedatahub.org I think
>>>> >
>>>> >>
>>>> >> it seems rather perfect for data set management, and even has change
>>>> >> lists for data sets, groups, user pages, etc. (especially if there were some
>>>> >> command line tools so i could "commit" changes to my data set periodically
>>>> >> and upload them :)).
>>>> >>
>>>> >> is there a reason people find ckan/thedatahub insufficient for data
>>>> >> management needs? is it related to technical/features, or to peoples'
>>>> >> familiarity and confidence around the longevity of the site?
>>>> >
>>>> >
>>>> > It's history, I think. We should now be making the case for such a
>>>> > repository and I don't think Figshare is it. I have rather neglected
>>>> > datahub because the original CKAN was metadata-oriented.
>>>> >
>>>> > I'll be making the case in Europe next week that we badly need informal
>>>> > repositories and maybe this is the time to push the datahub?
>>>> >
>>>> > P.
>>>> >
>>>> >
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Mon, Apr 2, 2012 at 12:05 AM, Peter Murray-Rust <pm286 at cam.ac.uk>
>>>> >> wrote:
>>>> >>>
>>>> >>> Tom,
>>>> >>> This is a really valuable post. I feel your concerns directly. I have
>>>> >>> copied in our new Panton fellows (though I am sure they read this list
>>>> >>> anyway!)
>>>> >>>
>>>> >>> On Sun, Apr 1, 2012 at 11:16 PM, Tom Roche <Tom_Roche at pobox.com> wrote:
>>>> >>>>
>>>> >>>>
>>>> >>>> [apologies for length of post, but it's a big topic]
>>>> >>>
>>>> >>>
>>>> >>> No apologies needed!
>>>> >>>
>>>> >>> I am giving an important presentation to Europe "Open Infrastructures
>>>> >>> for Open Science" and Neelie Kroes and others will be there. I am getting my
>>>> >>> thoughts together as I have to give the plenary that informs the rest of the
>>>> >>> workshop. Currently my thoughts are:
>>>> >>>
>>>> >>> Europe (and the world) is losing 10 billion + in unused and restricted
>>>> >>> data. (I said this to Hargreaves)
>>>> >>> We MUST have easily accessible research repositories, probably on a
>>>> >>> domain basis (Dryad, Pangaea, TARDIS, etc.)
>>>> >>> Institutional Repos do not work for STM and never will
>>>> >>> Mandates are a blunt weapon and so far have little effectiveness
>>>> >>> Non-Commercial destroys knowledge
>>>> >>>
>>>> >>> We must give the researchers something they want. Sourceforge does this
>>>> >>> for code. I use Sourceforge (actually now Bitbucket and Github) several
>>>> >>> times a day. All my code is backed up, shareable, reusable, validated etc.
>>>> >>>
>>>> >>> There must be a "Data forge" for Europe. Figshare was built by one
>>>> >>> graduate student in one year. I would give 3rd year graduate students
>>>> >>> funding to do this - it's a hundred times more cost effective than
>>>> >>> repositories.
>>>> >>>
>>>> >>> I'd like to collect ideas on this list and present them next week
>>>> >>> (11th). An OKF data manifesto for Open Science (in Europe). Who knows what
>>>> >>> might come?
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>>
>>>> >>>>
>>>> >>> --
>>>> >>> Peter Murray-Rust
>>>> >>> Reader in Molecular Informatics
>>>> >>> Unilever Centre, Dep. Of Chemistry
>>>> >>> University of Cambridge
>>>> >>> CB2 1EW, UK
>>>> >>> +44-1223-763069
>>>> >>>
>>>> >>> _______________________________________________
>>>> >>> open-science mailing list
>>>> >>> open-science at lists.okfn.org
>>>> >>> http://lists.okfn.org/mailman/listinfo/open-science
>>>> >>>
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Jessy
>>>> >> http://jessykate.com
>>>> >>
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Peter Murray-Rust
>>>> > Reader in Molecular Informatics
>>>> > Unilever Centre, Dep. Of Chemistry
>>>> > University of Cambridge
>>>> > CB2 1EW, UK
>>>> > +44-1223-763069
>>>> >
>>>> >
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Mark Wainwright, CKAN Community Co-ordinator
>>>> Open Knowledge Foundation http://okfn.org/
>>>> Skype: m.wainwright
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Jessy
>>> http://jessykate.com
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Peter Murray-Rust
>> Reader in Molecular Informatics
>> Unilever Centre, Dep. Of Chemistry
>> University of Cambridge
>> CB2 1EW, UK
>> +44-1223-763069
>>
>
>
>
> --
> Jessy
> http://jessykate.com
>
>
>
>


-- 
Carl Boettiger
UC Davis
http://www.carlboettiger.info/