[open-science] github breaks data hosting: alternatives?

Matt Jones jones at nceas.ucsb.edu
Wed Dec 12 22:31:38 UTC 2012


Tom --

Another option is to use one of the open community repositories dedicated
to science data archival and preservation.  For example, most of the
DataONE-enabled repositories, such as the KNB and the ORNL DAAC, provide
long-term storage of data with an eye towards replication and reliability
over multiple decades (the KNB is 14 years old, and the ORNL DAAC is 16
years old).  Using the DataONE API, you reference the globally unique identifier
for a data set (such as a DOI) in your R scripts, rather than a specific
file location or web location.  This allows the script to remain portable
and functional across systems, even when used in differing environments or
when the location of the data changes over time (because the DOI is used to
resolve the current location of the data, not the historical location).
The DataONE R library is in testing now, and we'd love to have some folks
take a look at it if you are interested.
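
To make the identifier idea concrete, here is a minimal sketch in Python
(not the DataONE R library, whose interface is not shown here).  It only
builds the resolver URL for a DOI, assuming the public doi.org proxy; the
function name and the example DOI are hypothetical, for illustration only.

```python
from urllib.parse import quote

# Assumption: the public DOI proxy is used as the resolver.
# This is not part of the DataONE API or its R library.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Return the resolver URL for a DOI, accepting an optional 'doi:' prefix."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    # Percent-encode unsafe characters but keep the prefix/suffix slash.
    return DOI_RESOLVER + quote(doi, safe="/")


if __name__ == "__main__":
    # Hypothetical DOI, used only to show the call.
    print(doi_to_url("doi:10.1234/example.dataset"))
```

A script that stores only the DOI and builds the URL at run time keeps
working when the data moves, because the resolver redirects to wherever the
data set currently lives.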

Matt



On Wed, Dec 12, 2012 at 12:51 PM, Carl Boettiger <cboettig at gmail.com> wrote:

> Hi Tom,
>
> The GitHub change is really a change of interface rather than a change of
> functionality.  You could always commit the data files to a directory in
> the repository (particularly if they are text files, but even if they are
> binary, such as netCDF) and link them from the repo's README.  (If they
> are binaries you probably want to add `?raw=true` at the end of the URL so
> the browser downloads the file instead of trying to view it on GitHub, e.g.
> https://github.com/cboettig/labnotebook/blob/master/assets/files/coloredNoise.pdf?raw=true).
>
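
The `?raw=true` tip above can be scripted when a README has to link many
files.  This is a small sketch using only the Python standard library; the
helper name `with_raw` is made up for illustration, and the only behaviour
assumed is the one Carl describes (GitHub serves the file itself when
`raw=true` is in the query string).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_raw(url: str) -> str:
    """Return url with raw=true set in its query string, without duplicating it."""
    parts = urlsplit(url)
    # Keep any other query parameters, drop an existing 'raw' entry.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "raw"]
    query.append(("raw", "true"))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(with_raw("https://github.com/cboettig/labnotebook/blob/master/"
                   "assets/files/coloredNoise.pdf"))
```

Going through the query string rather than appending text means a URL that
already carries `?raw=true` comes back unchanged.
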
>
> Figshare provides unlimited space for public files; it only limits the
> size of individual files (250 MB per file).  It is only the private upload
> space that is limited to 1 GB.  If I had big datasets I wanted to make
> public, I'd probably put them there and link them clearly from the
> README.md on GitHub.  (Or if you go the S3 route, or use CKAN or a
> personal/institutional server, you could still add the link to keep things
> "in one place".  Likewise, simply link back from the figshare page to the
> GitHub repo, etc.)
>
> - Carl
>
>
>
>
> On Wed, Dec 12, 2012 at 11:39 AM, Tom Roche <Tom_Roche at pobox.com> wrote:
>
>>
>> [caveat: I'm a student, without much scientific experience. Please
>> correct the following as needed; your comments are also appreciated.]
>>
>> Much science informatics involves manipulation of input data, e.g., to
>> produce visualizations, or just refined data ("analyses," in met-speak)
>> for another stage in a pipeline. It's therefore useful for open-science
>> projects to host not only code but its associated inputs and outputs
>> (here called "I+O"). I found github's Downloads function (and esp
>> its scriptable Downloads API) convenient for this purpose, e.g., for
>>
>> https://github.com/TomRoche/GEIA_to_NetCDF
>>
>> I was able to host the code in the repository, and its I+O in Downloads,
>> keeping everything together in one project website. However,
>>
>> https://github.com/blog/1302-goodbye-uploads
>> > December 11, 2012
>> ...
>> > GitHub previously allowed you to upload files (separate from the
>> > versioned files) in the repository, and make it available for download
>> > in the Downloads Tab. Supporting these types of uploads was a source
>> > of great confusion and pain – they were too similar to the files in a
>> > Git repository. As part of our ongoing effort to keep GitHub focused
>> > on building software, we are deprecating the Downloads Tab.
>>
>> > * The ability to upload new files via the web site is disabled today.
>>
>> > * Existing links to previously uploaded files will continue to work
>> >   for the foreseeable future.
>>
>> > * Repositories that already have uploads will continue to list their
>> >   downloads for the next 90 days (tack on /downloads to the end of any
>> >   repository to see them).
>>
>> > * The Downloads API [will] be disabled in 90 days.
>>
>> So I'll need to migrate my inputs and outputs, or my projects entirely.
>> Where to? Your suggestions are appreciated. Some options (numbered
>> solely as they spring to mind) include
>>
>> 1. Bitbucket: also provides free public repos with a Downloads section,
>>    like the github status quo ante. (IIUC) upload/download is not
>>    scriptable; OTOH its wikis are much prettier than github's. So I'm
>>    inclined to migrate my stuff there. Any compelling reasons not to?
>>
>> 2. Github recommends Amazon S3
>>
>> https://help.github.com/articles/distributing-large-binaries
>>
>>    This has a "free tier" for up to 5 GB, which would cover me for now.
>>    It would be more separate from the rest of the project (i.e., the
>>    code repo) than currently, but that concern may just be purely
>>    aesthetic.
>>
>> 3. Figshare provides free space <= 1 GB, also separate from the rest of
>>    the project.
>>
>> So how would you migrate something like
>>
>> https://github.com/TomRoche/GEIA_to_NetCDF
>>
>> presuming you had no external funding (and not much internal funding :-)
>> and wanted to keep the data close to the code? Note also that my field
>> is atmospheric (rapidly becoming Earth-system) modeling, so bigger is
>> definitely better when it comes to data-size limits.
>>
>> TIA, Tom Roche <Tom_Roche at pobox.com>
>>
>> _______________________________________________
>> open-science mailing list
>> open-science at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/open-science
>> Unsubscribe: http://lists.okfn.org/mailman/options/open-science
>>
>
>
>
> --
> Carl Boettiger
> UC Santa Cruz
> http://www.carlboettiger.info/
>
>