[open-science] github breaks data hosting: alternatives?

Tom Roche Tom_Roche at pobox.com
Wed Dec 12 19:39:30 UTC 2012


[caveat: I'm a student, without much scientific experience. Please
correct the following as needed; your comments are also appreciated.]

Much science informatics involves manipulation of input data, e.g., to
produce visualizations, or just refined data ("analyses," in met-speak)
for another stage in a pipeline. It's therefore useful for open-science
projects to host not only code but its associated inputs and outputs
(here called "I+O"). I found github's Downloads function (and especially
its scriptable Downloads API) convenient for this purpose, e.g., for

https://github.com/TomRoche/GEIA_to_NetCDF

I was able to host the code in the repository, and its I+O in Downloads,
keeping everything together in one project website. However,

https://github.com/blog/1302-goodbye-uploads
> December 11, 2012 
...
> GitHub previously allowed you to upload files (separate from the
> versioned files) in the repository, and make it available for download
> in the Downloads Tab. Supporting these types of uploads was a source
> of great confusion and pain – they were too similar to the files in a
> Git repository. As part of our ongoing effort to keep GitHub focused
> on building software, we are deprecating the Downloads Tab.

> * The ability to upload new files via the web site is disabled today.

> * Existing links to previously uploaded files will continue to work
>   for the foreseeable future.

> * Repositories that already have uploads will continue to list their
>   downloads for the next 90 days (tack on /downloads to the end of any
>   repository to see them).

> * The Downloads API [will] be disabled in 90 days.

So I'll need to migrate my inputs and outputs, or my projects entirely.
Where to? Your suggestions are appreciated. Some options (numbered
solely as they spring to mind) include

1. Bitbucket: also provides free public repos with a Downloads section,
   like the github status quo ante. (IIUC) upload/download is not
   scriptable; OTOH its wikis are much prettier than github's. So I'm
   inclined to migrate my stuff there. Any compelling reasons not to?

2. Github recommends Amazon S3

https://help.github.com/articles/distributing-large-binaries

   This has a "free tier" for up to 5 GB, which would cover me for now.
   It would be more separate from the rest of the project (i.e., the
   code repo) than currently, but that concern may be purely
   aesthetic.

3. Figshare provides up to 1 GB of free space, also separate from the
   rest of the project.
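
To compare a project's I+O against the size limits quoted above, here is a
minimal sketch. It assumes the data files sit in a local directory; the names
`total_size`, `fits`, and `LIMITS_BYTES` are illustrative, the limits are only
the figures mentioned in this message (not checked against each service's
current terms), and a gigabyte is taken to be 10**9 bytes.

```python
from pathlib import Path

GB = 10**9  # assumption: decimal gigabytes

# Limits as quoted in the options above; verify against each service's terms.
LIMITS_BYTES = {
    "Amazon S3 free tier": 5 * GB,
    "Figshare free space": 1 * GB,
}


def total_size(root):
    """Sum the sizes (in bytes) of all regular files under root."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def fits(total_bytes, limits=LIMITS_BYTES):
    """Map each host name to whether total_bytes is within its limit."""
    return {name: total_bytes <= limit for name, limit in limits.items()}


if __name__ == "__main__":
    import sys

    # Hypothetical usage: pass the directory holding the inputs and outputs.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    total = total_size(root)
    print(f"{root}: {total / GB:.2f} GB")
    for name, ok in fits(total).items():
        print(f"  {name}: {'fits' if ok else 'too big'}")
```

Run against a local copy of the I+O directory, this only reports whether the
current data would fit; it says nothing about growth, which matters for
model data.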

So how would you migrate something like

https://github.com/TomRoche/GEIA_to_NetCDF

presuming you had no external funding (and not much internal funding :-)
and wanted to keep the data close to the code? Note also that my field
is atmospheric (rapidly becoming Earth-system) modeling, so bigger is
definitely better when it comes to data-size limits.

TIA, Tom Roche <Tom_Roche at pobox.com>



