[open-science] github breaks data hosting: alternatives?

Carl Boettiger cboettig at gmail.com
Wed Dec 12 21:51:02 UTC 2012


Hi Tom,

The GitHub change is really a change of interface rather than a change of
functionality.  You have always been able to commit the data files to a
directory in the repository (particularly if they are text files, but even
if they are binary, as netCDF is) and link them from the repo's README.
(If they are binaries you probably want to add `?raw=true` at the end of
the URL so the browser attempts to download the file instead of viewing it
on GitHub, e.g.
https://github.com/cboettig/labnotebook/blob/master/assets/files/coloredNoise.pdf?raw=true).
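
If you are generating those README links from a script, a minimal sketch of
adding the `?raw=true` flag might look like the following.  The function
name `with_raw_flag` is just an illustrative choice, not part of any GitHub
tooling; it only rewrites the URL string and makes no request to GitHub.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_raw_flag(blob_url: str) -> str:
    """Return blob_url with raw=true set in its query string.

    Hypothetical helper: it only edits the URL text and does not
    check that the file actually exists on GitHub.
    """
    parts = urlsplit(blob_url)
    # Keep any query parameters already present, then set raw=true.
    query = dict(parse_qsl(parts.query))
    query["raw"] = "true"
    return urlunsplit(parts._replace(query=urlencode(query)))


# Example using the URL from this message, minus its existing flag.
print(with_raw_flag(
    "https://github.com/cboettig/labnotebook/blob/master/assets/files/coloredNoise.pdf"
))
```

Calling it on a URL that already carries the flag leaves the URL unchanged,
so it should be safe to run over a whole list of links.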


Figshare provides unlimited space for public files; it just limits the size
of individual files (250 MB per file).  It is only the private upload space
that is limited to 1 GB.  If I had big datasets I wanted to make public, I'd
probably stick them there and link them clearly from the README.md on
GitHub.  (Or if you go the S3 route, or CKAN, or a personal/institutional
server, you could still add the link to keep things "in one place".
Likewise, simply link back from the figshare page to the GitHub repo, etc.)
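
Before uploading, it may help to check which files would exceed a per-file
cap.  Here is a rough sketch; the 250 MB figure is simply the limit quoted
above (treated here as 250 * 10^6 bytes, which is an assumption), and
`files_over_limit` is a made-up name, not a figshare API.

```python
import os

# Assumption: per-file cap of 250 MB, read as 250 * 10**6 bytes.
FILE_LIMIT_BYTES = 250 * 1000 * 1000


def files_over_limit(root, limit=FILE_LIMIT_BYTES):
    """Return sorted (path, size_in_bytes) pairs for files under root
    whose size is strictly greater than limit."""
    oversized = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit:
                oversized.append((path, size))
    return sorted(oversized)
```

For example, `files_over_limit("data")` on a hypothetical `data` directory
would list whatever needs to be split up or hosted somewhere else.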

- Carl




On Wed, Dec 12, 2012 at 11:39 AM, Tom Roche <Tom_Roche at pobox.com> wrote:

>
> [caveat: I'm a student, without much scientific experience. Please
> correct the following as needed; your comments are also appreciated.]
>
> Much science informatics involves manipulation of input data, e.g., to
> produce visualizations, or just refined data ("analyses," in met-speak)
> for another stage in a pipeline. It's therefore useful for open-science
> projects to host not only code but its associated inputs and outputs
> (here called "I+O"). I found github's Downloads function (and esp
> its scriptable Downloads API) convenient for this purpose, e.g., for
>
> https://github.com/TomRoche/GEIA_to_NetCDF
>
> I was able to host the code in the repository, and its I+O in Downloads,
> keeping everything together in one project website. However,
>
> https://github.com/blog/1302-goodbye-uploads
> > December 11, 2012
> ...
> > GitHub previously allowed you to upload files (separate from the
> > versioned files) in the repository, and make it available for download
> > in the Downloads Tab. Supporting these types of uploads was a source
> > of great confusion and pain – they were too similar to the files in a
> > Git repository. As part of our ongoing effort to keep GitHub focused
> > on building software, we are deprecating the Downloads Tab.
>
> > * The ability to upload new files via the web site is disabled today.
>
> > * Existing links to previously uploaded files will continue to work
> >   for the foreseeable future.
>
> > * Repositories that already have uploads will continue to list their
> >   downloads for the next 90 days (tack on /downloads to the end of any
> >   repository to see them).
>
> > * The Downloads API [will] be disabled in 90 days.
>
> So I'll need to migrate my inputs and outputs, or my projects entirely.
> Where to? Your suggestions are appreciated. Some options (numbered
> solely as they spring to mind) include
>
> 1. Bitbucket: also provides free public repos with a Downloads section,
>    like the github status quo ante. (IIUC) upload/download is not
>    scriptable; OTOH its wikis are much prettier than github's. So I'm
>    inclined to migrate my stuff there. Any compelling reasons not to?
>
> 2. Github recommends Amazon S3
>
> https://help.github.com/articles/distributing-large-binaries
>
>    This has a "free tier" for up to 5 GB, which would cover me for now.
>    It would be more separate from the rest of the project (i.e., the
>    code repo) than currently, but that concern may just be purely
>    aesthetic.
>
> 3. Figshare provides free space <= 1 GB, also separate from the rest of
>    the project.
>
> So how would you migrate something like
>
> https://github.com/TomRoche/GEIA_to_NetCDF
>
> presuming you had no external funding (and not much internal funding :-)
> and wanted to keep the data close to the code? Note also that my field
> is atmospheric (rapidly becoming Earth-system) modeling, so bigger is
> definitely better when it comes to data-size limits.
>
> TIA, Tom Roche <Tom_Roche at pobox.com>
>
> _______________________________________________
> open-science mailing list
> open-science at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/open-science
> Unsubscribe: http://lists.okfn.org/mailman/options/open-science
>



-- 
Carl Boettiger
UC Santa Cruz
http://www.carlboettiger.info/