[open-science] github breaks data hosting: alternatives?

Carl Boettiger cboettig at gmail.com
Wed Dec 12 22:44:32 UTC 2012


Matt,

Definitely a good point about both archival stability and DataONE
integration.  I've actually never tried the interface to the ORNL DAAC or
the KNB for uploading data -- I only see information on how to download
their data.  What license do they use?  What restrictions are there on the
kind of data that can be uploaded, and what metadata is required?  How is
the persistent archive maintained (e.g. LOCKSS/CLOCKSS, or something else)?
(Sorry for the basic questions -- maybe you could provide some links; my
google-fu is failing me.)
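
For what it's worth, the pattern you describe -- scripts that carry only a
persistent identifier and resolve it to the data's current location at run
time -- is exactly what I'd like to move towards.  A purely conceptual
sketch of the idea in R (the `get_data_by_id()` helper below is a
hypothetical stand-in for whatever the DataONE R library exposes; I haven't
tried that library yet, and the URL and DOI are placeholders):

    ## what many scripts do today: a hard-coded location that breaks
    ## whenever the file moves
    dat <- read.csv("http://example.org/mylab/some_file.csv")

    ## the identifier-based pattern: the script records only the DOI, and a
    ## client library looks up the current download location at run time.
    ## `get_data_by_id()` is a hypothetical placeholder, not a real function.
    id  <- "doi:10.xxxx/placeholder"
    dat <- get_data_by_id(id)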

Tom,

Re: figshare, data there is also given a DOI, files are permanently
archived by the external, geographically distributed CLOCKSS service
(http://clockss.org), and files can be submitted, downloaded, or browsed
via the API.  Ruby, Python, and R packages exist for interfacing with the
API.  (I helped write rfigshare
<https://github.com/ropensci/rfigshare>.)  Most file types are supported,
though plain text has archival advantages, and the required metadata is
minimal: author, title, a tag, and a category.  These features of figshare
give you some of the same advantages Matt mentions, but being part of the
DataONE ecosystem via the more established repositories has clear benefits
too.
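
In case a concrete sketch helps, uploading a dataset from R looks roughly
like the following.  The function names are from the rfigshare package as I
remember them (check the package README for the current signatures), and
the title, description, tag, category, and file path below are placeholders
rather than real files from Tom's project:

    # install.packages("rfigshare")  # or: devtools::install_github("ropensci/rfigshare")
    library(rfigshare)

    fs_auth()                                   # authenticate against the figshare API

    # create a draft article of type "dataset" (placeholder title/description)
    id <- fs_create(title = "Example emissions data (placeholder)",
                    description = "Inputs and outputs for an example workflow",
                    type = "dataset")

    fs_upload(id, "data/example_output.nc")     # attach a data file (placeholder path)
    fs_add_tags(id, "emissions")                # the minimal metadata: a tag ...
    fs_add_categories(id, "Environmental Science")  # ... and a category
    fs_make_public(id)                          # making it public assigns the DOI

Once it is public you can drop the resulting DOI (or the figshare URL) into
the GitHub README, which keeps the code and the data discoverable from one
place.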

- Carl


On Wed, Dec 12, 2012 at 2:31 PM, Matt Jones <jones at nceas.ucsb.edu> wrote:

> Tom --
>
> Another option is to use one of the open community repositories dedicated
> for science data archival and preservation.  For example, most of the
> DataONE-enabled repositories like the KNB and the ORNL DAAC enable long-term
> storage of data with an eye towards replication and reliability over
> multiple decades (the KNB is 14 years old, and the ORNL DAAC is 16 years
> old).  Using the DataONE API, you reference the globally unique identifier
> for a data set (such as a DOI) in your R scripts, rather than a specific
> file location or web location.  This allows the script to remain portable
> and functional across systems, even when used in differing environments or
> when the location of the data changes over time (because the DOI is used to
> resolve the current location of the data, not the historical location).
>  The DataONE R library is in testing now, and we'd love to have some folks
> take a look at it if you are interested.
>
> Matt
>
>
>
> On Wed, Dec 12, 2012 at 12:51 PM, Carl Boettiger <cboettig at gmail.com> wrote:
>
>> Hi Tom,
>>
>> The GitHub change is really a change of interface rather than a change of
>> functionality.  You could always commit the data files to a directory in
>> the repository (particularly if they are text files, but even for binary
>> formats such as netCDF) and link them from the repo's README.  (If they
>> are binaries you probably want to add `?raw=true` to the end of the URL so
>> the browser downloads the file instead of viewing it on GitHub, e.g.
>> https://github.com/cboettig/labnotebook/blob/master/assets/files/coloredNoise.pdf?raw=true).
>>
>>
>> Figshare provides unlimited space for public files; it just limits the
>> size of individual files (250 MB per file).  It is only the private upload
>> space that is limited to 1 GB.  If I had big datasets I wanted to make
>> public, I'd probably put them there and link them clearly from the
>> README.md on GitHub.  (Or if you go the S3 route, or use CKAN or a
>> personal/institutional server, you could still add the link to keep things
>> "in one place".  Likewise, simply link back from the figshare page to the
>> GitHub repo, etc.)
>>
>> - Carl
>>
>>
>>
>>
>> On Wed, Dec 12, 2012 at 11:39 AM, Tom Roche <Tom_Roche at pobox.com> wrote:
>>
>>>
>>> [caveat: I'm a student, without much scientific experience. Please
>>> correct the following as needed; your comments are also appreciated.]
>>>
>>> Much science informatics involves manipulation of input data, e.g., to
>>> produce visualizations, or just refined data ("analyses," in met-speak)
>>> for another stage in a pipeline. It's therefore useful for open-science
>>> projects to host not only code but its associated inputs and outputs
>>> (here called "I+O"). I found GitHub's Downloads feature (and especially
>>> its scriptable Downloads API) convenient for this purpose, e.g., for
>>>
>>> https://github.com/TomRoche/GEIA_to_NetCDF
>>>
>>> I was able to host the code in the repository, and its I+O in Downloads,
>>> keeping everything together in one project website. However,
>>>
>>> https://github.com/blog/1302-goodbye-uploads
>>> > December 11, 2012
>>> ...
>>> > GitHub previously allowed you to upload files (separate from the
>>> > versioned files) in the repository, and make it available for download
>>> > in the Downloads Tab. Supporting these types of uploads was a source
>>> > of great confusion and pain – they were too similar to the files in a
>>> > Git repository. As part of our ongoing effort to keep GitHub focused
>>> > on building software, we are deprecating the Downloads Tab.
>>>
>>> > * The ability to upload new files via the web site is disabled today.
>>>
>>> > * Existing links to previously uploaded files will continue to work
>>> >   for the foreseeable future.
>>>
>>> > * Repositories that already have uploads will continue to list their
>>> >   downloads for the next 90 days (tack on /downloads to the end of any
>>> >   repository to see them).
>>>
>>> > * The Downloads API [will] be disabled in 90 days.
>>>
>>> So I'll need to migrate my inputs and outputs, or my projects entirely.
>>> Where to? Your suggestions are appreciated. Some options (numbered
>>> solely as they spring to mind) include
>>>
>>> 1. Bitbucket: also provides free public repos with a Downloads section,
>>>    like the GitHub status quo ante. IIUC, upload/download is not
>>>    scriptable; OTOH its wikis are much prettier than GitHub's. So I'm
>>>    inclined to migrate my stuff there. Any compelling reasons not to?
>>>
>>> 2. GitHub recommends Amazon S3
>>>
>>> https://help.github.com/articles/distributing-large-binaries
>>>
>>>    This has a "free tier" for up to 5 GB, which would cover me for now.
>>>    It would be more separate from the rest of the project (i.e., the
>>>    code repo) than currently, but that concern may just be purely
>>>    aesthetic.
>>>
>>> 3. Figshare provides free space <= 1 GB, also separate from the rest of
>>>    the project.
>>>
>>> So how would you migrate something like
>>>
>>> https://github.com/TomRoche/GEIA_to_NetCDF
>>>
>>> presuming you had no external funding (and not much internal funding :-)
>>> and wanted to keep the data close to the code? Note also that my field
>>> is atmospheric (rapidly becoming Earth-system) modeling, so bigger is
>>> definitely better when it comes to data-size limits.
>>>
>>> TIA, Tom Roche <Tom_Roche at pobox.com>
>>>
>>> _______________________________________________
>>> open-science mailing list
>>> open-science at lists.okfn.org
>>> http://lists.okfn.org/mailman/listinfo/open-science
>>> Unsubscribe: http://lists.okfn.org/mailman/options/open-science
>>>
>>
>>
>>
>> --
>> Carl Boettiger
>> UC Santa Cruz
>> http://www.carlboettiger.info/
>>
>>
>> _______________________________________________
>> open-science mailing list
>> open-science at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/open-science
>> Unsubscribe: http://lists.okfn.org/mailman/options/open-science
>>
>>
>


-- 
Carl Boettiger
UC Santa Cruz
http://www.carlboettiger.info/