[open-science] github breaks data hosting: alternatives?

Matt Jones jones at nceas.ucsb.edu
Wed Dec 12 23:27:57 UTC 2012


Hi Carl --

The persistent archive is maintained by replication across institutional
boundaries: a monitoring system keeps track of the data replicas
across all of the DataONE Member Nodes and makes sure that, if a node
disappears, the replication policies for the data it held are still
followed.  There is a video showing how this system works here:
  http://www.dataone.org/depositing-data-into-dataone

The model is conceptually similar to LOCKSS, but the replication is
handled through a standard REST API, with coordination among the nodes of
the federation.  The replication policy for each object is set by its
initial contributor.

The DataONE API, including both the data access and data contribution REST
interfaces, is documented here:
    http://mule1.dataone.org/ArchitectureDocs-current/apis/MN_APIs.html
The API should work identically regardless of which DataONE-compatible
repository you use, as we are trying to simplify interoperability among
these repositories.
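
For example, fetching an object and its system metadata straight from a Member
Node needs nothing more than base R.  Here is a minimal sketch (the node base
URL and identifier are placeholders, and the endpoint paths are quoted from
memory, so check the architecture docs above for the authoritative forms):

    # Minimal sketch: retrieve a data object and its system metadata from a
    # DataONE Member Node.  The base URL and identifier are placeholders.
    base_url <- "https://some-member-node.example.org/mn/v1"
    pid      <- "doi:10.xxxx/EXAMPLE"

    # MNRead.get: GET <base_url>/object/<pid> returns the object bytes
    obj_url <- paste0(base_url, "/object/", URLencode(pid, reserved = TRUE))
    download.file(obj_url, destfile = "dataset.csv", mode = "wb")

    # MNRead.getSystemMetadata: GET <base_url>/meta/<pid> returns the system metadata
    meta_xml <- readLines(paste0(base_url, "/meta/", URLencode(pid, reserved = TRUE)),
                          warn = FALSE)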

In terms of what can be uploaded, that is determined by the participating
Member Nodes -- each node controls who can upload and what they can upload.
For example, Dryad is participating in DataONE, and it will only allow
data associated with journal papers to be uploaded.  In contrast,
the KNB allows any data to be uploaded that is relevant to science and is
legal to redistribute.  DataONE recognizes that many metadata standards are
in use, so it supports all of the common standards and can be easily
extended to support additional ones.  For ecological and environmental
data, EML, FGDC, or ISO 19115-compliant metadata makes the most sense,
although some groups use others (e.g., Dryad uses METS).  So it's really up
to the contributor.  There are no minimum science metadata requirements,
but a few system-level metadata fields are required (such as a
unique identifier, the type of the file, its access policy, etc.).
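
As a rough illustration (the field names here are indicative rather than the
exact schema -- the architecture docs define the real system metadata), those
required fields amount to something like:

    # Indicative only: approximate shape of the required system-level metadata,
    # written as an R list; see the DataONE architecture docs for the real schema.
    sysmeta <- list(
      identifier   = "doi:10.xxxx/EXAMPLE",               # globally unique identifier (placeholder)
      formatId     = "text/csv",                          # the type of the file
      size         = 10240,                               # size of the object in bytes
      checksum     = "9a0364b9e99bb480dd25e1f0284c8555",  # integrity check (placeholder MD5)
      accessPolicy = "public read"                        # who may read the object
    )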

In DataONE, data use policies are set for each data package, rather
than for each repository.  Thus, any given repository can hold data released
under CC0, CC-BY, or other licenses, and users must inspect the science
metadata for each data package to determine what rights they have -- the
license is specified by the data contributor.

Hope this helps clarify.

Matt



On Wed, Dec 12, 2012 at 1:44 PM, Carl Boettiger <cboettig at gmail.com> wrote:

> Matt,
>
> Definitely a good point about both archiving stability and DataONE
> integration.  I've actually never tried the interface to the DAAC or KNB for
> uploading data -- I see only information on how to download their data.
> What license do they use? What restrictions are there on the kind of data
> that can be uploaded, and required metadata? How is the persistent archive
> maintained (e.g. LOCKSS/CLOCKSS or something else)?  (Sorry for the basic
> questions; maybe you could provide some links, my google-fu is failing
> me.)
>
> Tom,
>
> Re: figshare, data there is also given a DOI, files are permanently
> archived by the external, geopolitically distributed http://clockss.org service,
> and data can be submitted, downloaded, or browsed via the API.  Ruby, Python,
> and R packages exist for interfacing with the API.  (I helped write
> rfigshare <https://github.com/ropensci/rfigshare>.)  Most filetypes are
> supported, though plain text has archival advantages; the minimal metadata is
> author, title, a tag, and a category.  These features of figshare give you
> some of the same advantages Matt mentions, but being part of the DataONE
> ecosystem via those more established repositories has clear benefits too.
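>
> For what it's worth, a minimal upload with rfigshare looks roughly like the
> sketch below (function names are from memory -- check the package README for
> the current interface):
>
>     # Rough sketch of publishing a dataset via the figshare API with rfigshare.
>     # Function names and arguments as I recall them; see the README for details.
>     library(rfigshare)
>     fs_auth()                                  # authenticate with figshare
>     id <- fs_create(title = "My dataset",
>                     description = "Model inputs and outputs",
>                     type = "dataset")          # create a draft article
>     fs_upload(id, "inputs.nc")                 # attach the data file
>     fs_add_tags(id, "atmospheric-modeling")    # minimal metadata: a tag...
>     fs_add_categories(id, "Ecology")           # ...and a category
>     fs_make_public(id)                         # publish; a DOI is assigned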
>
> - Carl
>
>
> On Wed, Dec 12, 2012 at 2:31 PM, Matt Jones <jones at nceas.ucsb.edu> wrote:
>
>> Tom --
>>
>> Another option is to use one of the open community repositories dedicated
>> to science data archival and preservation.  For example, most of the
>> DataONE-enabled repositories, like the KNB and the ORNL DAAC, provide long-term
>> storage of data with an eye towards replication and reliability over
>> multiple decades (the KNB is 14 years old, and the ORNL DAAC is 16 years
>> old).  Using the DataONE API, you reference the globally unique identifier
>> for a data set (such as a DOI) in your R scripts, rather than a specific
>> file or web location.  This allows the script to remain portable
>> and functional across systems, even when used in differing environments or
>> when the location of the data changes over time, because the DOI is used to
>> resolve the current location of the data, not a historical one.
>> The DataONE R library is in testing now, and we'd love to have some folks
>> take a look at it if you are interested.
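>>
>> To make that indirection concrete, the resolution step in plain R looks
>> roughly like this (the Coordinating Node URL and endpoint layout are from
>> memory, and the identifier is a placeholder; the R library wraps this up):
>>
>>     # Rough sketch: ask a Coordinating Node where an object currently lives,
>>     # then fetch it from one of the Member Nodes listed in the response.
>>     pid <- "doi:10.xxxx/EXAMPLE"   # placeholder identifier
>>     resolve_url <- paste0("https://cn.dataone.org/cn/v1/resolve/",
>>                           URLencode(pid, reserved = TRUE))
>>     # The response is an XML document listing Member Node locations for the object
>>     locations <- readLines(resolve_url, warn = FALSE)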
>>
>> Matt
>>
>>
>>
>> On Wed, Dec 12, 2012 at 12:51 PM, Carl Boettiger <cboettig at gmail.com>wrote:
>>
>>> Hi Tom,
>>>
>>> The GitHub change is really a change of interface rather than a change
>>> of functionality.  You could always commit the data files to a directory in
>>> the repository (particularly if they are text files, but even if they are
>>> binary, like netCDF) and link them from the repo's README.  (If they
>>> are binaries you probably want to add `?raw=true` at the end of the URL so
>>> the browser downloads the file instead of trying to view it on GitHub, e.g.
>>> https://github.com/cboettig/labnotebook/blob/master/assets/files/coloredNoise.pdf?raw=true).
>>>
>>>
>>> Figshare provides unlimited space for public files; it only limits the
>>> size of individual files (250 MB per file).  It is just the private upload
>>> space that is limited to 1 GB.  If I had big datasets I wanted to make
>>> public, I'd probably put them there and link them clearly from the
>>> README.md on GitHub.  (Or if you go the S3 route, or use CKAN or a
>>> personal/institutional server, you could still add the link to keep things
>>> "in one place".  Likewise, simply link back from the figshare page to the
>>> GitHub repo, etc.)
>>>
>>> - Carl
>>>
>>>
>>>
>>>
>>> On Wed, Dec 12, 2012 at 11:39 AM, Tom Roche <Tom_Roche at pobox.com> wrote:
>>>
>>>>
>>>> [caveat: I'm a student, without much scientific experience. Please
>>>> correct the following as needed; your comments are also appreciated.]
>>>>
>>>> Much science informatics involves manipulation of input data, e.g., to
>>>> produce visualizations, or just refined data ("analyses," in met-speak)
>>>> for another stage in a pipeline. It's therefore useful for open-science
>>>> projects to host not only code but also its associated inputs and outputs
>>>> (here called "I+O"). I found GitHub's Downloads function (and especially
>>>> its scriptable Downloads API) convenient for this purpose, e.g., for
>>>>
>>>> https://github.com/TomRoche/GEIA_to_NetCDF
>>>>
>>>> I was able to host the code in the repository, and its I+O in Downloads,
>>>> keeping everything together in one project website. However,
>>>>
>>>> https://github.com/blog/1302-goodbye-uploads
>>>> > December 11, 2012
>>>> ...
>>>> > GitHub previously allowed you to upload files (separate from the
>>>> > versioned files) in the repository, and make it available for download
>>>> > in the Downloads Tab. Supporting these types of uploads was a source
>>>> > of great confusion and pain – they were too similar to the files in a
>>>> > Git repository. As part of our ongoing effort to keep GitHub focused
>>>> > on building software, we are deprecating the Downloads Tab.
>>>>
>>>> > * The ability to upload new files via the web site is disabled today.
>>>>
>>>> > * Existing links to previously uploaded files will continue to work
>>>> >   for the foreseeable future.
>>>>
>>>> > * Repositories that already have uploads will continue to list their
>>>> >   downloads for the next 90 days (tack on /downloads to the end of any
>>>> >   repository to see them).
>>>>
>>>> > * The Downloads API [will] be disabled in 90 days.
>>>>
>>>> So I'll need to migrate my inputs and outputs, or my projects entirely.
>>>> Where to? Your suggestions are appreciated. Some options (numbered
>>>> solely as they spring to mind) include
>>>>
>>>> 1. Bitbucket: also provides free public repos with a Downloads section,
>>>>    like the GitHub status quo ante. Upload/download is (IIUC) not
>>>>    scriptable; OTOH its wikis are much prettier than GitHub's. So I'm
>>>>    inclined to migrate my stuff there. Any compelling reasons not to?
>>>>
>>>> 2. GitHub recommends Amazon S3
>>>>
>>>> https://help.github.com/articles/distributing-large-binaries
>>>>
>>>>    This has a "free tier" for up to 5 GB, which would cover me for now.
>>>>    It would be more separate from the rest of the project (i.e., the
>>>>    code repo) than the current setup, but that concern may be purely
>>>>    aesthetic.
>>>>
>>>> 3. Figshare provides free space <= 1 GB, also separate from the rest of
>>>>    the project.
>>>>
>>>> So how would you migrate something like
>>>>
>>>> https://github.com/TomRoche/GEIA_to_NetCDF
>>>>
>>>> presuming you had no external funding (and not much internal funding :-)
>>>> and wanted to keep the data close to the code? Note also that my field
>>>> is atmospheric (rapidly becoming Earth-system) modeling, so bigger is
>>>> definitely better when it comes to data-size limits.
>>>>
>>>> TIA, Tom Roche <Tom_Roche at pobox.com>
>>>>
>>>
>>>
>>>
>>> --
>>> Carl Boettiger
>>> UC Santa Cruz
>>> http://www.carlboettiger.info/
>>>
>>>
>>
>
>
> --
> Carl Boettiger
> UC Santa Cruz
> http://www.carlboettiger.info/
>
>