[open-science] Why not publish data?

Mon Mar 16 06:00:26 UTC 2009

is the title of a post just published on my blog:
http://www.gavinbaker.com/2009/03/16/why-not-publish-data/

I'm eager for comments and critique. Feel free to comment on the blog
post or respond on this list. I'll copy the post below to facilitate
discussion:

> I try to avoid writing things that may make me sound stupid, but this
> post falls in that category.
> 
> Recently I was reading about efforts related to data sharing:
> technological infrastructure, curation, educating researchers, and
> the like. I was struck by the thought that most of the advocacy for
> data sharing boils down to an exhortation to stick it in a digital
> repository.
> 
> This seems a bit odd considering that much of what propels science is
> the pressure to publish (written) results (in journals, conferences,
> monographs, etc.). There is a hierarchy of venues in terms of
> prestige, which is in turn linked to research funding, promotion,
> public attention (media coverage, policy influence), etc.
> 
> Might the best way to get researchers to share data be to create a
> similar system for datasets? It might provide a compelling incentive.
> 
> Moreover, publishing might provide a compelling incentive for the
> related task of data curation (making data understandable / usable
> to others, e.g. through formatting, annotation, etc.). Currently,
> much data doesn't see much use outside the lab where it was
> generated, so researchers have little incentive to spend time
> "prettying it up" for others (who may find the way it was recorded to
> be inscrutable). Even if they are convinced to "share" their data by
> posting it online, it may seem quite a low priority to spend time
> making it useful to others. If there were pressure to publish the
> dataset, though, then researchers would have that incentive to make
> the data as intuitively useful to others as practicable, so reviewers
> could quickly identify the novelty of the data.
> 
> This doesn't seem so outlandish to me. There are similar efforts to
> provide publication fora for materials which were traditionally
> unpublished (we might say undersupplied), such as negative results
> and experimental techniques.
> 
> If you think of it in terms of a CV, the difference is between these
> lines:
> 
> * Created and shared large, valuable dataset which is highly regarded
> by peers
> * Publication in J. Big Useful Datasets, impact factor X
> 
> It may be hard for a reviewer to quantify or validate the former; the
> latter demonstrates that the researcher's contribution has already
> been validated and provides built-in metrics to quantify the
> contribution.
> 
> There are other ways to skin the same cat. One option would be to
> build alternative systems for conferring recognition (e.g. awards,
> metrics for contributions to shared datasets, etc.). Another
> approach is to make data sharing a more enforceable part of other
> scientific endeavors, e.g. mandatory as a condition of research
> funding, mandatory as a condition of publication (of written results)
> in a journal, etc. I think multiple approaches will yield the best
> result. It seems to me that creating "journals" (or some other name)
> for "publishing" datasets could be a useful way to spur
> participation.
> 
> Has this been done already? What are the drawbacks to this approach?

Best,
-- 
Gavin Baker
http://www.gavinbaker.com/
gavin at gavinbaker.com

Science is not everything, but science is very beautiful.
    J. Robert Oppenheimer