[open-science] Why not publish data?

Mon Mar 16 16:06:13 UTC 2009

Also left this as a comment at the blog:

I think it depends a lot on what you mean by a "journal". I would be against
the idea of creating a formal journal (with peer review, page numbers,
indexes) for data because I think it would be a retrograde step. The web was
built to house datasets and we are now building an infrastructure that will
let this be done properly. I think it is much more effective to build
purpose-built systems designed to hold, aggregate, and process data in
sensible, web-native ways.

If however you mean that there is value in making datasets citeable objects
and providing incentives for people to "publish" datasets in a useful way
then I am all for it. The crystallography community have been most effective
at creating "data journals" (the Acta Cryst set) and it is because of these
journals that crystallographers are amongst the most cited and highly
published scientists.

The flip side of this is the tendency of people to try and push "dataset
shaped pegs into paper sized holes". Many of the papers that describe both
biological structures and genome sequences seem to be a dataset in a
desperate search for some sort of interesting statement that can be hung off
it. Many interesting and important structures and sequences would have been
better published as "here it is, make of it what you will" rather than
trying to craft a somewhat nebulous paper around it.

The same could be said of the publication of software tools (and indeed
databases). People are desperate to publish a paper so they can add it to
their CV and so that people have something to cite. Not because a paper is a
good way of describing or documenting the work, but because it is the only
way people can currently get credit.

So my argument would be that we need to expand the way we use and apply
citations, and make much more serious efforts to make databases, web pages,
tools, etc. all solid citeable objects. Otherwise we are going to end up with
a lot more journals (don't we have enough already?) and all the additional
overheads that go with them. If the purpose of creating a journal is only to
make the dataset citeable then I think there are better ways of going about
it. 

On 16/3/09 06:00, "Gavin Baker" <gavin at gavinbaker.com> wrote:

> is the title of a post just published on my blog:
> http://www.gavinbaker.com/2009/03/16/why-not-publish-data/
> 
> I'm eager for comments and critique. Feel free to comment on the blog
> post or respond on this list. I'll copy the post below to facilitate
> discussion:
> 
>> I try to avoid writing things that may make me sound stupid, but this
>> post falls in that category.
>> 
>> Recently I was reading about efforts related to data sharing:
>> technological infrastructure, curation, educating researchers, and
>> the like. I was struck by the thought that most of the advocacy for
>> data sharing boils down to an exhortation to stick it in a digital
>> repository.
>> 
>> This seems a bit odd considering that much of what propels science is
>> the pressure to publish (written) results (in journals, conferences,
>> monographs, etc.). There is a hierarchy of venues in terms of
>> prestige, which is in turn linked to research funding, promotion,
>> public attention (media coverage, policy influence), etc.
>> 
>> Might the best way to get researchers to share data be to create a
>> similar system for datasets? It might provide a compelling incentive.
>> 
>> Moreover, publishing might provide a compelling incentive to the
>> related issue of data curation (making data understandable / usable
>> to others, e.g. through formatting, annotation, etc.). Currently,
>> much data doesn't see much use outside the lab where it was
>> generated, so researchers have little incentive to spend time
>> "prettying it up" for others (who may find the way it was recorded to
>> be inscrutable). Even if they are convinced to "share" their data by
>> posting it online, it may seem quite a low priority to spend time
>> making it useful to others. If there was pressure to publish the
>> dataset, though, then researchers would have that incentive to make
>> the data as intuitively useful to others as practicable, so reviewers
>> could quickly identify the novelty of the data.
>> 
>> This doesn't seem so outlandish to me. There are similar efforts to
>> provide publication fora for materials which were not traditionally
>> published (we might say undersupplied), such as negative results
>> and experimental techniques.
>> 
>> If you think of it in terms of a CV, the difference is between these
>> lines:
>> 
>> * Created and shared large, valuable dataset which is highly regarded
>> by peers 
>> * Publication in J. Big Useful Datasets, impact factor X
>> 
>> It may be hard for a reviewer to quantify or validate the former; the
>> latter demonstrates that the researcher's contribution has already
>> been validated and provides built-in metrics to quantify the
>> contribution.
>> 
>> There are other ways to skin the same cat. One option would be to
>> build alternative systems for conferring recognition (e.g. awards,
>> metrics for contributions to shared datasets, etc.). The other
>> approach is to make data sharing a more enforceable part of other
>> scientific endeavors, e.g. mandatory as a condition of research
>> funding, mandatory as a condition of publication (of written results)
>> in a journal, etc. I think multiple approaches will yield the best
>> result. It seems to me that creating "journals" (or some other name)
>> for "publishing" datasets could be a useful way to spur
>> participation.
>> 
>> Has this been done already? What are the drawbacks to this approach?
> 
> Best,
