Dear Finn,

What you are doing sounds very interesting -- and you are not alone in
wishing to extract and aggregate data from scientific publications so
that it can then be distributed openly. For example, Peter
Murray-Rust's cheminformatics group at the University of Cambridge does
similar things in chemistry (see e.g. the Crystal Eye project [1]).


Regarding your specific questions:

1. I think you are right that your aggregated material may count as a
'Database'.

2. The key question concerns the original data in the
publications. In my view (but I should warn you that IANAL) it is doubtful
that the small sets of 'factual data' in individual papers are
copyrightable or protected as a DB (and even if they were, your extraction
might count as 'small'). However, I would also note that some publishers do
explicitly claim control of data (e.g. the American Chemical Society I

So you've got two options here:

1. Continue as you are. Even if someone eventually gets upset, you can
either argue with them then or take the material out (this is the 'seek
forgiveness not permission' approach). One advantage of this route is
that you already have a useful tool by the time any debate starts,
which can make quite a difference to the outcome ...

2. You can tell publishers what you are doing in advance and
see what happens.

At this point I should mention a project being developed by the
Working Group on Open Data in Science here at the Foundation

Entitled 'Is It Open Data', it's a service to make it easy for people
(scientists especially) to
make enquiries to publishers (and others) about the openness of the
scientific data they hold -- and to record publicly the results of
those efforts. You can see the current (very alpha) version here:


The FAQ/Guide may be particularly relevant given your situation:




2009/2/26 Finn Aarup Nielsen <fn at imm.dtu.dk>:
> I have recently started a wiki with scientific data and text in
> neuroscience: Brede Wiki, http://neuro.imm.dtu.dk/wiki/
> I started with triple licenses of the share-alike type: GPL, GFDL and
> CC-by-sa: http://neuro.imm.dtu.dk/wiki/Brede:Copyrights
> After browsing the Open Knowledge web-site and following links
> to licenses it seems to me that the situation is more complex.
> I manually extract data from scientific articles (more precisely results
> from statistical analyses in neuroimaging experiments) and encode them in
> MediaWiki templates so my Brede Wiki now contains content like "{{Brain
> volume | n = 1 | region = Left hippocampus | mean = 0.940 | std = 0.208 |
> unit = cm3 | group = Major depression patients }}", see the wiki page:
> http://neuro.imm.dtu.dk/wiki/Hippocampal_volume_reduction_in_major_depression
> Such data would typically be found in tables of the scientific paper. Some
> of the papers are CC-by, but most are copyrighted by commercial
> publishers.
> I have thought that such data would be "facts" or "measurements" of nature,
> not subject to copyright, but that I and other wiki-contributors
> would be able to gain Database Rights when they become aggregated with
> other results. I have seen the "Open Database License" which seems
> appropriate for distributing the database. However, it is not clear to me
> whether "left hippocampus 0.940" constitutes a copyrightable entity (a
> creative work?) belonging to the publisher or it can be considered a fact
> falling in under "Open Data Commons - Factual info licence". The worst
> case would be that publishers regard this data as under their copyright
> and regard its presentation on the web-site and its copylefted
> distribution as a violation.
> Any thought on this? I have just seen that there is some discussion in the
> paper:
> http://events.linkeddata.org/ldow2008/papers/08-miller-styles-open-data-commons.pdf
> /Finn
