[ckan4rdm] Short introduction of project EDaWaX

Thu Apr 25 12:25:51 UTC 2013

On Mon, Apr 22, 2013 at 04:38:31PM +0300, Joss Winn wrote:
> What makes research data different to open government data? I think (I
> could be wrong!) that the research data community has a lot of overlap
> with the institutional repository community (often it is the same person
> performing both tasks) and that 'research data' implies a curatorial
> approach to data as practised by archivists and librarians. I suspect this
> is different to the way that CKAN is being deployed in the government
> sector. 

In Finland, we are also looking into using CKAN as the platform for
national _research_ data (i.e. data produced in research activities),
not only government data that can have secondary use as research data.
As we already have quite a bit of experience with the differences between
administrative data collected by officials vs. research data collected
by researchers, I thought I'd share my 2c.

I think two issues especially stand out when it comes to administrative
data vs. research data:

1) The primary purpose of the data differs.  Government/administrative
data is collected to fulfill a legal obligation or task, and the meaning
of administrative databases is usually defined by the operational
semantics of how that data is used.  For instance, the data about the
water pipe network of a given municipality is used for tracking down
breaks, assessing future needs, etc.  Thus, the metadata is often more
along the lines of what the pieces of information are used for.  By
contrast, data collected for research purposes is usually aimed at
answering a specific research question.  For instance, the data about
attitudes towards immigrants is used to investigate some specific
research problem such as the correlation between attitudes and hate
violence.  Thus, the metadata is often more along the lines of how the
data was collected and how it is relevant to the question at hand.

Now, when data is found in a data archive / portal, the reuser is almost
always _repurposing_ the data for their needs.  Repurposing government
data and research data for secondary use is different, because they have
different primary uses and different kinds of metadata available.

2) Because administrative data is usually used to fulfill a legal
obligation, the data _must_ be owned by the organization and there must
be somebody appointed to manage the data.  Whereas research data is
currently often a byproduct of a research report, a necessary
intermediate form that is only used to produce a research report which
is then reported back to the research funder(s) and stakeholder(s).  As
a result, there often is no party at all responsible for managing
research data, so archiving and adequately describing the dataset is not
necessarily anyone's business.  Also, the data ownership issues are
murky, and it is not clear if researchers' home organisations can publish
the research data without getting permission from the data authors.

> I don't know for sure, but the impression I get is that the effort
> by public sector open data enthusiasts deploying CKAN has not yet
> addressed the long-term curatorial functions of archiving datasets. The
> data published on data.gov.uk, for example, is already archived elsewhere.
> CKAN is not being used as the primary archival tool, but rather as a
> discovery/publishing tool. I think this discussion list is interested in
> how CKAN can be more than that.

There are some software projects that are more aimed at archiving and
long term preservation of data for reuse, and I certainly agree that
CKAN is not exploring this ground much.  However, I'm not really
convinced that typical curation activities -- such as extensive metadata,
archival of data analysis and production methods, recording data
lifecycle events, and doing migrations -- should be handled by CKAN
directly rather than by other software components that integrate with
CKAN.

> If so, this is similar to what we've done at Lincoln, where we use the
> CKAN APIs to incorporate CKAN into a curatorial deposit workflow that
> retrieves a datacite DOI and deposits metadata into Eprints, which is the
> canonical record of institutional research outputs and data.

I think PIDs and checksums are one important aspect where CKAN should
take long-term preservation of data into account.  PIDs are best
assigned, and checksums best calculated, at as early a stage in the data
lifecycle as possible.  We've also implemented PID assignment, and are
looking into implementing metadata exports to DataCite / DCI / whatnot.
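
To illustrate the checksum half of that point, here is a minimal sketch
of computing a checksum when a file first enters the system and checking
it again later.  This is not our implementation and not part of CKAN: the
function names, the choice of SHA-256 and the chunk size are all
assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in fixed-size chunks.

    Hypothetical helper for illustration; SHA-256 is an assumed default.
    """
    digest = hashlib.new(algorithm)
    with Path(path).open("rb") as handle:
        # Read in chunks so large datasets do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected, algorithm="sha256"):
    """Recompute the checksum and compare it with the value stored earlier."""
    return file_checksum(path, algorithm) == expected.lower()
```

The value returned by file_checksum() would be stored with the dataset's
metadata at deposit time, and verify_checksum() run after each copy or
migration to detect silent corruption.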

with friendly greetings,
Panu Kalliokoski			panu.kalliokoski at csc.fi
application specialist			+358 41 5323835
CSC - Tieteen tietotekniikan keskus oy