[open-bibliography] Dataset Metadata

Wed Oct 20 11:19:23 UTC 2010

Hello,

We have already experimented with dcat at the hbz. This might be of
interest to Christopher and others thinking about describing open data
sets, see my post to this list several weeks ago:[1]

<quote>
I have finally documented in English our experiences with the Data
Catalog Vocabulary for describing open bibliographic data sets at
https://wiki1.hbz-nrw.de/display/SEM/Using+dcat+for+Open+Bibliographic+Data.
On this page you'll find some description examples, a documentation of
our general problems and our specific problems with dcat as well as
suggestions how to improve dcat.

These experiments were part of some thinking about the
conceptualization of an open bibliographic data infrastructure. Felix
and I published a paper on this some weeks ago, which unfortunately
only exists in German, see
http://www.hbz-nrw.de/dokumentencenter/produkte/lod/aktuell/pohl_ostrowski_2010_open-data-infrastruktur.pdf.
For now our work on this will stop. As soon as more libraries publish
their data under an open license we certainly will pick up these
thoughts...
</quote>

As I see it, work on the Data Catalog Vocabulary seems to have stood
still for some time now, which is a pity. There are people on
this list (Ed Summers and William Waites) who are involved in the work
on dcat: Do you share this perception? Perhaps we can stimulate a
resumption of the work on dcat. I think a proper RDF vocabulary for
describing data sets is crucial for the development of Open Data.

Since publishing this, another possible problem with dcat has occurred
to me: I wonder whether it is good that dcat primarily describes
_catalogs_ as aggregations of records which describe data sets. I
believe that an approach which focuses first on a vocabulary for
describing the data sets themselves would be better. Such a vocabulary
is easier to build, understand and use, and individual descriptions
could be aggregated ex post to build catalogs.
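
A minimal sketch of the "describe data sets first, aggregate later"
idea, assuming each publisher exposes its own description. The class
name, the fields and the example.org URIs are hypothetical and are not
taken from dcat; a catalog here is nothing more than a collection built
after the fact from independent descriptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetDescription:
    """A stand-alone description of one data set (hypothetical fields)."""
    uri: str
    title: str
    publisher: str
    license: str


def build_catalog(catalog_uri, descriptions):
    """Aggregate independently published descriptions into a catalog.

    The first description seen for a given data set URI is kept, so the
    same data set harvested from two places is listed only once.
    """
    by_uri = {}
    for description in descriptions:
        by_uri.setdefault(description.uri, description)
    return {"catalog": catalog_uri, "datasets": list(by_uri.values())}


# Two publishers describe their own data; one description turns up twice.
union = DatasetDescription(
    uri="http://example.org/dataset/union-catalogue",
    title="Example union catalogue export",
    publisher="Example Library Service Centre",
    license="http://example.org/licenses/cc0",
)
theses = DatasetDescription(
    uri="http://example.org/dataset/theses",
    title="Example theses metadata",
    publisher="Example University Library",
    license="http://example.org/licenses/cc0",
)
catalog = build_catalog("http://example.org/catalog", [union, theses, union])
```

The point of the design is that the catalog adds no information of its
own: every statement about a data set lives in that data set's
description, and a registry only collects them.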

BTW, I very much favour describing decentralized
entities - like data sets or organisations - in a DECENTRALIZED manner,
which means: I would like to describe the Open Data from the hbz in
ONE place on our web site, enrich it with RDFa and then provide
central registries like CKAN with the URL where they can harvest this
information. In times of Linked Data it is cumbersome and unnecessary
to have to describe the same resources in different places.
E.g., at the moment I describe hbz Open Data in our wiki[2] and in CKAN
[3] and I was asked to describe the data in offenedaten.de[4] - a
German CKAN installation - as well. (I was told that different CKAN
installations can't exchange data yet, so I'd have to provide
both installations with the same descriptions.) I don't want to do
this. I don't want to describe the same thing multiple times in
different places. I want to describe the data sets once, ideally
on my web page, and enable anybody else - by using RDFa - to harvest
it... As far as I know, Mark Birbeck was the first to choose this
approach for solving the problem, which he put as follows: "[H]ow could
they create a centralised web-site of information that the public
could search and access, when the source of that information could be
any government department database or any public sector web-site?"[5]
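
To illustrate the harvesting side, here is a rough sketch using only
Python's standard library. It is not a conforming RDFa processor: it
ignores prefix declarations, subjects and nesting, and only collects
`property` and `rel` attribute values. The page content, the `dct:`
prefix usage and the example.org URIs are made up for illustration; a
real registry would first fetch the page from the URL the publisher
registered.

```python
from html.parser import HTMLParser


class PropertyHarvester(HTMLParser):
    """Collect (predicate, value) pairs from RDFa-style attributes."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._pending = None  # property whose value is the element text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel, href = attrs.get("rel"), attrs.get("href")
        # Treat only prefixed rel values (e.g. "dct:license") as
        # predicates, so that rel="stylesheet" and the like are skipped.
        if rel and href and ":" in rel:
            self.pairs.append((rel, href))
        prop = attrs.get("property")
        if prop is None:
            return
        if attrs.get("content") is not None:
            self.pairs.append((prop, attrs["content"]))
        else:
            self._pending = prop

    def handle_data(self, data):
        if self._pending is not None and data.strip():
            self.pairs.append((self._pending, data.strip()))
            self._pending = None

    def handle_endtag(self, tag):
        self._pending = None


def harvest(html):
    """Return the predicate/value pairs found in one HTML page."""
    harvester = PropertyHarvester()
    harvester.feed(html)
    harvester.close()
    return harvester.pairs


PAGE = """
<div about="http://example.org/dataset/union-catalogue">
  <span property="dct:title">Example union catalogue export</span>
  <meta property="dct:modified" content="2010-10-20" />
  <a rel="dct:license" href="http://example.org/licenses/cc0">CC0</a>
</div>
"""

pairs = harvest(PAGE)
```

With something along these lines, several registries could read the
same page, and the publisher would only have to keep that one page up
to date.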

Another project that should be taken into account in this discussion
also develops a vocabulary for data sets - mainly research data:
DataCite[6].

<quote>
DataCite is an international consortium to
- establish easier access to scientific research data on the Internet
- increase acceptance of research data as legitimate, citable
contributions to the scientific record, and to
- support data archiving that will permit results to be verified and
re-purposed for future study.[7]
</quote>

In September they released the DataCite metadata kernel - a set of 18
metadata elements for describing research data sets - and asked for
review.[8] I haven't had time to take a deeper look at it yet, but I
believe that the different vocabularies for describing data sets
should be harmonized somehow.

In short, I think one of the greater problems with CKAN is that there
is no standard way of describing a dataset in a more granular manner
than at present. Thus I think it is vital to develop vocabularies for
describing data sets which can be used in CKAN as well as in other
registries or on individual institutional web sites. dcat and
datacite seem to be a good start; they just have to be taken to a
usable level...

Adrian

[1] http://lists.okfn.org/pipermail/open-bibliography/2010-August/000377.html

[2] https://wiki1.hbz-nrw.de/display/SEM/Recently+published+Open+Data+exports

[3] http://www.ckan.net/package/hbz_unioncatalog, which isn't even up to date.

[4] http://offenedaten.de

[5] http://blogs.talis.com/nodalities/2009/07/rdfa-and-linked-data-in-uk-government-web-sites.php

[6] http://datacite.org/

[7] http://datacite.org/whatisdc.html

[8] See for instance here:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=DC-SCIENCE;695e9016.1009

2010/10/20 Christopher Gutteridge <cjg at ecs.soton.ac.uk>:
> First of all, just to make sure people have heard of voID
> http://semanticweb.org/wiki/VoiD (from "Vocabulary of Interlinked
> Datasets"), an RDF based schema to describe linked datasets. It sounds
> like it has a heavy overlap with dcat.
>
> Secondly, unless I've missed it there's something missing from both and I
> don't think it can be usefully dealt with by schema, but rather requires
> convention.
>
> That is to tell people what they can expect to get back if they resolve the
> URI of the dataset, or the URIs of elements within the dataset. What I'm
> proposing is that people should mint RDF classes which describe a certain
> pattern or structure. If that pattern is not specific to their archive, they
> should ideally use a neutral domain for it so that it can be reused by other
> orgs. without stigma.
>
> I plan to write a blog post on this soon, but by way of example, here's the
> basic ontology for an EPrints repository:  http://www.eprints.org/ontology/
> -- you'll see that I indicate what you'll get back if you resolve a URI of
> class EPrint or Repository. This is slightly semantically shonky as it
> describes a property of the URI and not the concept represented by the URI.
> sameAs does not apply in this case!
>
> While some datasets may be global, and worth the time and effort to build
> custom interfaces for, others are not. For smaller and local datasets, such
> as a bibliography, it's better to indicate what standard pattern you are
> using.
>
> A) For example, the most simple would be (I'm guessing) a pattern where the
> bibliography is a single RDF+XML document containing an unordered collection
> of records containing flat dublin-core metadata.
>
> B) The other simple one is a URI which, if resolved, will return an RDF+XML
> document containing a bunch of triples relating that URI to a set of
> bibliographic records held on other systems; simple dublin core may be
> included, but should not be relied on. rdf:type may be available to indicate
> what the records may be, or the pattern(s) they fulfill, but again
> optionally.
>
> Basically, it would be really useful for a consuming application to know if
> you're dealing with an (A) or (B). If (B) then you probably need to resolve
> all the URIs, in the case of A this is pointless. This may not be a big deal
> on a single list of 10 items, but machine readable cues about the value of
> resolving linked data URIs will be important as systems scale.
>
> --
> Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248
>
> / Lead Developer, EPrints Project, http://eprints.org/
> / Web Projects Manager, ECS, University of Southampton,
> http://www.ecs.soton.ac.uk/
> / Webmaster, Web Science Trust, http://www.webscience.org/