[open-bibliography] Dataset Metadata

Wed Oct 20 11:30:59 UTC 2010

As ever these all seem to come from the "what shall we provide?" school 
of thinking, rather than "what do people need?"

I really don't know what metadata would make a dataset easier to use, 
and repository managers are the wrong people to ask.

Who's consuming this data? Or not consuming it due to inadequate metadata?

On 20/10/10 12:19, Adrian Pohl wrote:
> Hello,
>
> we have already experimented with dcat at the hbz. This might be of
> interest to Christopher and others thinking about describing open data
> sets, see my post to this list several weeks ago:[1]
>
> <quote>
> I have finally documented in English our experiences with the Data
> Catalog Vocabulary for describing open bibliographic data sets at
> https://wiki1.hbz-nrw.de/display/SEM/Using+dcat+for+Open+Bibliographic+Data.
> On this page you'll find some description examples, a documentation of
> our general problems and our specific problems with dcat as well as
> suggestions how to improve dcat.
>
> These experiments were part of some thinking about the
> conceptualization of an open bibliographic data infrastructure. Felix
> and I published a paper on this some weeks ago, which unfortunately
> only exists in German, see
> http://www.hbz-nrw.de/dokumentencenter/produkte/lod/aktuell/pohl_ostrowski_2010_open-data-infrastruktur.pdf.
> For now our work on this will stop. As soon as more libraries publish
> their data under an open license we certainly will pick up these
> thoughts...
> </quote>
>
> In my perception the work on the Data Catalog Vocabulary has been
> standing still for some time now, which is a pity. There are people on
> this list (Ed Summers and William Waites) who are involved in the work
> on dcat: Do you share this perception? Perhaps we can stimulate a
> resumption of the work on dcat. I think a proper RDF vocabulary for
> describing data sets is crucial for the development of Open Data.
>
> Since publishing this, another possible problem with dcat has occurred
> to me: I wonder whether it is good that dcat primarily describes
> _catalogs_ as aggregations of records which describe data sets. I
> believe that an approach which focuses on a vocabulary for describing
> data sets in the first place would be better. Such a vocabulary is
> easier to build, understand and use, and individual descriptions
> could be aggregated ex post to build catalogs.
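
A minimal sketch of what "aggregated ex post" could look like, assuming
each data set description arrives as a plain list of (subject,
predicate, object) tuples. The function name build_catalog and the
example.org URIs are made up for illustration. The namespace is the one
W3C uses for DCAT; the draft under discussion may use a different one.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCAT = "http://www.w3.org/ns/dcat#"


def build_catalog(catalog_uri, descriptions):
    """Build catalog triples from independently published descriptions.

    descriptions: iterable of triple lists, each describing one data set.
    Only subjects typed as dcat:Dataset are listed, each at most once.
    """
    catalog = [(catalog_uri, RDF_TYPE, DCAT + "Catalog")]
    seen = set()
    for triples in descriptions:
        for s, p, o in triples:
            if p == RDF_TYPE and o == DCAT + "Dataset" and s not in seen:
                seen.add(s)
                catalog.append((catalog_uri, DCAT + "dataset", s))
    return catalog


# Two hypothetical descriptions, harvested from two different sites.
site_one = [("http://example.org/data/a", RDF_TYPE, DCAT + "Dataset")]
site_two = [("http://example.net/data/b", RDF_TYPE, DCAT + "Dataset")]
catalog = build_catalog("http://example.org/catalog", [site_one, site_two])
```

The catalog holds nothing of its own here: it is derived entirely from
the individual descriptions, which stay authoritative.
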
>
> BTW, I very much favour describing decentralized
> entities - like data sets or organisations - in a DECENTRALIZED manner,
> which means: I would like to describe the Open Data from the hbz in
> ONE place on our web site, enrich it with RDFa and then provide
> central registries like CKAN with the URL where they can harvest this
> information. It is very inconvenient and unnecessary in times of
> Linked Data to have to describe the same resources in different places.
> E.g., at the moment I describe hbz Open Data in our wiki[2] and in CKAN
> [3] and I was asked to describe the data in offenedaten.de[4] - a
> German CKAN installation - as well. (I was told that different CKAN
> installations can't interchange data yet so that I'd have to provide
> both installations with the same descriptions.) I don't want to do
> this. I don't want to describe the same thing multiple times in
> different places. I want to describe the data sets once, ideally
> on my web page, and enable anybody else - by using RDFa - to harvest
> it... As far as I know it was Mark Birbeck who was the first to choose
> this approach for solving the problem he put as follows: "[H]ow could
> they create a centralised web-site of information that the public
> could search and access, when the source of that information could be
> any government department database or any public sector web-site?"[5]
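
To make the "describe once, harvest everywhere" idea concrete, here is
a rough sketch of the harvesting side. It is NOT a conformant RDFa
processor: it only reads about/property/rel/href/content attributes
and ignores prefixes, nesting and datatypes. The page fragment, the
example.org URLs and the class name are invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page fragment a publisher might keep on its own site.
PAGE = """
<div about="http://example.org/dataset/union-catalog">
  <span property="dct:title">Example union catalog export</span>
  <a rel="dct:license" href="http://example.org/license/cc0">CC0</a>
  <span property="dct:modified" content="2010-10-01"></span>
</div>
"""


class NaiveRDFaHarvester(HTMLParser):
    """Collects (subject, predicate, object) tuples from RDFa-style attributes."""

    def __init__(self):
        super().__init__()
        self.subject = None   # most recent @about value
        self.pending = None   # @property still waiting for its text content
        self.triples = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "about" in a:
            self.subject = a["about"]
        if "rel" in a and "href" in a:
            self.triples.append((self.subject, a["rel"], a["href"]))
        if "property" in a:
            if "content" in a:
                self.triples.append((self.subject, a["property"], a["content"]))
            else:
                self.pending = a["property"]

    def handle_data(self, data):
        # Use the first non-blank text after a bare @property as its value.
        if self.pending and data.strip():
            self.triples.append((self.subject, self.pending, data.strip()))
            self.pending = None


def harvest(html):
    parser = NaiveRDFaHarvester()
    parser.feed(html)
    parser.close()
    return parser.triples


triples = harvest(PAGE)
```

A registry would only need to store the page URL and re-run something
like harvest() periodically; the description itself lives in one place.
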
>
> There's another project which should be taken into account in this
> discussion which also develops a vocabulary for data sets - mainly
> research data: DataCite[6].
>
> <quote>
> DataCite is an international consortium to
> - establish easier access to scientific research data on the Internet
> - increase acceptance of research data as legitimate, citable
> contributions to the scientific record, and to
> - support data archiving that will permit results to be verified and
> re-purposed for future study.[7]
> </quote>
>
> In September they released the datacite metadata kernel - a set of 18
> metadata elements for describing research data sets - and asked for
> review.[8] I haven't had time to take a deeper look at it yet but I
> believe that the different vocabularies for describing data sets
> should be harmonized somehow.
>
> In short, I think it is vital to develop vocabularies for describing
> data sets which can be used in CKAN, in other registries and on
> individual institutional web sites. I also think that one of the
> greater problems with CKAN is that there is no standard way of
> describing a dataset in a more granular manner than is possible now.
> dcat and datacite seem to be a good start; they just have to be
> taken to a usable level...
>
> Adrian
>
> [1] http://lists.okfn.org/pipermail/open-bibliography/2010-August/000377.html
>
> [2] https://wiki1.hbz-nrw.de/display/SEM/Recently+published+Open+Data+exports
>
> [3] http://www.ckan.net/package/hbz_unioncatalog, which isn't even up to date.
>
> [4] http://offenedaten.de
>
> [5] http://blogs.talis.com/nodalities/2009/07/rdfa-and-linked-data-in-uk-government-web-sites.php
>
> [6] http://datacite.org/
>
> [7] http://datacite.org/whatisdc.html
>
> [8] See for instance here:
> https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=DC-SCIENCE;695e9016.1009
>
> 2010/10/20 Christopher Gutteridge<cjg at ecs.soton.ac.uk>:
>    
>> First of all, just to make sure people have heard of voiD
>> http://semanticweb.org/wiki/VoiD (from "Vocabulary of Interlinked
>> Datasets"), an RDF-based schema to describe linked datasets. It sounds
>> like it has a heavy overlap with dcat.
>>
>> Secondly, unless I've missed it there's something missing from both and I
>> don't think it can be usefully dealt with by schema, but rather requires
>> convention.
>>
>> That is to tell people what they can expect to get back if they resolve the
>> URI of the dataset, or the URIs of elements within the dataset. What I'm
>> proposing is that people should mint RDF classes which describe a certain
>> pattern or structure. If that pattern is not specific to their archive, they
>> should ideally use a neutral domain for it so that it can be reused by other
>> orgs. without stigma.
>>
>> I plan to write a blog post on this soon, but by way of example, here's the
>> basic ontology for an EPrints repository:  http://www.eprints.org/ontology/
>> -- you'll see that I indicate what you'll get back if you resolve a URI of
>> class EPrint or Repository. This is slightly semantically shonky as it
>> describes a property of the URI and not the concept represented by the URI.
>> sameAs does not apply in this case!
>>
>> While some datasets may be global, and worth the time and effort to build
>> custom interfaces for, others are not. For smaller and local datasets, such
>> as a bibliography, it's better to indicate what standard pattern you are
>> using.
>>
>> A) For example, the simplest would be (I'm guessing) a pattern where the
>> bibliography is a single RDF+XML document containing an unordered collection
>> of records containing flat Dublin Core metadata.
>>
>> B) The other simple one is a URI which, if resolved, will return an RDF+XML
>> document containing a bunch of triples relating that URI to a set of
>> bibliographic records held on other systems. Simple Dublin Core may be
>> included, but should not be relied on. rdf:type may be available to indicate
>> what the records may be, or the pattern(s) they fulfill, but again
>> optionally.
>>
>> Basically, it would be really useful for a consuming application to know if
>> you're dealing with an (A) or (B). If (B) then you probably need to resolve
>> all the URIs, in the case of A this is pointless. This may not be a big deal
>> on a single list of 10 items, but machine readable cues about the value of
>> resolving linked data URIs will be important as systems scale.
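
A sketch of how a consumer might act on such a cue, assuming the
pattern is announced via rdf:type on the dataset URI. The two pattern
class URIs and the record-linking property are hypothetical (no such
classes have been minted), and treating an unknown pattern like (B) is
just one conservative choice.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical pattern classes and linking property, for illustration only.
PATTERN_A = "http://example.org/patterns#SelfContainedBibliography"
PATTERN_B = "http://example.org/patterns#LinkedBibliography"
HAS_RECORD = "http://example.org/patterns#record"


def uris_to_resolve(dataset_uri, triples):
    """Return the record URIs worth dereferencing for this dataset.

    triples: list of (subject, predicate, object) tuples taken from the
    document returned when dataset_uri was resolved.
    """
    types = {o for s, p, o in triples if s == dataset_uri and p == RDF_TYPE}
    records = [o for s, p, o in triples if s == dataset_uri and p == HAS_RECORD]
    if PATTERN_A in types:
        # (A): everything is already in the document, so follow nothing.
        return []
    # (B), or no recognised pattern: the records live elsewhere.
    return records


ds = "http://example.org/bibliography/1"
doc_a = [(ds, RDF_TYPE, PATTERN_A), (ds, HAS_RECORD, "http://example.org/rec/1")]
doc_b = [(ds, RDF_TYPE, PATTERN_B), (ds, HAS_RECORD, "http://other.example/rec/1")]
```

With doc_a the consumer makes no further requests; with doc_b it
fetches each record URI. That difference is the saving that matters
once lists get long.
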
>>
>> --
>> Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248
>>
>> / Lead Developer, EPrints Project, http://eprints.org/
>> / Web Projects Manager, ECS, University of Southampton,
>> http://www.ecs.soton.ac.uk/
>> / Webmaster, Web Science Trust, http://www.webscience.org/
>>
>>
>> _______________________________________________
>> open-bibliography mailing list
>> open-bibliography at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/open-bibliography
>>
>>      

-- 
Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248

/ Lead Developer, EPrints Project, http://eprints.org/
/ Web Projects Manager, ECS, University of Southampton, http://www.ecs.soton.ac.uk/
/ Webmaster, Web Science Trust, http://www.webscience.org/