[open-bibliography] OCLC adds Linked Data to WorldCat.org | DDC 23 released as linked data at dewey.info

Tue Jun 26 15:31:55 UTC 2012

On Sun, Jun 24, 2012 at 9:38 PM, Karen Coyle <kcoyle at kcoyle.net> wrote:

> The records themselves are not huge, at least not in the native MARC format
> -- they tend to be in the 1K-1.5K range, and by ignoring some of the more
> obscure data would be smaller - perhaps even half that amount. It's the
> sheer number of fields that I think becomes an issue, not so much their
> size. I would guess at an average of 10 fields per record, although that
> depends on how you break out the data (e.g. if published date is considered
> a separate field or part of the same string with the publisher name).

Publishing the whole data dump should not be a big technical problem
(apart from reserving disk space for it).

It would not be much larger than the Billion Triple Challenge 2011
dataset ( http://km.aifb.kit.edu/projects/btc-2011/ ), which consists of
~2bn triples and is ~20 GB when archived (for users' convenience it is
divided into ~200 gzipped N-Triples files).
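
To put those figures in per-file terms, here is a rough back-of-the-envelope
sketch. It uses only the approximate numbers quoted above and assumes
(hypothetically) that the triples are spread evenly across the files; the
function and constant names are made up for illustration.

```python
# Approximate figures quoted in the message above.
TOTAL_TRIPLES = 2_000_000_000   # ~2bn triples
ARCHIVE_BYTES = 20 * 10**9      # ~20 GB when archived
FILE_COUNT = 200                # ~200 gzipped files


def per_file_estimates(total_triples, archive_bytes, file_count):
    """Rough per-file numbers, assuming an even split across files."""
    return {
        "triples_per_file": total_triples // file_count,
        "bytes_per_file": archive_bytes // file_count,
        "compressed_bytes_per_triple": archive_bytes / total_triples,
    }


if __name__ == "__main__":
    for name, value in per_file_estimates(
        TOTAL_TRIPLES, ARCHIVE_BYTES, FILE_COUNT
    ).items():
        print(name, value)
```

Under those assumptions each file holds on the order of 10 million triples
in about 100 MB, which is a comfortable unit of work for a single machine.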

The ability to process such a large dataset is another question. If the
full dump is available, then other volunteers could "digest" the
dataset and provide fellow open data users with smaller, easier-to-use
subsets. This slicing can be done on a regular home computer,
provided that the N-Triples files are processed "on-the-fly" (streamed
line by line rather than loaded whole). Thus OCLC would shift the work
of slicing the dataset into the hands of volunteers and would only need
to provide the raw data.
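
As an illustration of what "on-the-fly" slicing could look like, here is a
minimal sketch in Python. It assumes the dump is a gzipped N-Triples file
with one triple per line, and it keeps only the triples with a given
predicate. The function names, file names and the example predicate are
hypothetical, and the line splitting is a simplification, not a full
N-Triples parser.

```python
import gzip


def iter_matching_lines(lines, predicate_iri):
    """Yield N-Triples lines whose predicate equals predicate_iri."""
    needle = "<" + predicate_iri + ">"
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        # Subject and predicate contain no whitespace; the object may,
        # so split off only the first two terms.
        parts = stripped.split(None, 2)
        if len(parts) == 3 and parts[1] == needle:
            yield line if line.endswith("\n") else line + "\n"


def slice_dump(src_path, dst_path, predicate_iri):
    """Stream src_path line by line and write matching triples to dst_path.

    Only one line is held in memory at a time, so the size of the dump
    is limited by disk space, not by RAM. Returns the number of triples kept.
    """
    kept = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
            gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in iter_matching_lines(src, predicate_iri):
            dst.write(line)
            kept += 1
    return kept
```

A call such as slice_dump("part-001.nt.gz", "titles.nt.gz",
"http://purl.org/dc/terms/title") would then produce a small "titles only"
subset from one file of the dump; running it over every file in turn gives
a slice of the whole dataset.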

Others could provide SPARQL interfaces [to the full dataset or slices
of it], though that would require a lot more resources. That could
probably be done by research centres or companies that develop RDF
data stores.

Technical questions that remain are:
1) converting data into a format that is convenient for data dumps
2) providing updates (dumps with new/changed records?)

P.S. It looks like N-Triples is the format of choice for RDF data dumps.

Uldis