[okfn-discuss] Guardian article on OCLC and bibliographic data

Thu Jan 22 01:11:00 UTC 2009

Including comments from Aaron Swartz, Karen Coyle and Rob Styles.

J.

## Why you can't find a library book in your search engine
## Wendy M Grossman
## The Guardian, Thursday 22 January 2009
## Finding a book at your local library should just involve a simple web search. But thanks to a US cataloguing site, that is far from the case

Despite the internet's origins as an academic network, when it comes
to finding a book, e-commerce rules. Put any book title into your
favourite search engine, and the hits will be dominated by commercial
sites run by retailers, publishers, even authors. But even with your
postcode, you won't find the nearest library where you can borrow that
book. (The exception is Google Books, and even that is limited.)

That's strange, because almost every library has an electronic
database of its books - searchable either at the library's own website
or via its local council. The wrinkle is that at the book level, those
databases aren't accessible to the search engines; and you may not be
able to search all the libraries in your area at once.

Bibliographic data

Yet there is an alternative that few people seem aware of: WorldCat
(worldcat.org), which offers web access to the largest repository of
bibliographic data in the world - from the 40-year-old Ohio-based
non-profit Online Computer Library Center (oclc.org). But WorldCat
suffers from the same problem on a larger scale. OCLC shares only 3m
of its 125m records with Google Books; none of them show up in an
ordinary search.

You might expect forward-thinking libraries to put their databases
online, to encourage people through their doors. But they can't. Even
though they created the data, pay to have records added to the
database and pay to download them, they can't.

In November, OCLC announced new rules covering the use of WorldCat
data, due to go live on 19 February. Among other things, the new policy
prohibits any use - transfer, sharing - that "substantially replicates
the function, purpose, and/or size of WorldCat". In other words, no
publicly searchable databases.

"It's safe to say that the policy change is a direct response to Open
Library," says Aaron Swartz, the founder of Open Library
(openlibrary.org), a project to give every published book its own
Wikipedia-style page. "Since the beginning of Open Library, OCLC has
been threatening funders, pressuring libraries not to work with us,
and using tricks to try to shut us down. It didn't work - and so now
this."

Open Library is one of several projects aiming to bring book data into
the internet age. LibraryThing (librarything.com), for example, lets
users share the contents of their libraries; if you and I have
favourite books in common, maybe the other books you have are ones I'd
like. Under OCLC's new policy, would libraries be unable to share
their data with these projects?

Karen Calhoun, the vice-president of OCLC WorldCat and metadata
services, believes it's important for OCLC - whose annual revenues, as
of June 2008, were $246m (£175m), and which in recent years has bought
several smaller commercial competitors in Europe - to be the only big
kid on the block, and to ensure that "the WorldCat commons is not
exhausted through over-exploitation. Protecting the commons means
adopting 'some rights reserved' as the data-sharing model."

Over-exploitation, she says, would be "to have lots of these stores in
different places on the web that disperse the information and we don't
have a way to connect it all back up again".

Besides, Calhoun adds: "Trying to operate on web scale on behalf of
libraries really does take a businesslike approach." Local libraries,
she says, are too small to do their own negotiating.

Yet millions of website owners and bloggers do not negotiate with
Google to have their sites crawled and available on results pages.
Open Library's 1m records have open APIs and are available for
download as a single data dump. There is even a plug-in for WordPress
that lets bloggers automatically integrate a link to the Open Library
page of any book mentioned.

"The library world is set up on this model where the library is a
physical building and has a number of books and serves a geographical
community," says Swartz. "Our model is find the book you're interested
in and give you the metadata - and then find the best way to get it to
you."

In the politely acrimonious debate that has followed OCLC's
announcement, WorldCat's copyright status was raised. In the US,
collections of facts don't get copyright protection. In 1998 the EU
created "database right" - but individual records can't be
copyrighted. Those suspicious that OCLC is attempting a power grab
believe uncertainty over copyright law may be behind the new policy:
if OCLC can't rely on intellectual property law, a contract - the new
policy - is its only choice.

Calhoun says OCLC's legal department is still researching the
copyright question, explaining that courts have in the past considered
"sweat of the brow": creating a bibliographic record, she says,
requires intellectual effort and judgments by trained personnel.

Changing world

Richard Wallis, a technology evangelist at Talis, which competes with
OCLC in interlibrary lending systems in Europe, thinks OCLC's main
problem is that it has not kept pace with the changing world.

"They're still stuck in the wrong business model," he says. "It was
expensive, 20 or 30 years ago, to set up a large dataset and
communications, editing, storing backup tapes, and so on." By now,
though, "a lot of the things that made it difficult are negligible
costs". Talis, he says, focuses on selling services, not access to
data.

Enough people have protested for OCLC to convene a review board and
delay the planned 19 February implementation. However, few expect a
change of heart.

What we don't know - because we've never had the data to experiment
with - is what opportunities we're being denied. The National Library
of Sweden has put its entire catalogue on the web as linked data, the
first effort by a national library to become part of the semantic web.
It should have been the second: US Library of Congress staffer Ed
Summers was told to take down his similar experiment in December.

Karen Coyle, a librarian and consultant on digital libraries, says:
"If library records were open access on the web, it would be possible
to create bibliographies that go beyond the holdings of any one
library."

She points to Kosovo, where libraries have been destroyed in
generations of conflict. Open records, she says, "could create a
virtual library of books published in that geographical region, which
would allow scholars to study the literature and history of that area
in a way that isn't possible today with our separate, physical
libraries."

Rob Styles, a programme manager for Talis's data services,
says: "The main reason I think libraries need freedom to innovate is
because we don't know what they're going to look like."