[open-bibliography] CUL dataset release

Tue Oct 19 07:15:28 UTC 2010

Jim,
I think this is typical of many projects where data is released into the
Open. The key thing is that it has been Opened and that can never be undone.

As an example, in March the UK released a lot of Open data as RDF under
data.gov.uk.  There was a competition to see who could build the best
mashup. Jim Downing put all his skills into it - not just in RDF but also in
working with government - and could find almost nothing that made immediate
sense. We managed to work out there were 4218.657 cows in Northumberland. We
never found out where the 0.43 cow went.

Effectively the RDF was a dump of database tables without the schemas.

In the following case the key thing is that the data have been released. Ben
O'Steen will be able to give you a better idea as he is both an expert in
RDF and also in bibliography. But my guess is that it's a dump of part of a
proprietary library management system which is impenetrable and probably
designed to create lock-in. I also guess that they have done a dump and have
relatively little control over what comes out. Also they have to make sure
that the data is Open.

On Tue, Oct 19, 2010 at 5:15 AM, Jim Pitman <pitman at stat.berkeley.edu> wrote:

> Can someone please provide a description of this dataset? Some idea
> of what range of years and subjects? Or how this dataset was collected or
> conceived?
> Of course it's nice to see any dataset in PDDL. But it is not much of a
> service to release hundreds of thousands of records with no indications of
> what's in there.
>

Well yes it is. Firstly it has to be converted to RDF. That will give us a
lot of experience in translating MARC or whatever variant it's in. Secondly
we should be able to determine what it is by the content. I am increasingly
unconvinced of the value of human cataloguing - machines do as good a job.
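
As a rough illustration of working out what a dump contains from its
content, here is a minimal sketch in Python. It assumes the records have
already been parsed into dictionaries; the "date" and "subjects" keys, the
profile() helper and the sample records are all hypothetical, since the real
layout of the dump is not known.

```python
import re
from collections import Counter

# Hypothetical record shape: a dict with a free-text "date" string and a
# list of "subjects". The real dump's layout is unknown.
# Match a four-digit year (1000-2099) that is not part of a longer number.
YEAR = re.compile(r"(?<!\d)(1\d{3}|20\d{2})(?!\d)")

def profile(records):
    """Summarise the year range and most common subjects of parsed records."""
    years = []
    subjects = Counter()
    for rec in records:
        match = YEAR.search(rec.get("date", ""))
        if match:
            years.append(int(match.group(1)))
        # Normalise subject strings so trivial variants are counted together.
        subjects.update(s.strip().lower() for s in rec.get("subjects", []))
    return {
        "records": len(records),
        "dated": len(years),
        "year_range": (min(years), max(years)) if years else None,
        "top_subjects": subjects.most_common(5),
    }

# Made-up sample records, purely for illustration.
sample = [
    {"date": "c1987.", "subjects": ["Statistics", "Probabilities"]},
    {"date": "[2003]", "subjects": ["Statistics"]},
    {"date": "n.d.", "subjects": []},
]
summary = profile(sample)
```

Even a crude count like this would go some way towards answering "what
range of years and subjects?" without anyone having to write a description
by hand.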

> Is it expected that everyone on the list is going to jump in and see if
> there's anything there
> they care about?

I'd be delighted to see this. It would be wonderful. And if we are really
really focussed we could report the preliminary results at RLUK2010 in a
month.

>  I'd like to see a higher standard of dataset description before dataset
> release announcements on this list.
>

That's neither fair nor possible. The data have been produced by CUL as a
gift to the project. They have no dedicated finance for this and their work
to extract and liberate the data is much appreciated. The alternative
would be to:
* prepare a case for support for a 2-year release proposal.
* get funded (improbable but who knows)
* wait two years.

So we are two years ahead. We can either regard this dataset as imperfect or
we can exult at having got such a lot to work with. I take the latter view.
If you don't feel happy looking at imperfect data, shut your eyes for a few
months and see what we come up with.

>
> Or maybe we need some preliminary stage inviting volunteers to provide
> dataset descriptions if
> dataset providers are unwilling or unable to do so.
>

Please don't say "unwilling". They're not. They just don't have any spare
resource.

I think that volunteers are absolutely the key.  I think that if we create
the right types of volunteer community then we can make the world's
bibliographic information Open within 2 years maximum.

We would be delighted for you to be involved. But the mental approach has
elements of uncovering an archeological site. Not completely, but my guess
is that the library management systems are closed books and we have to
decipher what comes out.

And, in any case, suppose the collection was ALL the books in the library.
Wouldn't that be a complete description?

P.

>
> many thanks
>
> --Jim
> ----------------------------------------------
> Jim Pitman
> Director, Bibliographic Knowledge Network Project
> http://www.bibkn.org/
>
> Professor of Statistics and Mathematics
> University of California
> 367 Evans Hall # 3860
> Berkeley, CA 94720-3860
>
> ph: 510-642-9970  fax: 510-642-7892
> e-mail: pitman at stat.berkeley.edu
> URL: http://www.stat.berkeley.edu/users/pitman
>
>
> _______________________________________________
> open-bibliography mailing list
> open-bibliography at lists.okfn.org
> http://lists.okfn.org/mailman/listinfo/open-bibliography
>

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069