[open-bibliography] CUL dataset release
Karen Coyle
kcoyle at kcoyle.net
Tue Oct 19 21:15:24 UTC 2010
I agree with Jim that a small amount of information would make
datasets like this more useful. It's like we need a "Dublin Core for
data sets." I happen to be at this year's DC meeting, which starts
tomorrow, and I know that there are folks here who are interested in
metadata for data sets. I'll keep an ear to the ground to see if I
can find out who is in that conversation.
The other thing that concerns me is updating of data sets. Some data
is useful even without updates (especially if more data may come from
different sources), but some data really needs to be updated to be
useful (e.g. price data, ownership data, rights data). That gets us
into the 'provenance' question that the W3C is looking at, but also into
versioning of data (so you know if it's from yesterday or 5 years ago).
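As a strawman, here's roughly what I have in mind: a handful of Dublin
Core properties attached to the dataset itself, plus a modified date to
cover the versioning question. This sketch uses the rdflib Python
library; every URI and value in it is made up for illustration, not
taken from the CUL release.

from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC, DCTERMS

g = Graph()
# Placeholder URI for the dataset; not a real identifier.
ds = URIRef("http://example.org/datasets/cul-marc-dump")

g.add((ds, DC.title, Literal("CUL bibliographic records (MARC dump)")))
g.add((ds, DC.publisher, Literal("Cambridge University Library")))
g.add((ds, DC.description, Literal("Catalogue records; coverage and date range to be documented")))
g.add((ds, DCTERMS.license, URIRef("http://www.opendatacommons.org/licenses/pddl/")))
# A modified date tells users whether the data is from yesterday or 5 years ago.
g.add((ds, DCTERMS.modified, Literal("2010-10-19")))

print(g.serialize(format="turtle"))

Even a handful of fields like these would answer the "what's in here?"
and "how fresh is it?" questions.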
This doesn't mean that it's not a good idea for people to put their
data out there even if these details haven't been worked out. We need
large amounts of data to experiment with this whole linked open data
concept. Soon, though, I hope we'll get far enough along that we'll
want to resolve some of these other issues and make the data even
more useful. We can have the discussion today even if we can't
resolve the issues for a while.
kc
Quoting Peter Murray-Rust <pm286 at cam.ac.uk>:
> Jim,
> I think this is typical of many projects where data is released into the
> Open. The key thing is that it has been Opened and that can never be undone.
>
> As an example, in March the UK released a lot of Open RDF data under
> data.gov.uk. There was a competition to see who could build the best
> mashup. Jim Downing brought all his skills to bear - not just in RDF but
> also in working with government - and could find almost nothing that made
> immediate sense. We
> managed to work out there were 4218.657 cows in Northumberland. We never
> found out where the 0.43 cow went.
>
> Effectively the RDF was a dump of database tables without the schemas.
>
> In the present case the key thing is that the data have been released. Ben
> O'Steen will be able to give you a better idea, as he is an expert in both
> RDF and bibliography. But my guess is that it's a dump of part of a
> proprietary library management system which is impenetrable and probably
> designed to create lock-in. I also guess that they have done a one-off dump
> and have relatively little control over what comes out. They also have to
> make sure that the data is Open.
>
> On Tue, Oct 19, 2010 at 5:15 AM, Jim Pitman <pitman at stat.berkeley.edu>wrote:
>
>> Can someone please provide a description of this dataset? Some idea
>> of what range of years and subjects? Or how this dataset was collected or
>> conceived?
>> Of course it's nice to see any dataset in PDDL. But it is not much of a
>> service to release hundreds of thousands of records with no indication of
>> what's in there.
>>
> Well yes it is. Firstly it has to be converted to RDF. That will give us a
> lot of experience in translating MARC or whatever variant it's in. Secondly
> we should be able to determine what it is from the content. I am increasingly
> unconvinced of the value of human cataloguing - machines do as good a job.
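>
> A minimal sketch of that first conversion step, assuming the dump is
> binary MARC21 and using the pymarc and rdflib Python libraries. The
> file name, record URIs, and choice of fields are placeholders, not
> details of the CUL release:
>
> from pymarc import MARCReader
> from rdflib import Graph, Literal, URIRef
> from rdflib.namespace import DC
>
> g = Graph()
> # Hypothetical file name; the record URIs below are placeholders too.
> with open("cul_records.mrc", "rb") as f:
>     for i, record in enumerate(MARCReader(f)):
>         subject = URIRef("http://example.org/record/%d" % i)
>         for field in record.get_fields("245"):  # MARC title statement
>             for title in field.get_subfields("a"):
>                 g.add((subject, DC.title, Literal(title)))
>         for field in record.get_fields("100"):  # main entry - personal name
>             for name in field.get_subfields("a"):
>                 g.add((subject, DC.creator, Literal(name)))
>
> print(g.serialize(format="turtle"))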
>
>> Is it expected that everyone on the list is going to jump in and see if
>> there's anything there
>> they care about?
>
>
> I'd be delighted to see this. It would be wonderful. And if we are really
> really focussed we could report the preliminary results at RLUK2010 in a
> month.
>
>
>> I'd like to see a higher standard of dataset description before dataset
>> release announcements on this list.
>>
>
> That's neither fair nor possible. The data have been produced by CUL as a
> gift to the project. They have no dedicated finance for this, and their work
> to extract and liberate the data is much appreciated. The alternative
> would be to:
> * prepare a case for support for a 2-year release proposal.
> * get funded (improbable but who knows)
> * wait two years.
>
> So we are two years ahead. We can either regard this dataset as imperfect or
> we can exult at having got such a lot to work with. I take the latter view.
> If you don't feel happy looking at imperfect data, shut your eyes for a few
> months and see what we come up with.
>
>>
>> Or maybe we need some preliminary stage inviting volunteers to provide
>> dataset descriptions if
>> dataset providers are unwilling or unable to do so.
>>
>
> Please don't say "unwilling". They're not. They just don't have any spare
> resource.
>
> I think that volunteers are absolutely the key. I think that if we create
> the right type of volunteer community then we can make the world's
> bibliographic information Open within 2 years maximum.
>
> We would be delighted for you to be involved. But the mental approach has
> elements of uncovering an archaeological site. Not completely, but my guess
> is that the library management systems are closed books and we have to
> decipher what comes out.
>
> And, in any case, suppose the collection was ALL the books in the library.
> Wouldn't that be a complete description?
>
> P.
>
>
>>
>> many thanks
>>
>> --Jim
>> ----------------------------------------------
>> Jim Pitman
>> Director, Bibliographic Knowledge Network Project
>> http://www.bibkn.org/
>>
>> Professor of Statistics and Mathematics
>> University of California
>> 367 Evans Hall # 3860
>> Berkeley, CA 94720-3860
>>
>> ph: 510-642-9970 fax: 510-642-7892
>> e-mail: pitman at stat.berkeley.edu
>> URL: http://www.stat.berkeley.edu/users/pitman
>>
>>
>> _______________________________________________
>> open-bibliography mailing list
>> open-bibliography at lists.okfn.org
>> http://lists.okfn.org/mailman/listinfo/open-bibliography
>>
>
>
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dept. of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
>
--
Karen Coyle
kcoyle at kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet