[open-bibliography] CUL dataset release

Karen Coyle kcoyle at kcoyle.net
Tue Oct 19 21:47:38 UTC 2010


Quoting Jim Pitman <pitman at stat.Berkeley.EDU>:

> I find the OL terms of use statement
> http://www.archive.org/about/terms.php
> fairly opaque. OL used to have a statement that they regarded their
> data as public domain, but this seems to have been withdrawn. Maybe
> Karen could comment on this?

I don't know why the terms of use were changed, but it's clear that the
Internet Archive's terms come out of the mouths of lawyers whose job
is to keep the IA out of trouble (or at least to make sure the
trouble doesn't stick -- the Archive has been through a number of  
significant lawsuits). However, note that when you manually edit a  
record, you see this:

"By saving a change to this wiki, you agree that your contribution is  
given freely to the world under CC0. Yippee!"

Any site where data is compiled from a number of sources is going to  
find it difficult to make statements about licensing or public domain  
-- if you aren't the creator of the data, it is very unclear what  
rights you can assert. The long terms of use on the IA page are really
saying: "It's not our stuff, so we have nothing to say about its  
rights." In essence, OL could say the same -- they have simply  
gathered data from a number of sources, some of which did not give  
explicit permission.

This means that data sets like CUL have an advantage over OL because  
the library is able to make assertions about rights for records it  
created, purchased, or significantly modified.

And, BTW, it's actually a Good Thing that the CUL dataset has mainly  
records for odd-ball things that only they own -- these records are  
less likely to overlap with other datasets, and more likely to provide  
new data. We've got enough metadata for Harry Potter books; we should  
encourage libraries to contribute to the very long tail of published  
works.

kc



>
>> I would very much favour a wiki or similar for CKAN packages, where data
>> triage and characterisation can be discussed on a per-package basis.
>
> My strong support for this.
> To avoid duplication of effort it is important to have a central place
> where agents planning or conducting data triage and characterisations
> can report their plans and activities. Then we should try to encourage
> a culture in this group to post to that place, and to look there before
> embarking on such efforts.
>
> I also strongly agree with Peter that we should encourage dumps of open
> data first, and ask questions about quantity and quality later. The rate
> of dumps of library data is currently exceeding the capacity of this
> community to process them into usable forms. But that should not slow
> down the open dumping. I think that the process of data cleaning and
> enhancement by agents with domain-specific interests would be
> accelerated if some agents could do preliminary triage and
> characterisations and report on the results. Then other agents might be
> motivated to step in and improve the data further.
>
> --Jim
>
>
>>
>> On Mon, 2010-10-18 at 21:15 -0700, Jim Pitman wrote:
>> > Can someone please provide a description of this dataset? Some idea
>> > of what range of years and subjects? Or how this dataset was
>> > collected or conceived? Of course it's nice to see any dataset in
>> > PDDL. But it is not much of a service to release hundreds of
>> > thousands of records with no indications of what's in there.
>> >
>> > Is it expected that everyone on the list is going to jump in and see
>> > if there's anything there they care about? I'd like to see a higher
>> > standard of dataset description before dataset release announcements
>> > on this list.
>> >
>> > Or maybe we need some preliminary stage inviting volunteers to
>> > provide dataset descriptions if dataset providers are unwilling or
>> > unable to do so.
>> >
>> > many thanks
>> >
>> > --Jim
>> > ----------------------------------------------
>> > Jim Pitman
>> > Director, Bibliographic Knowledge Network Project
>> > http://www.bibkn.org/
>> >
>> > Professor of Statistics and Mathematics
>> > University of California
>> > 367 Evans Hall # 3860
>> > Berkeley, CA 94720-3860
>> >
>> > ph: 510-642-9970  fax: 510-642-7892
>> > e-mail: pitman at stat.berkeley.edu
>> > URL: http://www.stat.berkeley.edu/users/pitman
>> >
>> >
>> > _______________________________________________
>> > open-bibliography mailing list
>> > open-bibliography at lists.okfn.org
>> > http://lists.okfn.org/mailman/listinfo/open-bibliography
>>
>>
>>
>



-- 
Karen Coyle
kcoyle at kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet




