[open-bibliography] Post about openbiblio data from Finland's Vaski consortia

Fri Oct 14 16:37:52 UTC 2011

Quoting Jim Pitman <pitman at stat.Berkeley.EDU>:

>> There's been a short discussion on the list for the Digital Public
>> Library of America about the fact that there is no reliable provenance
>> in CC licenses. They at least need to be digitally signed. So this
>> "who" question is inherent in CC.
>
> Yes, this is an interesting discussion on DPLA. Years ago Nelson
> Beebe provided checksums on his BibTeX datasets for just this reason.

I've used checksums on datasets to make sure that the transmission was
correct, but that's a transitory (pun?) use. The CC question concerns
the signature of an agent on the CC license, so that you know who
asserts the license terms. The signature is not a substitute for a
checksum, which would probably need to be part of the license
signature. The signature must identify the agent.
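
As a rough illustration of the distinction, the sketch below computes a
checksum over a dataset and then builds the statement an agent would
sign. Everything in it is an assumption made for illustration: the
function names, the field names "agent", "license" and "checksum", the
sample record, and the choice of sorted-key JSON as the canonical form.
The signing step itself is deliberately left out, since it needs a
public-key library outside the Python standard library.

```python
import hashlib
import json


def canonical_bytes(records):
    """Serialize records deterministically (assumed canonical form):
    sorted keys, compact separators, UTF-8."""
    text = json.dumps(records, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
    return text.encode("utf-8")


def dataset_checksum(records):
    """SHA-256 over the canonical form; detects any change to the data."""
    return hashlib.sha256(canonical_bytes(records)).hexdigest()


def license_assertion(records, agent, license_url):
    """Hypothetical statement to be signed: who asserts which license
    over exactly which data. The checksum alone says nothing about who."""
    return {
        "agent": agent,
        "license": license_url,
        "checksum": {"algorithm": "sha256",
                     "value": dataset_checksum(records)},
    }


# Made-up sample record and agent, for illustration only.
records = [{"title": "Example Title", "author": "Example, A."}]
assertion = license_assertion(
    records,
    "mailto:agent@example.org",
    "http://creativecommons.org/publicdomain/zero/1.0/",
)

# Key order does not matter once the form is canonical.
reordered = [{"author": "Example, A.", "title": "Example Title"}]
assert dataset_checksum(reordered) == assertion["checksum"]["value"]
```

The checksum ties the assertion to one exact dataset; only a signature
over the whole assertion would tie it to the agent.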

Because it is expected that bibliographic data will be re-used in  
innumerable ways, mashed-up, etc., I don't think that checksums on  
datasets will be of much use. Ideally, the W3C work on provenance and  
versioning for statements will come to fruition. Still, my feeling is  
that we will lose track of who said what except in a general wiki-like  
way, and we'll be dependent on the 'wisdom of crowds.'

BTW,
> Mark MacGillivray and I have agreed
> that until some better consensus emerges from the openbiblio  
> community, for purposes of BibJSON dev we are using the words
> "dataset" and "collection" interchangeably. This Finnish deposit
> exemplifies what we mean by either term. We are open to
> suggestion about how to distinguish the terms "collection" and  
> "dataset" for purposes of BibJSON/BibSoup.
> For reasons I have not yet understood, the term "collection" seems
> to set off alarm bells which "dataset" does not.

In libraries and archives, "collection" implies a conscious (human)
collector, active curation, and a particular goal of completeness or at
least boundary definitions. "Dataset" is neutral. Collections can be
datasets, but many datasets are not collections. All of the records  
from a library may not be considered a collection since there is a  
point where libraries cannot carefully curate the whole. All of the  
records from LibraryThing would not be a collection, but the library  
of an individual LT user would be since it is curated.

It's like the difference between your personal address book and the  
phone company's phone book. The former would be a collection, the  
latter is just a bunch of data. (Interestingly, this distinction is  
similar to that used in US law regarding the copyright of data.)

Often, data that is gathered from different sources and put into a  
combined database loses the integrity of the individual input  
datasets, or at least loses that as an organizing principle. Archives  
keep collections separate precisely to prevent this from happening.  
Thus archives do not get "de-duped" because the individual items in  
their individual collections must be maintained for the integrity of  
the collection, even if there are copies in multiple collections.
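
A small sketch of that difference, using made-up collection names and
item identifiers (all hypothetical): the merged view de-duplicates and
loses which collection each item came from, while the archive view
keeps every collection whole, copies included.

```python
from collections import defaultdict

# Hypothetical holdings: (collection name, item identifier).
holdings = [
    ("Smith papers", "pamphlet-17"),
    ("Jones papers", "pamphlet-17"),
    ("Jones papers", "letter-3"),
]

# Combined-database view: de-duplicate on the item identifier.
# Collection membership is no longer recoverable from the result.
merged = sorted({item for _, item in holdings})

# Archive view: each collection stays intact, even when an item
# also appears in another collection.
by_collection = defaultdict(list)
for collection, item in holdings:
    by_collection[collection].append(item)

print(merged)
print(dict(by_collection))
```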

So THAT'S why that term sets off alarms when you are dealing with  
library and archive folks.

kc

>
> It would be great if OKFN could promote some simple form of digital  
> signatures for open biblio records and datasets.
> This should also encourage those wishing to make improvements of  
> large open datasets to do so by publishing
> diffs or increments. This should hopefully reduce the problem of  
> duplication of records, and make us welcome and
> encourage copying of records rather than fearing it.
>
> Digital signatures raise issues about what is the canonical form of
> a structured text dataset, be it encoded as BibTeX or XML or JSON or
> whatever. If we are going to recommend checksums on canonical forms,
> we should ensure the canonical form has desirable technical
> properties, like the metadata and license being in predictable places
> near the top of the file, from which they could easily be extracted
> as standalone and equally valid metadata-only records separated from
> the data. This is especially important if the data is in a big file.
>
> --Jim
>
> ----------------------------------------------
> Jim Pitman
> Professor of Statistics and Mathematics
> University of California
> 367 Evans Hall # 3860
> Berkeley, CA 94720-3860
>
> ph: 510-642-9970  fax: 510-642-7892
> e-mail: pitman at stat.berkeley.edu
> URL: http://www.stat.berkeley.edu/users/pitman

-- 
Karen Coyle
kcoyle at kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet