[open-bibliography] Post about openbiblio data from Finland's Vaski consortia
Karen Coyle
lists at kcoyle.net
Fri Oct 14 16:37:52 UTC 2011
Quoting Jim Pitman <pitman at stat.Berkeley.EDU>:
>> There's been a short discussion on the list for the Digital Public
>> Library of America about the fact that there is no reliable provenance
>> in CC licenses. They at least need to be digitally signed. So this
>> "who" question is inherent in CC.
>
> Yes, this is an interesting discussion on DPLA. Years ago Nelson
> Beebe provided checksums on his BibTeX datasets for just this reason.
I've used checksums on datasets to make sure that the transmission was
correct, but that's a transitory (pun?) use. The CC question is about
the signature of an agent on the CC license, so that you know who
asserts the license terms. That signature is not a substitute for a
checksum, which would probably need to be part of the license
signature. It must identify the agent.
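To make the distinction concrete, here is a minimal sketch in Python
(standard library only). It is an illustration under stated
assumptions, not anything CC or DPLA has specified: the function
names, the field names, and the "sorted keys, compact JSON" canonical
form are all hypothetical, and HMAC with a shared key is only a
stand-in for a real public-key signature, which the standard library
does not provide.

```python
import hashlib
import hmac
import json


def canonical_bytes(obj):
    # Hypothetical canonical form: sorted keys, compact separators, UTF-8.
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def dataset_checksum(records):
    # A checksum only says "these bytes are intact"; it names no agent.
    return hashlib.sha256(canonical_bytes(records)).hexdigest()


def sign_license(agent, license_url, records, key):
    # The assertion names the agent and embeds the dataset checksum,
    # so the checksum becomes part of what is signed.
    assertion = {
        "agent": agent,
        "license": license_url,
        "dataset_sha256": dataset_checksum(records),
    }
    # Assumption: HMAC stands in for a public-key signature here.
    tag = hmac.new(key, canonical_bytes(assertion), hashlib.sha256).hexdigest()
    return {"assertion": assertion, "signature": tag}


def verify_license(signed, records, key):
    # Check both the signature over the assertion and the dataset checksum.
    assertion = signed["assertion"]
    expected = hmac.new(
        key, canonical_bytes(assertion), hashlib.sha256
    ).hexdigest()
    signature_ok = hmac.compare_digest(expected, signed["signature"])
    data_ok = assertion["dataset_sha256"] == dataset_checksum(records)
    return signature_ok and data_ok
```

In this sketch the checksum alone would pass for any copy of the same
bytes, whoever published it; only the signed assertion ties the
license terms to a named agent. Note also that the check fails as
soon as the records are changed in any way, which is why it fits
poorly with data that is meant to be re-used and mashed up.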
Because it is expected that bibliographic data will be re-used in
innumerable ways, mashed-up, etc., I don't think that checksums on
datasets will be of much use. Ideally, the W3C work on provenance and
versioning for statements will come to fruition. Still, my feeling is
that we will lose track of who said what except in a general wiki-like
way, and we'll be dependent on the 'wisdom of crowds.'
> BTW, Mark MacGillivray and I have agreed
> that until some better consensus emerges from the openbiblio
> community, for purposes of BibJSON dev we are using the words
> "dataset" and "collection" interchangeably. This Finnish deposit
> exemplifies what we mean by either term. We are open to
> suggestion about how to distinguish the terms "collection" and
> "dataset" for purposes of BibJSON/BibSoup.
> For reasons I have not yet understood, the term "collection" seems
> to set off alarm bells which "dataset" does not.
In libraries and archives, "collection" implies a conscious (human)
collector, active curation, and a particular goal of completeness or
at least defined boundaries. "Dataset" is neutral. Collections can be
datasets, but many datasets are not collections. All of the records
from a library may not be considered a collection, since there is a
point where libraries cannot carefully curate the whole. All of the
records from LibraryThing would not be a collection, but the library
of an individual LT user would be, since it is curated.
It's like the difference between your personal address book and the
phone company's phone book. The former would be a collection, the
latter is just a bunch of data. (Interestingly, this distinction is
similar to that used in US law regarding the copyright of data.)
Often, data that is gathered from different sources and put into a
combined database loses the integrity of the individual input
datasets, or at least loses that as an organizing principle. Archives
keep collections separate precisely to prevent this from happening.
Thus archives do not get "de-duped" because the individual items in
their individual collections must be maintained for the integrity of
the collection, even if there are copies in multiple collections.
So THAT'S why that term sets off alarms when you are dealing with
library and archive folks.
kc
>
> It would be great if OKFN could promote some simple form of digital
> signatures for open biblio records and datasets. This should also
> encourage those wishing to make improvements of large open datasets
> to do so by publishing diffs or increments. This should hopefully
> reduce the problem of duplication of records, and make us welcome
> and encourage copying of records rather than fearing it.
>
> Digital signatures raise issues about what is the canonical form of
> a structured text dataset, be it encoded as BibTeX or XML or JSON or
> whatever. If we are going to recommend checksums on canonical forms,
> we should ensure the canonical form has desirable technical
> properties, like the metadata and license being in predictable places
> near the top of the file, from which they could easily be extracted
> as standalone and equally valid metadata-only records separated from
> the data. This is especially important if the data is in a big file.
>
> --Jim
>
> ----------------------------------------------
> Jim Pitman
> Professor of Statistics and Mathematics
> University of California
> 367 Evans Hall # 3860
> Berkeley, CA 94720-3860
>
> ph: 510-642-9970 fax: 510-642-7892
> e-mail: pitman at stat.berkeley.edu
> URL: http://www.stat.berkeley.edu/users/pitman
--
Karen Coyle
kcoyle at kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet