[openbiblio-dev] [bibserver] changing handling of collections from frontend to get them working properly again now that collections are stored as separate objects (937d3e2)

Jim Pitman pitman at stat.Berkeley.EDU
Sun Oct 9 16:49:06 UTC 2011


Karen Coyle <kcoyle at kcoyle.net> wrote:

> Quoting Rufus Pollock <rufus.pollock at okfn.org>:
>
> > I wonder if we are misusing "collection" here. For me a collection is
> > just like a bibliography, or even more simply, a bunch of works /
> > records I want to collect together.
> >
> > Collection in the sense of "all issues of this journal or all articles
> > in this journal", while they could be represented as collections, might
> > better be represented in their own right.
>
> I, too, am somewhat flummoxed by the emphasis on collections here. If a
> collection is just any group of metadata from a single source, then
> it's not a terribly meaningful or useful grouping.

If by "any group of metadata from a single source" you mean any arbitrary subset of data from a single
source, I agree. But sources know which subsets are meaningful to them, and should be allowed and
encouraged to provide metadata at whatever levels they care about, e.g. publisher/journal/vol/issue.

> If a collection is  a set of metadata that has been chosen for some purpose, then it is  
> closer to my concept of collection. 

Yes, mine too.

> In the end, however, there will be  metadata records for resources, and those records may be found in  
> multiple collections. Will the collections be "undone" in the  database, or will the records retain their existence in a collection?  

Not sure I follow the question, but if some contributor has bothered to create a metadata record for some collection, it should
be there in the database. Other contributors might make duplicate records, but this is no different in principle from deduping any other
kind of record, and I don't see it as a big problem. There is very little ROI from replicating whole collections and their meta
records unless there is some enhancement of the data.

> How will the database handle items that are in multiple collections?  

I don't know, but I don't think it's much of an issue for the DB if we use a NoSQL model. It's more an issue for deduping, indexing and search, which
can be incrementally improved.

> And, as Rufus asks, what is the use of collections to users?

Some indications are given above. Also, it is not just users: data providers are potential funders in a sustainable model for BibSoup,
and we need to respect their collections. If a funder cares about a collection, BibSoup should have the capability of easily tagging all items
in that collection and making views over it.
Allowing tags on doc-level items, and providing taggers with a way to make meta records for the collections they tag,
seems a very simple way to support collections. Compare http://delicious.com/
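
A minimal sketch of the tagging idea above, in Python. The field names (`id`, `tags`, `tag`) and the two helper functions are hypothetical, chosen only for illustration; they are not taken from BibJSON or from the BibServer code.

```python
from collections import defaultdict

# Hypothetical doc-level records. Field names are illustrative assumptions,
# not taken from BibJSON or BibServer.
records = [
    {"id": "r1", "title": "Example obituary", "tags": ["ims", "ims-obituaries"]},
    {"id": "r2", "title": "Example review article", "tags": ["ims", "ims-reviews"]},
    {"id": "r3", "title": "Unrelated article", "tags": []},
]

# Hypothetical meta record a tagger creates to describe the collection they tag.
collection = {
    "id": "c1",
    "type": "collection",
    "title": "All obituaries in IMS journals",
    "tag": "ims-obituaries",
}


def build_tag_index(records):
    """Map each tag to the ids of the records carrying it."""
    index = defaultdict(list)
    for rec in records:
        for tag in rec.get("tags", []):
            index[tag].append(rec["id"])
    return dict(index)


def collection_view(collection, records):
    """Return the records that belong to a collection, i.e. carry its tag."""
    return [rec for rec in records if collection["tag"] in rec.get("tags", [])]


tag_index = build_tag_index(records)
obituaries = collection_view(collection, records)
```

In this sketch a record can sit in several collections at once (`r1` carries two tags) without being copied, and the collection's meta record is just another record that can be deduped like any other.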

> >> publisher/journal/volume/issue
> This may be one logical way to store some data, but hierarchy tends to  
> constrain potential services. You do NOT want to have to know the  
> publisher in order to find the journal, obviously.

Agreed! 

> In addition, there are publishers and there are publishers. Some are  
> professional organizations like ACM, but others are mere corporations,  
> like Elsevier or Nature. There will be folks who have published in  
> Time Magazine or the New Yorker, and you don't want to exclude that.

Sure. It is work to annotate articles by publisher, and this is not required. But we may get more support from
some publishers if we allow all of their items to be easily found in BibSoup. This applies mainly to smaller publishers.
Bigger publishers may not want this, and may even try to prevent it, as they would rather users went to their own
indexing services.

> I would tend to record this information, probably with a different  
> data element for professional or governmental organizations as  
> publishers or sponsors, but not use it as a way to organize the data.  

Agreed, in general, for reasons above. But it should not be prevented, also for reasons above.
The main point: adopting a general notion of collection makes it dead easy to have collections for some
publishers, to the extent they support or care about it, and not for others. I am working on behalf
of a particular small publisher, IMS, which I am sure will want to be able to see its own collections.

> Across different communities there are just too many different  
> relationships of bodies to publications to make something like this  work. Think broadly about the world of publication.

I agree, we should not *require* organization by publisher. But it should be easily *allowed* when desired,
and that is a primary use case for collections.

> >> Each level in the publisher/journal/volume/issue hierarchy demands  
> >> a metadata record, with attributes depending on level in the  
> >> hierarchy.  It should be
> >> required that each such metadata record either contains the list of  
> >> at least identifiers and human-readable titles of its children, or
> >> provides a pointer to such a list as a  BibJSON dataset, or both.
> > Right, this sounds pretty complex :-)
> Oy! Let's not make requirements that will discourage or even prevent  input. 

Point taken, but for the complete journal run use case, the data is right there when
you script out of the existing containers and into BibSoup, so why not use it?
It would be a really bad idea to lose it.
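
A rough sketch of what such per-level records could look like when the child data is available, in Python. Everything here is an assumption for illustration: the function `make_level_record`, the field names `children` and `children_url`, and the example.org URL are hypothetical. The child list is optional in this sketch, so that a missing list does not block input.

```python
def make_level_record(level, identifier, title, children=None, children_url=None):
    """Build a metadata record for one level (publisher, journal, volume, issue).

    `children` is an optional list of (id, title) pairs; `children_url` is an
    optional pointer to a dataset listing the children. Either, both, or
    neither may be given.
    """
    record = {"type": level, "id": identifier, "title": title}
    if children is not None:
        record["children"] = [{"id": cid, "title": ctitle} for cid, ctitle in children]
    if children_url is not None:
        record["children_url"] = children_url
    return record


issue = make_level_record(
    "issue", "jx-v1-i1", "Journal X, vol. 1, issue 1",
    children=[("a1", "First article"), ("a2", "Second article")],
)
volume = make_level_record(
    "volume", "jx-v1", "Journal X, vol. 1",
    children=[("jx-v1-i1", issue["title"])],
)
journal = make_level_record(
    "journal", "jx", "Journal X",
    children_url="http://example.org/jx/volumes.json",  # hypothetical URL
)

# Records are stored flat, keyed by id, so a journal can be looked up
# directly without knowing its publisher.
by_id = {rec["id"]: rec for rec in (issue, volume, journal)}
```

Keeping the records flat and linking levels only through child identifiers preserves the hierarchy data from the source without forcing it on anyone as the way to organize or search the database.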

> What information do people actually have on hand when they are creating the metadata? (And how accurate is it? Probably not even 95%)

It clearly varies greatly, but for my use cases and for most journal listings the metadata already exists, and I would guess it is
more like 99% accurate for elements of real importance (like whether a link resolves to get you to the abstract or full text of the item).
Minor errors in titles or authors might be a bit more common.

> BTW, if you want to create records for each level, there are library  records that contain only the publishing pattern for each journal that  
> has been cataloged. Those pattern records can be used to create a full  
> set of journal/volume/issue, but I have to warn you that there are  
> more levels than volume/issue -- the library data allows for 6 (!)  
> such levels, but they are fully defined, with their display components  
> (part, number, season, date, whatever) if you want. (I'll try to find  
> where these records are... they're kind of background data for the  
> issue predictor systems that allow libraries to know if they've missed  an issue.)

Very useful and relevant. One concern may be our rights to use any of this info. The matter of avoiding
missing issues is of particular importance to journals which have ad hoc special issues which are easily
lost. Often these are interesting and important issues, and they are poorly indexed. Something like
"All special volumes produced by the Applied Probability Trust" is the sort of collection of great interest to
me which is abysmally indexed by current A&I services.
Or "All conversations in Statistical Science" or "All obituaries in IMS Journals" or "All review articles in IMS Journals":
these are all interesting collections which we should make an effort to accommodate.
This does not mean that we have to support such collections across all journals. It just means that we
should allow users and data providers to indicate such collections, and we should make it easy, not hard, for them to do so.

--Jim



