[openbiblio-dev] [bibserver] changing handling of collections from frontend to get them working properly again now that collections are stored as separate objects (937d3e2)

Sun Oct 9 16:13:04 UTC 2011

WW wrote:
> >> Whether this is a special case two-level thing, or we allow users to build matrioshka dolls is another question.

JP wrote:

> > Collections MUST be allowed to have more than two levels. The most obvious example is for journal articles

Rufus Pollock <rufus.pollock at okfn.org> wrote:

> I wonder if we are misusing "collection" here. For me a collection is just like a bibliography, or even more simply, a bunch of works /
> records I want to collect together.

Indeed, so it is for me. The publisher/journal/volume/issue examples are exemplars of what I want to be able to access easily as
collections on BibSoup. Most of these collections already exist and have more or less stable URLs on the web, but their data is not
easily machine-readable and reusable. We should aim to correct that in BibSoup, but not at the expense of losing the integrity of
existing collections, especially those for which it is particularly simple to script placeholder metadata records for BibSoup.

> Collections in the sense of "all issues of this journal" or "all articles in this journal", while they could be represented as collections, might
> better be represented in their own right.

I think they are primarily collections, and for technical purposes should be treated as such. But they are collections of a special type, which we should recognize, and
which should be indicated in the collection metadata.
We should also accommodate other sorts of subject-specific or collector-specific collections, like "All articles in mathematics" according to some agent's judgement/opinion,
or all articles indexed by some service, or all articles in mathematics that Jim Pitman puts in a personal collection. These are all collections, and we should
have a simple, flexible way of specifying that.

> > publisher/journal/volume/issue
> > Especially if the publisher is a small professional society, the publisher level collection is very important: it establishes
> > the identity of the society, and is a collection the society might be proud to host on its own website in association with BibSoup.
> > Respect for such collections must be provided to accommodate the needs of publishers for recognition, and to engage them in contributing
> > data to BibSoup.
> > Each level in the publisher/journal/volume/issue hierarchy demands a metadata record, with attributes depending on level in the hierarchy. It should be
> > required that each such metadata record either contains the list of at least identifiers and human-readable titles of its children, or
> > provides a pointer to such a list as a BibJSON dataset, or both.
>
> Right, this sounds pretty complex :-) 

Actually no, it is very simple, and already accommodated by standard A&I (abstracting and indexing) service records.
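
To make the requirement concrete, here is a minimal Python sketch of one metadata record per level, where each record either lists its children inline (identifier plus human-readable title) or points to a list of them. The field names (`id`, `level`, `title`, `children`, `children_url`), the identifiers, and the example.org URL are all hypothetical; none of this is an agreed BibJSON schema.

```python
# Hypothetical collection metadata records; the field names are illustrative
# assumptions, not an agreed BibJSON schema.
LEVELS = ["publisher", "journal", "volume", "issue"]


def validate(record):
    """Check that a record lists its children inline, points to a list, or both."""
    if record["level"] not in LEVELS:
        raise ValueError("unknown level: %r" % record["level"])
    has_inline = bool(record.get("children"))
    has_pointer = bool(record.get("children_url"))
    if not (has_inline or has_pointer):
        raise ValueError("record %r has neither children nor children_url" % record["id"])
    for child in record.get("children", []):
        if "id" not in child or "title" not in child:
            raise ValueError("each child needs at least an id and a title")
    return True


# Publisher-level record with its children listed inline.
publisher = {
    "id": "ims",
    "level": "publisher",
    "title": "Institute of Mathematical Statistics",
    "children": [
        {"id": "ims/aop", "title": "Annals of Probability"},
        {"id": "ims/aos", "title": "Annals of Statistics"},
    ],
}

# Journal-level record that points to a separate dataset listing its volumes.
journal = {
    "id": "ims/aop",
    "level": "journal",
    "title": "Annals of Probability",
    "children_url": "http://example.org/ims/aop/volumes.json",  # placeholder URL
}

print(validate(publisher), validate(journal))
```

The same check applies unchanged at the volume and issue levels; only the level-specific attributes would differ.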

> I strongly suggest sitting down and writing out in detail the user stories here using our existing
> spreadsheet (or creating a new document). Doing user stories would also focus us on what people actually want to do rather than focusing
> on the details of the modelling, which would come out of the user stories rather than the other way round.

No problem there, and I can provide 3 basic user stories immediately.

1) Provide and maintain a decent index of the IMS/Bernoulli open access electronic journals. There are 5 of them, split 2/3 by technical
structure of currently available metadata. I have made tentative starts at work on this.

2) Same for all IMS/Bernoulli journals, which are semi-open, meaning the content is open if you know where to look for it, but otherwise
somewhat hidden.

3) For these journals, provide a citation index, meaning a complete list of items cited in the reference lists of these journals, with some
effort at deduplication of items.
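
For story 3, a crude first pass at a deduplicated citation index might look like the Python sketch below. The matching rule (normalized title plus year) is only an assumed heuristic, and the record fields, article identifiers and titles are placeholders rather than real data.

```python
import re


def citation_key(item):
    """Build a crude matching key from title and year (an assumed heuristic)."""
    title = re.sub(r"[^a-z0-9]+", " ", item.get("title", "").lower()).strip()
    return (title, str(item.get("year", "")))


def build_citation_index(reference_lists):
    """Merge reference lists into one index, keeping the first copy of each item."""
    index = {}
    for citing_id, references in reference_lists.items():
        for item in references:
            entry = index.setdefault(citation_key(item), {"item": item, "cited_by": []})
            if citing_id not in entry["cited_by"]:
                entry["cited_by"].append(citing_id)
    return list(index.values())


# Placeholder reference lists keyed by the citing article's identifier.
reference_lists = {
    "article-1": [{"title": "A Sample Title", "year": 1950}],
    "article-2": [
        {"title": "a sample  title.", "year": "1950"},  # same work, messier metadata
        {"title": "Another Sample Title", "year": 1953},
    ],
}

index = build_citation_index(reference_lists)
for entry in index:
    print(entry["item"]["title"], entry["cited_by"])
```

Real reference lists would need fuzzier matching than this, but the shape of the output (one entry per cited item, with the list of citing articles) is the point.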

Each of these exercises provides us with multiple examples of collections. Among the more interesting and novel collections are e.g.
"All articles ever cited in the Annals of Probability", "All articles ever cited in IMS Journals" and the same for books or other types.
Some of these collections may be defined programmatically, e.g. by queries in some query language, but they still deserve a metadata record saying what
the queries mean and enabling a user to browse around such collections and know what they are looking at.
This is very close to the issue of providing metadata records for every query to BibSoup. I think this is required, and that some queries may
be dignified by a higher standing than others. Those with higher standing should acquire more attributes as collections.
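
As an illustration of a query-defined collection that still carries its own metadata record, here is a small Python sketch. The record fields (`type`, `cited_in`), the collection fields (`title`, `description`, `query`, `members`) and the toy query matching are all assumptions made for the example, not BibSoup's actual query interface.

```python
# Hypothetical sketch: a collection is either an explicit list of record ids
# or a stored query, and in both cases carries a descriptive metadata record.
records = [
    {"id": "r1", "type": "article", "cited_in": ["aop"]},
    {"id": "r2", "type": "book", "cited_in": ["aop", "aos"]},
    {"id": "r3", "type": "article", "cited_in": ["aos"]},
]

cited_in_aop = {
    "id": "cited-in-aop-articles",
    "title": "All articles ever cited in the Annals of Probability",
    "description": "Records of type 'article' cited by at least one AoP paper.",
    "query": {"type": "article", "cited_in": "aop"},
}


def matches(record, query):
    """True if every query field equals the record field or is contained in it."""
    for field, wanted in query.items():
        value = record.get(field)
        if isinstance(value, list):
            if wanted not in value:
                return False
        elif value != wanted:
            return False
    return True


def members(collection, records):
    """Resolve a collection to the set of record ids it currently contains."""
    if "members" in collection:
        return set(collection["members"])
    return {r["id"] for r in records if matches(r, collection["query"])}


print(cited_in_aop["title"], sorted(members(cited_in_aop, records)))
```

A query with "higher standing" would simply gain more descriptive fields in its record; the way its members are resolved would not change.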

> > OAI-ORE http://www.openarchives.org/ore/ defines standards for the description and exchange of aggregations of Web resources.
> > This is too heavy for the needs of BibSoup, but something we must accommodate at least to the extent that it provides useful
> > input of collections metadata to BibSoup. Providing OAI-ORE compliant export from BibSoup is something we might be able to get grant support for,
> > but not something we should attempt without additional funding.
> > I am not sure how much uptake of OAI-ORE there has been, but I think there are some major nested collections e.g. part or
> > all of JSTOR which have been mapped to it.  This should be further investigated.
>
> Are you volunteering to do this research :-)

No. Need someone more familiar with RDF than I am. I have been burned too many times by RDF to want to touch 
it again soon.  Need a cook more experienced with hot flames. 
Any volunteers?

> > parts of the journal, but typically contains whole volumes and issues.
> > Whatever standard BibSoup adopts for collections MUST be able to accommodate the structure of these existing large high quality nested collections.
> Again we need user stories and use cases here with sufficient detail.

Fine, I will be glad to follow up as needed for the IMS journals use case. But it seems pretty obvious to me what the attributes
of such existing collections are, and it should not take long to specify collections metadata for them.

> > We won't immediately be able to drop all of JSTOR metadata into BibSoup, but I know I can do this for parts of JSTOR, especially the
> > metadata of particular publishers like IMS, Bernoulli, and some others where I have connections.  I am starting to work on IMS data,
> > and this could be available within weeks as publisher/journal/volume/issue metadata for upload to BibSoup.  This would provide a good test of collections capability.
> Great.

Will continue on this. I already have some examples of complete journal runs I can provide as soon as you are ready for 
managing such things in BibSoup. A huge amount of data like this can be scripted out of the Microsoft Academic Search API as well.
Peter and I will be talking to Microsoft folks about open licensing issues tomorrow. Hope to have something to report after that.

> > http://imstat.org/publications/ for the top level of the hierarchy with the list of journals. Note also the further structure of
> > IMS Journals and Publications
> > IMS Co-sponsored Journals and Publications
> > IMS Supported Journals
> > IMS Affiliated Journals
> > each of which defines a collection. These vary in their integrity and the level of interest for supporting them as a collection in BibSoup.
> > But there should be no technical obstacle to providing such support.
> > Other collections I am aware of:  Departmental collections, which naturally split by type e.g. (techreport, book, thesis, article, .... ) and author.
> > Collections are not always nested. The collection of all works of an author (or even all such works known to some source or collector)
> > is an important case which cross cuts all the other collections mentioned above.
>
> I really think we want to focus on the user stories first and then
> decide whether one concept / domain object (e.g. "Collection") is
> sufficient for the domain we are trying to cover.

OK. I am already convinced we need to go with a very general, flexible concept of "Collection" which is capable of adapting to
whatever use case we can throw at it. Mathematically, collections are nothing more or less than sets of records. We need a way to allow
users to simply specify whatever sets they care about. If there are simple boolean relations between sets, especially A subset of B and
A disjoint from B, there must be simple ways to indicate this in the collection metadata. It's as simple as that. We do not need any more use
cases to commit to respecting those structures in the data model. I think it is mostly a matter of *where* these structures are kept, and I think
the answer is simple: you either put this info directly in the meta record for a collection, or you link out to this info (e.g. as a query response)
from the collection record.
I think we should just try installing simple collection metadata support on the above lines, and then start exercising it with actual
use cases, for which the publisher collection and departmental collections are exemplary.
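
A minimal Python sketch of that idea: collections are plain sets of record identifiers, and a collection's metadata record may declare "subset of" and "disjoint from" relations that can be checked against the actual sets. The field names `subset_of` and `disjoint_from`, and all identifiers, are hypothetical.

```python
def check_relations(collection, member_sets):
    """Return the declared relations that the actual member sets violate."""
    mine = member_sets[collection["id"]]
    problems = []
    for other in collection.get("subset_of", []):
        if not mine <= member_sets[other]:  # set inclusion: A subset of B
            problems.append(("subset_of", other))
    for other in collection.get("disjoint_from", []):
        if not mine.isdisjoint(member_sets[other]):  # A and B share no records
            problems.append(("disjoint_from", other))
    return problems


# Collections as sets of record ids (placeholder data).
member_sets = {
    "ims": {"a1", "a2", "a3", "a4"},
    "ims/aop": {"a1", "a2"},
    "ims/aos": {"a3", "a4"},
}

# Metadata record declaring how this collection relates to others.
aop = {"id": "ims/aop", "subset_of": ["ims"], "disjoint_from": ["ims/aos"]}

print(check_relations(aop, member_sets))
```

Here the relations live directly in the collection record; the alternative of linking out would replace the inline lists with a pointer to wherever the relations are published.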

--Jim