[openbiblio-dev] Open Biblio call tomorrow

Tue Feb 7 19:56:27 UTC 2012

On Tue, Feb 7, 2012 at 7:39 PM, Jim Pitman <pitman at stat.berkeley.edu> wrote:
> Excellent. Strong support from me. The sort of thing I want to be able to do with these
> big national collections is pull all records related to particular subjects or particular authors.
> Hopefully that will be facilitated by putting the data into Elasticsearch. This does not seem easy with the
> data as RDF. I have asked on the list many times how to do this, and never got a reply. Hopefully we can
> demo this functionality with a BibServer instance dedicated to each National Biblio.
> Mark, was that your idea? Or do you think you can merge all of these into a single BibSoup instance?

Yes, we are already doing this with some biblios.
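
For illustration, pulling all records for a given subject or author out of an
Elasticsearch index might look roughly like the sketch below. It only builds
the query body. The helper `build_query` and the field names `subject` and
`author.name` are assumptions made for the example, not BibServer's actual
schema or API; the `bool`, `match` and `match_all` clauses follow the
Elasticsearch query DSL as documented in current releases.

```python
import json
from typing import Any, Dict, List, Optional


def build_query(subject: Optional[str] = None,
                author: Optional[str] = None) -> Dict[str, Any]:
    """Build an Elasticsearch query body that requires every criterion given."""
    clauses: List[Dict[str, Any]] = []
    if subject:
        # "subject" is a hypothetical field name.
        clauses.append({"match": {"subject": subject}})
    if author:
        # "author.name" is a hypothetical field name.
        clauses.append({"match": {"author.name": author}})
    if not clauses:
        # No criteria supplied: match every record in the index.
        return {"query": {"match_all": {}}}
    return {"query": {"bool": {"must": clauses}}}


if __name__ == "__main__":
    print(json.dumps(build_query(subject="probability", author="Pitman"), indent=2))
```

The resulting dictionary would be sent as the body of a search request against
the index of whichever instance holds the national bibliography in question.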

> At some stage this must run us into performance issues, I am not sure when.

We could actually put LOTS of these together - but we are not doing
that yet. An instance per national bib for now.

> A clear goal. But we are not there yet, right? As far as I know we have not yet reached the milestone of a public facing BibServer
> not controlled by Mark/OKF. I continue to press for that. There are important social barriers to overcome as well as technical ones.
> I would love to see a growing list of where these installations are. Start with 1, learn from the socio-technical issues involved with that,
> and keep pressing.

Yes, it is doable, but nobody else has done it yet. But as it by definition is
outwith my / OKF control, we cannot force it. Peter is discussing with
PMC, you are discussing with Berkeley, we can but wait and see.

> I find BibSoup as presently set up an excellent proving ground for various biblio display efforts, like a sandbox, but I
> don't see much future for it without some further partitioning/replication and bibliographic control.
> It is too easy for people to post low quality datasets, or datasets under development. I think that is very useful, and a
> way of attracting users to BibServer, but it is not the same as well-organized, well-curated collections, which I hope we
> can start to see emerging soon. Things like the Malaria dataset and the Probability Web dataset should help focus on that.

Yes, on the way.

>> The interactive breakthrough will come (I think) when we can easily annotate records (I am thinking by adding new fields).
>
> Yes. But we need to be very thoughtful about the data model for this to work well. I think the right data model is to allow that
> agents like MathSciNet, PubMed, Google Scholar and others provide fairly stable records, and even more stable identifiers, and to distinguish
> what are essentially just copies of these records, which should be acknowledged to preserve provenance, and further derivative records.
> Users, in their own collections, should be able to easily supplement such a record from any source with a correction in a field or two, and with supplementary fields.
> But the more common and effective use case will be for a user to create a new composite record from whatever records of the same object are out there.
> Daniel Hook's Symplectic software does a great job of this merging.  And I have some passes at this too. Essentially, this creates a new record, which the user owns, and
> which inherits some properties from the source records, and other properties which might be edits by hand, or provided with some machine processing, e.g. automated name-disambiguation
> or subject classification. This is a difficult area of managing workflows for bibliographic data enhancement, but one which may be very rewarding.

In progress.
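
As a rough sketch of the composite-record idea described above (a new
user-owned record that inherits fields from source records, accepts hand
edits, and keeps provenance), something like the following could serve as a
starting point. Everything here is an assumption for illustration: the
function `merge_records`, the record layout (`agent`, `id`, `fields`), the
made-up identifiers, and the rule that the first source with a non-empty
value wins. It is not how BibServer or Symplectic actually merge records.

```python
from typing import Any, Dict, List, Optional


def merge_records(sources: List[Dict[str, Any]],
                  edits: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Create a composite record and note where each field value came from."""
    fields: Dict[str, Any] = {}
    provenance: Dict[str, str] = {}
    for src in sources:
        origin = f'{src["agent"]}:{src["id"]}'
        for key, value in src["fields"].items():
            # Assumed rule: the first source with a non-empty value wins.
            if value and key not in fields:
                fields[key] = value
                provenance[key] = origin
    # Hand edits override inherited values and are attributed to the user.
    for key, value in (edits or {}).items():
        fields[key] = value
        provenance[key] = "user"
    return {
        "fields": fields,
        "provenance": provenance,
        # Keep pointers to the stable source records that were combined.
        "derived_from": [f'{s["agent"]}:{s["id"]}' for s in sources],
    }


if __name__ == "__main__":
    # Made-up records and identifiers, for illustration only.
    a = {"agent": "MathSciNet", "id": "example-1",
         "fields": {"title": "A Title", "year": "1999"}}
    b = {"agent": "PubMed", "id": "example-2",
         "fields": {"title": "A title.", "journal": "Some Journal"}}
    print(merge_records([a, b], edits={"year": "2000"}))
```

Keeping provenance per field, rather than per record, is one way to tell
inherited values from hand corrections later on.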

Mark