[openbiblio-dev] First cut of AsyncUpload branch

Jim Pitman pitman at stat.Berkeley.EDU
Fri Feb 10 15:59:46 UTC 2012

Etienne and all.

This all sounds excellent to me and entirely consistent with what I wanted to see
from the bibserver/parser component.  

I have one suggestion, a practice I have found extremely useful in my parser work and which I'd like to see provided
as a small but expected supplement to each source-specific parser.

That is, each parser should contain a method which, when offered a url, will
respond with the canonical form of the url from which the parser knows how to scrape data, and the API call
for doing that. Users may or may not want the parser to continue from there.
This applies only to source-specific parsers. As an example, suppose a user cares about
the article commonly cited as "arXiv:1201.6450" or located at http://arxiv.org/abs/1201.6450 (abstract) and
either http://arxiv.org/pdf/1201.6450 or http://arxiv.org/pdf/1201.6450v1 (pdf). Given
any of the strings "arXiv:1201.6450", "http://arxiv.org/abs/1201.6450", "http://arxiv.org/pdf/1201.6450", "http://arxiv.org/pdf/1201.6450v1",
the parser should return the information that the canonical url associated with this article is "http://arxiv.org/abs/1201.6450"
(or whichever of the above the parser contributor decides is the most canonical url), and also that the call to the arXiv API to get XML metadata for that article
is some other url (which I know how to create, but don't specify here). These are preliminaries or alternatives to making that API call
to get the XML and thence the BibJSON.
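
A minimal sketch of such a method in Python, to make the arXiv example concrete. The function name canonicalize, the returned dictionary keys and the regular expression are illustrative assumptions, not part of any existing bibserver parser. The API url shown follows the public arXiv API id_list query form and is likewise an assumption about what a parser contributor might choose. Only new-style ids such as 1201.6450 are handled here.

```python
import re

# Hypothetical pattern: accepts "arXiv:<id>" or an arxiv.org abs/pdf url,
# with an optional version suffix (v1, v2, ...) and optional ".pdf" ending.
_ARXIV_RE = re.compile(
    r"^(?:arxiv:|https?://arxiv\.org/(?:abs|pdf)/)"
    r"(\d{4}\.\d{4,5})(?:v\d+)?(?:\.pdf)?$",
    re.IGNORECASE,
)


def canonicalize(ref):
    """Return canonical url and metadata API call for an arXiv reference.

    Returns None when the string is not recognized, so a caller can
    move on to another source-specific parser. No network call is made.
    """
    match = _ARXIV_RE.match(ref.strip())
    if match is None:
        return None
    arxiv_id = match.group(1)
    return {
        "id": arxiv_id,
        # Choice of the abs page as "most canonical" is up to the contributor.
        "canonical_url": "http://arxiv.org/abs/" + arxiv_id,
        # Assumed form of the arXiv API call that returns XML metadata.
        "api_url": "http://export.arxiv.org/api/query?id_list=" + arxiv_id,
    }
```

All four strings in the example above map to the same canonical url under this sketch, and an unrelated url yields None.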

A typical use case is that a user scrapes a bunch of arXiv ids from somewhere as part of a collection. Often the user will have other metadata about these
items, and no immediate need for the full arXiv metadata.
The user should be able to immediately map these ids to their canonical form for use in deduplication/disambiguation and the like.
This use of the id does not require actually making the API call and invoking the parser. It just requires knowing the canonical form of the id,
and that is something the parser should know about.
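
The deduplication use case can be sketched as follows, again in Python and without any API call. The helper arxiv_key is a hypothetical stand-in for whatever canonicalization method the arXiv parser exposes, and the record layout (a dict with an "id" field plus the user's own metadata) is assumed for illustration.

```python
import re


def arxiv_key(ref):
    # Hypothetical stand-in for the parser's canonicalization method:
    # pick out a new-style arXiv id at the end of the string.
    m = re.search(r"(\d{4}\.\d{4,5})(?:v\d+)?(?:\.pdf)?$", ref.strip())
    return "http://arxiv.org/abs/" + m.group(1) if m else None


def group_by_canonical_url(records, key_func=arxiv_key):
    """Group the user's records under the canonical url of their id."""
    groups = {}
    for record in records:
        key = key_func(record["id"])
        if key is None:
            continue  # not an id this parser knows about
        groups.setdefault(key, []).append(record)
    return groups


records = [
    {"id": "arXiv:1201.6450", "note": "from reading list"},
    {"id": "http://arxiv.org/pdf/1201.6450v1", "note": "from web page"},
]
groups = group_by_canonical_url(records)
# Both records land under "http://arxiv.org/abs/1201.6450".
```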

The main point is that such knowledge about accessing arXiv, or any other such resource, should be kept as methods in the
arXiv parser, rather than bibserver users or maintainers being expected to track this knowledge in some other place.

Consistent adoption of this convention has been working very well for me. It has enabled me to work more efficiently on parsers and
identifiers for multiple sources, and also to develop a prototype bookmarklet/server interaction: a user visiting a page containing biblio
data clicks a bookmarklet which passes the server the url of the page; the server then queries all the parser modules it has to see if any of them recognize
the url, and the first one which does responds with the metadata from the url.
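
The server side of that interaction might look roughly like this in Python. The class names, the canonical_url method and the PARSERS list are hypothetical, chosen only to illustrate the convention; this is not the prototype's actual code. The sketch stops at identifying the parser and the canonical url, where a real server would go on to fetch and return the metadata.

```python
import re


class ArxivParser:
    name = "arxiv"
    _pattern = re.compile(
        r"^https?://arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?$"
    )

    def canonical_url(self, url):
        # Return the canonical url if this parser recognizes it, else None.
        m = self._pattern.match(url)
        return "http://arxiv.org/abs/" + m.group(1) if m else None


class ExampleOrgParser:
    # Placeholder for a second source-specific parser.
    name = "example"

    def canonical_url(self, url):
        return url if url.startswith("http://example.org/") else None


PARSERS = [ArxivParser(), ExampleOrgParser()]


def dispatch(url, parsers=PARSERS):
    """Ask each parser in turn; the first one that recognizes the url wins."""
    for parser in parsers:
        canonical = parser.canonical_url(url)
        if canonical is not None:
            return parser.name, canonical
    return None
```

With this layout, adding support for a new source means adding one parser object to the list; the dispatch loop itself does not change.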


Jim Pitman
Professor of Statistics and Mathematics
University of California
367 Evans Hall # 3860
Berkeley, CA 94720-3860

ph: 510-642-9970  fax: 510-642-7892
e-mail: pitman at stat.berkeley.edu
URL: http://www.stat.berkeley.edu/users/pitman
