[open-bibliography] Metadata aggregators, discovery tools and libraries

Sun Jan 23 05:47:44 UTC 2011

Peter Murray-Rust <pm286 at cam.ac.uk> wrote:

> My immediate analysis is that they have collected links to the publishers'
> sites. They do not have bibliographic records per se. I think we will still
> have to scrape pages.  Typical example:
>
> ACS Chemical biology:
> http://www.journaltocs.ac.uk/index.php?action=browse&subAction=subjects&publisherID=32&journalID=128&pageb=1&userQueryID=&sort=
>
>    - *Identification of SR8278, a Synthetic Antagonist of the Nuclear Heme
>    Receptor REV-ERB*<http://www.journaltocs.ac.uk/articleHomePage.php?id=2947548&userID=0>
>       - *Authors:* *Douglas Kojetin; Yongjun Wang, Theodore M. Kamenecka
>       Thomas P. Burris*
>       *Abstract:*  [image, deleted by PMR]
>       -
>       - ACS Chemical Biology
>       DOI : 10.1021/cb1002575
>       *PubDate:* 2010-11-10T14:18:32Z
>       [image: Export to
> Refworks]<http://www.refworks.com/express/expressimport.asp?vendor=JournalTOCs&filter=Refworks%20Tagged%20Format&encoding=65001&url=http%3A//www.journaltocs.ac.uk/exports/refworks.php%3FitemID=2947548_0>

This is their interface for human viewing. They also have an API, documented at http://www.journaltocs.ac.uk/index.php?action=api
which returns RSS. For example, for the item above, http://www.journaltocs.ac.uk/api/journals/128?output=articles returns:

<item rdf:about="http://feedproxy.google.com/~r/acs/acbcct/~3/JLzxwbDsGHQ/cb1002575">
<title>Identification of SR8278, a Synthetic Antagonist of the Nuclear Heme Receptor REV-ERB</title>
<link>http://feedproxy.google.com/~r/acs/acbcct/~3/JLzxwbDsGHQ/cb1002575</link>
<description><img src="http://pubs.acs.org/appl/literatum/publisher/achs/journals/content/acbcct/0/acbcct.ahead-of-print/cb1002575/aop/images/medium/cb-2010-002575_0005.gif" alt="TOC Graphic"/><div><cite>ACS Ch>
<a href="http://feeds.feedburner.com/~ff/acs/acbcct'a=JLzxwbDsGHQ:a6oKlqdggtw:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/acs/acbcct'd=yIl2AUoC8zA" border="0"></img></a>
<img src="http://feeds.feedburner.com/~r/acs/acbcct/~4/JLzxwbDsGHQ" height="1" width="1"/></description>
<dc:identifier>http://feedproxy.google.com/~r/acs/acbcct/~3/JLzxwbDsGHQ/cb1002575</dc:identifier>
<dc:creator>Douglas Kojetin Yongjun Wang, Theodore M. Kamenecka Thomas P. Burris</dc:creator>
<dc:date>2010-11-10T14:18:32Z</dc:date>
<dc:source>ACS Chemical Biology, Vol. , No. (2010) pp. - </dc:source>
<dc:publisher>American Chemical Society (ACS)</dc:publisher>
<prism:PublicationName>ACS Chemical Biology</prism:PublicationName>
<prism:publicationDate>2010-11-10T14:18:32Z</prism:publicationDate>
<content:encoded><![CDATA[<a href="http://feedproxy.google.com/~r/acs/acbcct/~3/JLzxwbDsGHQ/cb1002575">Identification of SR8278, a Synthetic Antagonist of the Nuclear Heme Receptor REV-ERB</A> Douglas Kojetin Yongjun Wang, Theodore>
<a href="http://feeds.feedburner.com/~ff/acs/acbcct'a=JLzxwbDsGHQ:a6oKlqdggtw:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/acs/acbcct'd=yIl2AUoC8zA" border="0"></img></a>
<img src="http://feeds.feedburner.com/~r/acs/acbcct/~4/JLzxwbDsGHQ" height="1" width="1"/>]]></content:encoded>
</item>
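
To make "usable" concrete, here is a minimal sketch of pulling the bibliographic fields out of such an item with Python's standard library. The sample feed is an abridged copy of the item above; the wrapping rdf:RDF element and the namespace declarations are my assumptions (the prism namespace URI in particular), since the excerpt does not show them. To avoid depending on those URIs, the sketch matches elements by local name only. The names parse_items, local_name and WANTED are illustrative, not part of any JournalTOCs tooling.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the item above. The rdf:RDF wrapper and the namespace
# declarations are assumptions; the excerpt in this message does not show them.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:prism="http://prismstandard.org/namespaces/1.2/basic/">
<item rdf:about="http://feedproxy.google.com/~r/acs/acbcct/~3/JLzxwbDsGHQ/cb1002575">
<title>Identification of SR8278, a Synthetic Antagonist of the Nuclear Heme Receptor REV-ERB</title>
<link>http://feedproxy.google.com/~r/acs/acbcct/~3/JLzxwbDsGHQ/cb1002575</link>
<dc:creator>Douglas Kojetin Yongjun Wang, Theodore M. Kamenecka Thomas P. Burris</dc:creator>
<dc:date>2010-11-10T14:18:32Z</dc:date>
<dc:publisher>American Chemical Society (ACS)</dc:publisher>
<prism:PublicationName>ACS Chemical Biology</prism:PublicationName>
</item>
</rdf:RDF>"""

# Element names (without namespace) that we keep from each item.
WANTED = {"title", "link", "identifier", "creator", "date",
          "source", "publisher", "PublicationName"}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_items(xml_text):
    """Return one dict of bibliographic fields per <item> in the feed."""
    root = ET.fromstring(xml_text)
    records = []
    for elem in root.iter():
        if local_name(elem.tag) != "item":
            continue
        record = {}
        for child in elem:
            name = local_name(child.tag)
            if name in WANTED and child.text:
                record[name] = child.text.strip()
        records.append(record)
    return records


records = parse_items(SAMPLE)
for record in records:
    print(record["title"], "|", record["creator"])
```

Note that the creator field comes back as a single string with inconsistent separators, exactly as in the feed, so splitting it into individual authors would need extra care.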

> The actual journal has
>
> Identification of SR8278, a Synthetic Antagonist of the Nuclear Heme
> Receptor REV-ERB
>
> Douglas Kojetin, Yongjun Wang, Theodore M. Kamenecka, and Thomas P.
> Burris*<http://pubs.acs.org/doi/abs/10.1021/cb1002575#cor1>
> The Scripps Research Institute, Jupiter, Florida 33458, United States
> ACS Chem. Biol., Article ASAP
> *DOI: *10.1021/cb1002575
> Publication Date (Web): November 2, 2010
> Copyright © 2010 American Chemical Society
>
> Note the affiliation (not in JournalTOCs) - which is very important for us.

Right, the affiliation is missing from the RSS.  Still, the data is usable for a great many purposes, and if an author claims
the record with Krichel's AuthorClaim or a similar service, we obtain something better than the affiliation.

> To extract the JournalTOCs metadata we have to:
> * iterate over subjects (ca 20)
> * iterate over journals (which includes multiple pages)
We don't have to page: it's just one call to the API per journal, as above.
> * extract the data for the issue
Not too bad from the RSS.
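
A rough sketch of that loop, assuming the URL pattern shown earlier in this message (api/journals/<id>?output=articles) holds for every journal id. The function names and the injected fetch callable are hypothetical; the fetch is passed in so the sketch runs without touching the network.

```python
from typing import Callable, Dict, Iterable

# URL pattern taken from the example call earlier in this message;
# that it generalises to every journal id is an assumption.
API_BASE = "http://www.journaltocs.ac.uk/api/journals"


def articles_url(journal_id: int) -> str:
    """Build the one-call-per-journal API URL."""
    return f"{API_BASE}/{journal_id}?output=articles"


def harvest(journal_ids: Iterable[int],
            fetch: Callable[[str], str]) -> Dict[int, str]:
    """Fetch the current articles feed for each journal, one call each."""
    return {jid: fetch(articles_url(jid)) for jid in journal_ids}


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP GET, so the example is self-contained."""
    return f"<!-- feed for {url} -->"


feeds = harvest([128], fake_fetch)
print(articles_url(128))
```

In real use, fetch could be a thin wrapper around urllib.request.urlopen, with polite delays between calls.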

> I can't see any evidence of back issues. It seems that they only expose the
> current issues from these publishers (maybe I'm missing something).

This is more serious. But it appears there is nothing to stop us taking e.g. weekly or monthly snapshots
and aggregating a cache of back issues.
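
The snapshot idea could look something like the following: each harvest is merged into a cache keyed by the item's identifier, so repeated items are stored once and older items persist after they drop out of the live feed. This is only a sketch; the record layout (dicts with "identifier" or "link" keys) and the name merge_snapshot are assumptions for illustration, and the sample records are made up.

```python
def merge_snapshot(cache, snapshot):
    """Add records from one snapshot to the cache; return how many were new.

    Records are keyed by their identifier, falling back to the link.
    Records with neither are skipped, since they cannot be deduplicated.
    """
    added = 0
    for record in snapshot:
        key = record.get("identifier") or record.get("link")
        if key is None:
            continue
        if key not in cache:
            cache[key] = record
            added += 1
    return added


# Two made-up weekly snapshots that overlap on item "b".
cache = {}
week1 = [{"identifier": "a", "title": "First"},
         {"identifier": "b", "title": "Second"}]
week2 = [{"identifier": "b", "title": "Second"},
         {"identifier": "c", "title": "Third"}]

n1 = merge_snapshot(cache, week1)
n2 = merge_snapshot(cache, week2)
print(n1, n2, len(cache))
```

Keeping the first copy of each record is a deliberate simplification; a real cache might prefer the latest copy, in case the publisher corrects metadata after first release.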

> I am not sure whether they have taken the publishers' RSS or whether they are
> specifically given TOCs by the publishers.  If the former then presumably we can also do that.

Right, but we do not need more than one agent to acquire and aggregate publishers' data, whatever format it
arrives in. If it is exported in a common format with an API, that seems like a great service.
Certainly for my purposes I would rather deal with aggregated data from an API like this, with a common format and license,
than deal with publishers separately.
--Jim

----------------------------------------------
Jim Pitman
Director, Bibliographic Knowledge Network Project
http://www.bibkn.org/

Professor of Statistics and Mathematics
University of California
367 Evans Hall # 3860
Berkeley, CA 94720-3860

ph: 510-642-9970  fax: 510-642-7892
e-mail: pitman at stat.berkeley.edu
URL: http://www.stat.berkeley.edu/users/pitman