[openbiblio-dev] First cut of AsyncUpload branch

Etienne Posthumus etienne.posthumus at okfn.org
Mon Feb 13 10:11:53 UTC 2012


On 10 February 2012 16:59, Jim Pitman <pitman at stat.berkeley.edu> wrote:
> I have one suggestion of practice I have found extremely useful in my parser work and which I'd like to see provided
> as a small but expected supplement to each source-specific parser.
>
> That is, that each parser should contain a method which when offered a url will
> respond with the canonical form of the url from which the parser knows how to scrape data, and the API call
> for doing that. Users may or may not want the parser to continue from there.

Jim, do I understand it correctly that you suggest some sort of
'string-sniffing' support in ALL the parsers?
IOW, when called by some agreed convention, e.g.
someparser -s "arXiv:1201.6450"

it returns some structured output along the lines of:

{ "recognised" : true/false,
"canonical":"http://arxiv.org/abs/1201.6450",
"metadata":"http://somemetadataurl"}

Can you contribute a simple Python script that does what you suggest?
(no parsing needed yet)
Then we can see if this is general enough to recommend as a convention
for other parser/scrapers.
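
To make the idea concrete, here is a rough sketch of what such a script might look like for arXiv only. It is just an illustration of the convention, not a proposal for the final interface: the function name sniff, the regular expression (new-style arXiv identifiers only) and the use of the arXiv export API as the metadata URL are all assumptions on my part.

```python
import json
import re
import sys

# Assumption: only new-style arXiv identifiers (e.g. 1201.6450) are handled,
# given bare, with an "arXiv:" prefix, or as an abs-page URL.
ARXIV_ID = re.compile(
    r"^(?:arxiv:|https?://arxiv\.org/abs/)?(\d{4}\.\d{4,5})(?:v\d+)?$",
    re.IGNORECASE,
)


def sniff(text):
    """Report whether this (hypothetical) arXiv parser recognises `text`."""
    match = ARXIV_ID.match(text.strip())
    if match is None:
        return {"recognised": False, "canonical": None, "metadata": None}
    arxiv_id = match.group(1)  # any version suffix such as "v2" is dropped
    return {
        "recognised": True,
        "canonical": "http://arxiv.org/abs/" + arxiv_id,
        # Assumption: the arXiv export API serves as the metadata source.
        "metadata": "http://export.arxiv.org/api/query?id_list=" + arxiv_id,
    }


# Mimics the calling convention: someparser -s "arXiv:1201.6450"
if __name__ == "__main__" and len(sys.argv) == 3 and sys.argv[1] == "-s":
    print(json.dumps(sniff(sys.argv[2])))
```

A string the parser does not know about simply comes back with "recognised" set to false, so a caller could try each parser in turn until one claims the string.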



