[openspending-dev] Fwd: [Webmaster Tools] http://openspending.org/: Googlebot can't access your site

Nick Stenning nick.stenning at okfn.org
Mon Feb 17 17:09:57 UTC 2014



On Mon, Feb 17, 2014, at 9:39, Rufus Pollock wrote:
> This keeps on being the case - I've tried a bunch of things and suspect
> that something may be going on at our hoster. It looks like you can curl
> the robots.txt perfectly well (even when pretending to be googlebot). Any
> thoughts on how to fix very welcome!

The robots.txt served by openspending.org is invalid. See
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
for a formal definition of how Googlebot parses robots.txt.
Specifically, the right-hand side of every "Allow:" and "Disallow:" line
must be a "pathvalue":

    pathvalue = "/" path

Whatever else follows it, a pathvalue must therefore start with a literal
forward slash. The file we currently serve breaks that rule, because "*"
is not a valid pathvalue:

    $ curl -L http://openspending.org/robots.txt
    User-agent: *
    Allow: *
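
For anyone who wants to check a robots.txt against that rule locally, here
is a minimal sketch in Python. It is only an illustration of the
"pathvalue starts with a slash" constraint, not a reimplementation of
Googlebot's parser. The function name invalid_path_lines is made up for
this example, and I'm assuming that empty values should be left alone
(an empty "Disallow:" is widely used):

```python
def invalid_path_lines(robots_txt):
    """Return (line_number, line) pairs for Allow/Disallow lines whose
    value does not start with '/'.

    Hypothetical helper for illustration; it checks only the leading
    slash and nothing else about the file.
    """
    problems = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        # Drop any trailing comment and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field = field.strip().lower()
        value = value.strip()
        # Assumption: empty values are not flagged.
        if field in ("allow", "disallow") and value and not value.startswith("/"):
            problems.append((number, raw.strip()))
    return problems


# The file as served above.
served = "User-agent: *\nAllow: *\n"
print(invalid_path_lines(served))  # [(2, 'Allow: *')]
```

Run against the output above, this flags line 2. A file containing
"Allow: /" instead would pass the check.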

I've submitted a PR to fix this here:

    https://github.com/openspending/openspending/pull/768

Rather surprisingly, the invalid line appears to have been introduced in:

    https://github.com/openspending/openspending/commit/43055a77889efb509985fe9db1236c1cbc5da8ec

which would imply that robots.txt has been broken for over a year...!?

-N



-- 
Nick Stenning
Technical Director, Open Knowledge Foundation


