[openspending-dev] Fwd: [Webmaster Tools] http://openspending.org/: Googlebot can't access your site
Nick Stenning
nick.stenning at okfn.org
Mon Feb 17 17:09:57 UTC 2014
On Mon, Feb 17, 2014, at 9:39, Rufus Pollock wrote:
> This keeps on being the case - I've tried a bunch of things and suspect
> that something may be going on at our hoster. It looks like you can curl
> the robots.txt perfectly well (even when pretending to be googlebot). Any
> thoughts on how to fix very welcome!
openspending.org has an invalid robots.txt. See
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
for a formal definition of how robots.txt is parsed by GoogleBot.
Specifically, "Allow:" and "Disallow:" lines have right-hand sides which
are "pathvalues":
pathvalue = "/" path
Regardless of any other considerations, this means that a pathvalue must
start with a literal forward slash. The file we currently serve breaks
that rule:
$ curl -L http://openspending.org/robots.txt
User-agent: *
Allow: *
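For checking other files against the same rule, here is a rough sketch in Python. The function name `invalid_path_lines` is my own invention, not part of any library. It only tests for the leading slash rather than implementing the full grammar, and it assumes an empty value (a bare "Disallow:") is acceptable.

```python
import re

# Directives whose right-hand side must be a pathvalue ("/" path).
PATH_DIRECTIVE = re.compile(r"^\s*(allow|disallow)\s*:\s*(.*?)\s*$", re.IGNORECASE)


def invalid_path_lines(robots_txt):
    """Return (line_number, line) pairs whose Allow/Disallow value lacks a leading "/"."""
    bad = []
    for number, line in enumerate(robots_txt.splitlines(), start=1):
        rule = line.split("#", 1)[0]  # drop any trailing comment
        match = PATH_DIRECTIVE.match(rule)
        if not match:
            continue  # not an Allow/Disallow line (e.g. User-agent)
        value = match.group(2)
        # Assumption: an empty value such as "Disallow:" is fine, so skip it.
        if value and not value.startswith("/"):
            bad.append((number, line.strip()))
    return bad


if __name__ == "__main__":
    # The two lines currently served by openspending.org.
    print(invalid_path_lines("User-agent: *\nAllow: *\n"))
```

Run against the two lines above, this should flag line 2 ("Allow: *") and nothing else.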
I've submitted a PR to fix the robots.txt here:
https://github.com/openspending/openspending/pull/768
Rather surprisingly, this appears to have been introduced at:
https://github.com/openspending/openspending/commit/43055a77889efb509985fe9db1236c1cbc5da8ec
which would imply that this has been broken for over a year...!?
-N
--
Nick Stenning
Technical Director, Open Knowledge Foundation