Host robots.txt at the root of each hostname (https://example.com/robots.txt); it applies only to that exact origin (scheme, host, and port) and is not inherited by subdomains, so each subdomain needs its own file
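Define groups that start with one or more User-agent lines followed by Allow and Disallow rules; a crawler obeys only the most specific group that matches its name (Googlebot, Bingbot) and otherwise falls back to User-agent: *, so repeat any shared rules inside each named group; listing named groups before the catch-all is a readability convention, not a requirement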
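Verify that specific URLs are allowed or blocked as intended before deploying changes; Google retired the Search Console robots.txt Tester in 2023, so use the robots.txt report in Search Console to review the fetched file and any parse errors, and test individual URLs with a parser such as Google's open-source robots.txt library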
Avoid disallowing CSS, JavaScript, and font files that are necessary for rendering; Googlebot must be able to fetch page resources to evaluate the rendered content
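Add a Sitemap directive with the absolute URL of your sitemap to help crawlers discover it; the directive is independent of User-agent groups and can appear anywhere in the file, though placing it at the end is a common convention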
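A draft file can be sanity-checked locally before it is deployed. The sketch below uses Python's standard-library urllib.robotparser on a hypothetical robots.txt for example.com; the paths /private/ and /tmp/, the crawler name OtherBot, and the sitemap URL are assumptions for illustration only. This parser does plain prefix matching without wildcard support, so it may not agree with every crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft robots.txt; paths and sitemap URL are illustrative.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /private/
Disallow: /tmp/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot matches its own group, so the /tmp/ rule in the * group
# does not apply to it.
print(parser.can_fetch("Googlebot", "https://example.com/tmp/x"))      # True
print(parser.can_fetch("Googlebot", "https://example.com/private/a"))  # False

# OtherBot (a made-up crawler name) has no named group and falls back to *.
print(parser.can_fetch("OtherBot", "https://example.com/tmp/x"))       # False

# site_maps() requires Python 3.8 or later.
print(parser.site_maps())  # ['https://example.com/sitemap.xml']
```

Note how Googlebot may fetch /tmp/x even though the catch-all group disallows it: a named group replaces the catch-all group rather than adding to it.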
Known gotchas
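robots.txt blocks crawling but not indexing; a page disallowed in robots.txt can still appear in search results if other pages link to it. For indexing control, use a noindex robots meta tag or X-Robots-Tag response header, and leave the page crawlable so the crawler can actually see that directive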
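When Allow and Disallow rules both match a URL, the longest (most specific) matching path wins, regardless of the order of rules in the file; Allow takes precedence only when the matching rules are equally specific. This is the behavior described in RFC 9309 and followed by Google's crawlers; some older parsers apply the first matching rule instead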
URL-encoded and decoded paths are treated as different patterns by some crawlers; a Disallow for /search%3F is not guaranteed to block /search? in every implementation, so write rules in the form that actually appears in your URLs
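The longest-match precedence described above can be sketched in a few lines. This is a minimal illustration, not any crawler's actual implementation: it handles plain path prefixes only (no * or $ wildcards), and the function name is_allowed and the sample rules are hypothetical.

```python
from typing import List, Tuple

# (directive, path prefix), for example ("disallow", "/shop/")
Rule = Tuple[str, str]


def is_allowed(path: str, rules: List[Rule]) -> bool:
    """Longest matching prefix wins; on a tie Allow wins; no match means allowed."""
    best_len = -1
    allowed = True
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty rule path matches nothing
        is_allow = directive.lower() == "allow"
        longer = len(prefix) > best_len
        tie_for_allow = len(prefix) == best_len and is_allow
        if longer or tie_for_allow:
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# Hypothetical rules, deliberately listed in a misleading order.
rules: List[Rule] = [
    ("allow", "/shop/sale/"),
    ("disallow", "/shop/"),
    ("disallow", "/docs"),
    ("allow", "/docs"),
]

print(is_allowed("/shop/cart", rules))        # False: only /shop/ matches
print(is_allowed("/shop/sale/shoes", rules))  # True: /shop/sale/ is longer
print(is_allowed("/docs/intro", rules))       # True: equal length, Allow wins
print(is_allowed("/about", rules))            # True: no rule matches
```

Python's own urllib.robotparser returns the first matching rule in file order rather than the longest one, so results from that module can differ from this sketch when rules overlap.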