Distinguish between robots.txt Disallow and HTML noindex directives, and configure robots.txt user-agent entries to control AI training crawlers separately from search crawlers
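Understand the fundamental difference: 'Disallow' in robots.txt prevents crawling (Googlebot will not fetch the URL), while 'noindex' in a robots meta tag or X-Robots-Tag HTTP header prevents indexing (Googlebot fetches the URL but excludes it from the index). Because the noindex directive is only seen when the page is fetched, it has no effect on a URL that is also disallowed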
Add specific user-agent blocks for AI training crawlers (in the snippets here, '\n' marks a line break in the file): 'User-agent: GPTBot\nDisallow: /' blocks OpenAI's training crawler; 'User-agent: Google-Extended\nDisallow: /' opts the site out of Google's AI training use without affecting Search
Add 'User-agent: ClaudeBot\nDisallow: /' to block Anthropic's training crawler; use 'User-agent: CCBot\nDisallow: /' for Common Crawl
Keep your Googlebot rules in a separate user-agent block from AI crawler rules to avoid accidentally affecting Search indexing when editing AI crawler policies
Validate the final robots.txt using the robots.txt report in Google Search Console (the standalone robots.txt Tester has been retired) to confirm the file is fetched and parsed without errors after adding AI crawler blocks
Confirm with the URL Inspection tool that a URL disallowed for GPTBot is still crawlable by Googlebot; the two user-agent blocks must be independent
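The separation described in the steps above can be sanity-checked locally before deploying. The sketch below is one way to do that with Python's standard `urllib.robotparser`. The `ROBOTS_TXT` contents, the `crawl_report` helper, and the example.com URL are illustrative assumptions, and Python's parser only approximates how each real crawler matches rules, so treat it as a quick check rather than a substitute for Search Console.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one group per crawler token, so editing an
# AI crawler policy never touches the Googlebot group.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
"""


def crawl_report(robots_txt, url, agents):
    """Return {agent: True/False} for whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["Googlebot", "GPTBot", "Google-Extended", "ClaudeBot", "CCBot"]
    report = crawl_report(ROBOTS_TXT, "https://example.com/articles/1", agents)
    for agent, allowed in report.items():
        print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```

With the file above, only Googlebot should come back as allowed. If an edit to an AI crawler group ever flips the Googlebot result, the groups are no longer independent.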
Known gotchas
A URL disallowed in robots.txt for Googlebot can still appear in Google Search results if external sites link to it. Google shows the URL without a snippet and notes that the description is unavailable. To remove it from search results entirely, leave the page crawlable and apply noindex, since Googlebot has to fetch the page to see the directive
Robots.txt is advisory and not enforced: reputable crawlers (Googlebot, GPTBot, ClaudeBot) respect it, but rogue or non-compliant bots may ignore it entirely. It is not a security boundary
Adding 'Disallow: /' for Google-Extended does not affect Googlebot's ability to crawl and index pages for Search; they are separate robots.txt tokens with separate purposes, and Google-Extended controls AI training use rather than Search crawling
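The interaction between Disallow and noindex noted in the gotchas can be made concrete with a small check. This is a minimal sketch using only the Python standard library. The `expected_outcome` and `has_noindex` helpers and the outcome labels are hypothetical names introduced for illustration, and the logic is deliberately simplified: it looks only for a plain `noindex` token in the X-Robots-Tag header or a `<meta name="robots">` tag and ignores crawler-specific variants.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class _RobotsMetaFinder(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def has_noindex(headers, html):
    """True if a plain noindex token appears in the header or meta tag."""
    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_value = value
    header_tokens = {t.strip().lower() for t in header_value.split(",")}
    finder = _RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in header_tokens or "noindex" in finder.directives


def expected_outcome(robots_txt, agent, url, headers, html):
    """Simplified model: a disallowed URL is never fetched, so its
    noindex directive is never seen by the crawler."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(agent, url):
        return "not crawled"  # may still be listed as a bare URL
    if has_noindex(headers, html):
        return "crawled, excluded from index"
    return "crawled, eligible for indexing"
```

A page that returns "not crawled" while also carrying noindex is the misconfiguration described above: the directive is present but the crawler is never allowed to read it.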