Before scraping, fetch https://example.com/robots.txt with an HTTP GET using Playwright's request.newContext() or a plain HTTP library, then parse the User-agent group that names your bot; fall back to the wildcard (*) group only if no group names it
Extract the Disallow and Allow rules for your user-agent and build a predicate that returns false for any URL whose path is blocked by a Disallow rule after resolving Allow overrides (the longest matching rule wins, and Allow wins a tie); only proceed with URLs that pass the check
Read the Crawl-delay value for your user-agent (or the wildcard block if none is specified) and enforce it as a minimum delay between successive requests to the same host using a per-host queue with setTimeout or a rate-limiter utility
Set the User-Agent header in Playwright to a descriptive bot string that identifies your organization and purpose, as required by polite crawling conventions
Log skipped URLs (those blocked by Disallow) for auditability so the crawl behavior can be reviewed and confirmed compliant
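The parsing, matching, and logging steps above can be sketched as follows. This is a minimal illustration, not a complete robots.txt implementation: the names parseRobots, selectGroup, patternToRegExp, isAllowed, and filterUrls are hypothetical, the bot name ExampleBot and the sample rules are made up, and the sketch assumes the robots.txt body has already been fetched as a string. It picks a single group rather than merging several groups that name the same bot.

```typescript
type Rule = { allow: boolean; path: string };
type Group = { agents: string[]; rules: Rule[]; crawlDelay?: number };

// Parse robots.txt text into groups of user-agents and their rules.
function parseRobots(text: string): Group[] {
  const groups: Group[] = [];
  let current: Group | null = null;
  let lastWasAgent = false;
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, "").trim(); // strip comments
    const sep = line.indexOf(":");
    if (sep < 0) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === "user-agent") {
      // Consecutive User-agent lines share one group.
      if (!current || !lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
      continue;
    }
    lastWasAgent = false;
    if (!current || value === "") continue; // an empty Disallow blocks nothing
    if (field === "allow" || field === "disallow") {
      current.rules.push({ allow: field === "allow", path: value });
    } else if (field === "crawl-delay" && Number.isFinite(Number(value))) {
      current.crawlDelay = Number(value); // seconds
    }
  }
  return groups;
}

// Prefer the group that names our bot; otherwise use the wildcard group.
function selectGroup(groups: Group[], botName: string): Group | undefined {
  const name = botName.toLowerCase();
  return (
    groups.find((g) => g.agents.includes(name)) ??
    groups.find((g) => g.agents.includes("*"))
  );
}

// Translate a rule path ("*" wildcard, trailing "$" anchor) into a RegExp.
function patternToRegExp(pattern: string): RegExp {
  const anchored = pattern.endsWith("$");
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const source = body
    .split("*")
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp("^" + source + (anchored ? "$" : ""));
}

// Longest matching rule wins; on equal length, Allow wins.
function isAllowed(group: Group | undefined, path: string): boolean {
  let best: Rule | undefined;
  for (const rule of group?.rules ?? []) {
    if (!patternToRegExp(rule.path).test(path)) continue;
    const longer = !best || rule.path.length > best.path.length;
    const tie = best !== undefined && rule.path.length === best.path.length;
    if (longer || (tie && rule.allow)) best = rule;
  }
  return best ? best.allow : true; // no matching rule means allowed
}

// Keep allowed URLs and record blocked ones for later audit.
function filterUrls(urls: string[], group: Group | undefined, skipped: string[]): string[] {
  return urls.filter((u) => {
    const { pathname, search } = new URL(u);
    const ok = isAllowed(group, pathname + search);
    if (!ok) skipped.push(u);
    return ok;
  });
}

// Made-up robots.txt content and URLs, for illustration only.
const sample = "User-agent: *\nCrawl-delay: 5\nDisallow: /admin/\nAllow: /admin/public/";
const group = selectGroup(parseRobots(sample), "ExampleBot");
const skipped: string[] = [];
const allowedUrls = filterUrls(
  [
    "https://example.com/admin/users",
    "https://example.com/admin/public/faq",
    "https://example.com/Admin/users",
    "https://example.com/blog",
  ],
  group,
  skipped,
);
console.log("skipped by robots.txt:", skipped);
```

With the sample rules, only https://example.com/admin/users is skipped: the longer Allow rule rescues /admin/public/faq, and /Admin/users does not match the lowercase rule at all. In a real crawl the text passed to parseRobots would come from the robots.txt fetch, and the skipped array would be written to whatever log the crawl already keeps.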
Known gotchas
robots.txt is a voluntary protocol with no technical enforcement; respecting it is an ethical and, in some jurisdictions, a legal obligation, so always consult current legal guidance for your use case before scraping
Path matching in robots.txt is case-sensitive, so do not lowercase paths when normalizing them, and handle trailing-slash edge cases (Disallow: /admin/ blocks /admin/users but matches neither /Admin/ nor /admin without the trailing slash)
Crawl-delay values are in seconds and apply per host, not per URL; a 5-second crawl-delay means no more than one request every 5 seconds to that origin, regardless of how many workers are running
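One way to make the per-host delay hold across concurrent workers is to route every request through a single shared gate. The sketch below is an assumption-laden illustration rather than a prescribed design: HostGate, reserve, and wait are hypothetical names, the 5-second delay is the example value from above, and the injectable clock exists only so the example runs without real waiting.

```typescript
// Shared per-host gate: every worker reserves a slot from the same object,
// so the minimum spacing holds no matter how many workers are running.
class HostGate {
  private nextSlot = new Map<string, number>();
  private delayMs: (host: string) => number;
  private now: () => number;

  constructor(delayMs: (host: string) => number, now: () => number = () => Date.now()) {
    this.delayMs = delayMs;
    this.now = now;
  }

  // Reserve the next free slot for this URL's host; returns ms to wait.
  reserve(url: string): number {
    const host = new URL(url).host;
    const t = this.now();
    const slot = Math.max(t, this.nextSlot.get(host) ?? 0);
    this.nextSlot.set(host, slot + this.delayMs(host));
    return slot - t;
  }

  // Resolve once this caller's reserved slot has arrived.
  async wait(url: string): Promise<void> {
    const ms = this.reserve(url);
    if (ms > 0) await new Promise<void>((resolve) => setTimeout(resolve, ms));
  }
}

// Crawl-delay is given in seconds; setTimeout expects milliseconds.
const crawlDelaySeconds = 5;
let fakeNow = 0; // fake clock so the example needs no real waiting
const gate = new HostGate(() => crawlDelaySeconds * 1000, () => fakeNow);

// Three workers ask for the same host at the same instant.
const waits = [
  gate.reserve("https://example.com/a"),
  gate.reserve("https://example.com/b"),
  gate.reserve("https://example.com/c"),
];
// A different host has its own schedule and is not held back.
const otherHost = gate.reserve("https://example.org/a");
console.log(waits, otherHost); // waits is [0, 5000, 10000]; otherHost is 0
```

In a crawler, each worker would call await gate.wait(url) immediately before issuing its request. Because the slot is reserved synchronously, two workers can never claim the same slot for one host.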