Steps

Before scraping, fetch https://example.com/robots.txt with an HTTP GET, using Playwright's request.newContext() or a plain HTTP library, then parse both the User-agent block that names your bot and the wildcard (*) block.
Extract the Disallow and Allow rules for your user-agent and build a predicate that returns false for any URL whose path matches a Disallow rule that no Allow rule overrides (under RFC 9309 the longest matching rule wins, and Allow wins a tie); only request URLs that pass the check.
Read the Crawl-delay value for your user-agent (or the wildcard block if none is specified) and enforce it as a minimum delay between successive requests to the same host, using a per-host queue with setTimeout or a rate-limiter utility. Crawl-delay is a non-standard directive that RFC 9309 does not define, so expect many files to omit it.
Set the User-Agent header in Playwright (for example, via the userAgent option of browser.newContext()) to a descriptive bot string that identifies your organization and purpose, as polite crawling conventions expect.
Log every URL skipped because of a Disallow rule so the crawl can be audited and its compliance confirmed.
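
The parsing, matching, and logging steps above can be sketched as follows. This is a minimal illustration, not a complete robots.txt implementation: the names RobotsGroup, parseRobots, selectGroup, isAllowed, and filterAllowed are hypothetical helpers, not part of Playwright. It assumes plain prefix matching only (no * or $ wildcards and no percent-encoding normalization) and expects the bot's product token (such as "examplebot") rather than a full User-Agent string.

```typescript
// Hypothetical helpers for illustration only; none of this is a Playwright API.
interface Rule { allow: boolean; path: string }
interface RobotsGroup { agents: string[]; rules: Rule[]; crawlDelaySec?: number }

// Parse robots.txt text into groups; consecutive User-agent lines share a group.
function parseRobots(text: string): RobotsGroup[] {
  const groups: RobotsGroup[] = [];
  let current: RobotsGroup | null = null;
  let lastWasAgent = false;
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, "").trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep < 0) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === "user-agent") {
      if (current === null || !lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if (field === "allow" || field === "disallow") {
      lastWasAgent = false;
      // An empty Disallow value means "nothing is disallowed", so skip it.
      if (current !== null && value !== "") {
        current.rules.push({ allow: field === "allow", path: value });
      }
    } else if (field === "crawl-delay") {
      lastWasAgent = false;
      const n = Number(value);
      if (current !== null && value !== "" && Number.isFinite(n) && n >= 0) {
        current.crawlDelaySec = n;
      }
    } // other fields (e.g. Sitemap) are ignored
  }
  return groups;
}

// Merge the groups that name our bot; fall back to the wildcard group otherwise.
function selectGroup(groups: RobotsGroup[], botToken: string): RobotsGroup {
  const name = botToken.toLowerCase();
  let matched = groups.filter(g => g.agents.includes(name));
  if (matched.length === 0) matched = groups.filter(g => g.agents.includes("*"));
  const withDelay = matched.find(g => g.crawlDelaySec !== undefined);
  return {
    agents: [name],
    rules: matched.reduce<Rule[]>((acc, g) => acc.concat(g.rules), []),
    crawlDelaySec: withDelay ? withDelay.crawlDelaySec : undefined,
  };
}

// Longest matching rule wins; Allow wins a tie; no matching rule means allowed.
// `pathAndQuery` is the URL's pathname plus search string, e.g. "/admin/x?y=1".
function isAllowed(group: RobotsGroup, pathAndQuery: string): boolean {
  let best: Rule | undefined;
  for (const rule of group.rules) {
    if (!pathAndQuery.startsWith(rule.path)) continue; // case-sensitive prefix
    if (
      best === undefined ||
      rule.path.length > best.path.length ||
      (rule.path.length === best.path.length && rule.allow)
    ) {
      best = rule;
    }
  }
  return best === undefined ? true : best.allow;
}

// Keep only allowed paths and log each skipped one for later audit.
function filterAllowed(group: RobotsGroup, paths: string[]): string[] {
  return paths.filter(p => {
    const ok = isAllowed(group, p);
    if (!ok) console.warn(`robots.txt: skipped ${p}`);
    return ok;
  });
}
```

The text passed to parseRobots would come from the fetch in the first step. With Playwright that could be the result of calling get() on a context created by request.newContext() and then text() on the response. The crawlDelaySec field of the selected group feeds the delay described in the third step.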

Known gotchas

robots.txt is a voluntary protocol with no technical enforcement. Respecting it is an ethical obligation and, in some jurisdictions, may also be a legal one; consult current legal guidance for your use case before scraping.
Path matching in robots.txt is case-sensitive under RFC 9309, so do not lowercase paths when normalizing, and handle trailing-slash edge cases: Disallow: /admin/ blocks /admin/ and everything beneath it, but not /Admin/ or the bare path /admin.
Crawl-delay values are in seconds and apply per host, not per URL: a Crawl-delay of 5 means no more than one request every 5 seconds to that host, regardless of how many workers are running.
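
The per-host behavior in the last gotcha can be sketched with a single shared throttle that hands out request slots, so that adding workers does not raise the request rate. HostThrottle and its methods are hypothetical names for illustration, not a Playwright or library API. The sketch assumes all workers run in one process and share one instance; coordinating across processes would need an external store.

```typescript
// Hypothetical helper: one shared instance hands out request slots per host.
class HostThrottle {
  // host -> earliest time (ms since epoch) at which the next request may start
  private nextSlot = new Map<string, number>();

  // Reserve the next free slot for `host` and return how many ms to wait for it.
  // `now` is injectable so the scheduling logic can be checked without timers.
  reserve(host: string, crawlDelaySec: number, now: number = Date.now()): number {
    const slot = Math.max(now, this.nextSlot.get(host) ?? 0);
    this.nextSlot.set(host, slot + crawlDelaySec * 1000);
    return slot - now;
  }

  // Convenience wrapper: resolves once the reserved slot has arrived.
  async wait(host: string, crawlDelaySec: number): Promise<void> {
    const ms = this.reserve(host, crawlDelaySec);
    if (ms > 0) {
      await new Promise<void>(resolve => setTimeout(resolve, ms));
    }
  }
}
```

Each worker would call wait() with the target host and the parsed Crawl-delay before issuing its request, for example immediately before page.goto(url). Because every caller reserves a slot from the same map, three workers hitting one host with a 5-second delay start their requests 0, 5, and 10 seconds apart rather than simultaneously.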

Parse robots.txt and respect crawl-delay directives in a Playwright-based scraper