Before starting, review the target site's Terms of Service and robots.txt to confirm that automated access for your use case is permitted; consult legal counsel for commercial data collection
Implement a per-domain request queue with a minimum delay between requests (derived from robots.txt Crawl-delay or a conservative default such as 1–5 seconds) using a FIFO queue and setTimeout
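Set a descriptive User-Agent identifying your bot, its purpose, and a contact URL so site operators can reach you: page.setExtraHTTPHeaders({ 'User-Agent': 'MyBot/1.0 (+https://mycompany.com/bot)' }) changes the request header only; passing userAgent to browser.newContext() also updates navigator.userAgent so the two agree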
Respect HTTP response signals: honor 429 (Too Many Requests) with exponential backoff, stop on 403 (Forbidden), and do not retry 404 unless the URL was expected to exist
Prefer fetching structured data feeds, APIs, sitemaps, or RSS/Atom where available instead of scraping rendered HTML — these are explicitly provided for programmatic access and impose less server load
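The per-domain queue from the steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the class name DomainQueue, the minDelayMs parameter, and the 2-second default are assumptions made for the example, and the delay should come from the site's Crawl-delay when one is published.

```javascript
// Hypothetical helper: runs tasks for each hostname one at a time, in FIFO
// order, leaving at least minDelayMs between the end of one task and the
// start of the next task for the same hostname.
class DomainQueue {
  constructor(minDelayMs = 2000) {
    this.minDelayMs = minDelayMs; // assumed default, not a recommendation from any site
    this.hosts = new Map();       // hostname -> { tasks, running, nextAllowedAt }
  }

  // Queue an async task under the URL's hostname.
  // The returned promise settles with the task's result or error.
  enqueue(url, task) {
    const host = new URL(url).hostname;
    if (!this.hosts.has(host)) {
      this.hosts.set(host, { tasks: [], running: false, nextAllowedAt: 0 });
    }
    const state = this.hosts.get(host);
    return new Promise((resolve, reject) => {
      state.tasks.push({ task, resolve, reject });
      if (!state.running) this._drain(state);
    });
  }

  // Process one host's queue until it is empty.
  async _drain(state) {
    state.running = true;
    while (state.tasks.length > 0) {
      const wait = state.nextAllowedAt - Date.now();
      if (wait > 0) await new Promise((r) => setTimeout(r, wait));
      const { task, resolve, reject } = state.tasks.shift(); // oldest first
      try {
        resolve(await task());
      } catch (err) {
        reject(err);
      }
      state.nextAllowedAt = Date.now() + this.minDelayMs;
    }
    state.running = false;
  }
}
```

With Playwright, a call might look like queue.enqueue(url, () => page.goto(url)). Sharing one DomainQueue instance among all workers in the same process keeps their combined request rate under the delay; workers in separate processes would need an external coordinator instead.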
Known gotchas
Coordinate rate limiting per domain across workers: multiple concurrent Playwright workers that each throttle independently can collectively violate crawl-delay requirements, so use a shared queue or semaphore across workers
A 429 response may carry a Retry-After header stating how long to wait before retrying, as either a number of seconds or an HTTP date; when it is present, honor that value instead of your own backoff period
Even technically permitted scraping can become a ToS violation if it degrades site performance for real users; watch for signs of strain such as rising response times or 5xx errors, and throttle proactively
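The response-handling rules (back off on 429, prefer Retry-After when present, stop on 403, do not retry 404) could be expressed along these lines. This is a sketch under stated assumptions: the function names parseRetryAfter, backoffDelay, and decide are hypothetical, and the 1-second base, 60-second cap, and 5-attempt limit are illustrative values only.

```javascript
// Convert a Retry-After header value to milliseconds to wait.
// The value is either a whole number of seconds or an HTTP date.
// Returns null when the header is missing or cannot be parsed.
function parseRetryAfter(value, nowMs = Date.now()) {
  if (value == null) return null;
  const trimmed = String(value).trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, dateMs - nowMs); // a date in the past means no wait
}

// Exponential backoff without jitter: base, 2x base, 4x base, ... up to maxMs.
// attempt is zero-based. The defaults are assumptions for this example.
function backoffDelay(attempt, baseMs = 1000, maxMs = 60000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Map a response status to an action for the crawler to take.
function decide(status, retryAfterHeader, attempt, maxAttempts = 5) {
  if (status === 429) {
    if (attempt >= maxAttempts) return { action: 'stop' };
    // Prefer the server's own hint; fall back to our backoff schedule.
    const waitMs = parseRetryAfter(retryAfterHeader) ?? backoffDelay(attempt);
    return { action: 'retry', waitMs };
  }
  if (status === 403) return { action: 'stop' }; // access refused: do not retry
  if (status === 404) return { action: 'skip' }; // missing page: move on
  return { action: 'continue' };
}
```

A caller would pass the status and Retry-After header of each response to decide, wait waitMs before retrying when the action is 'retry', and stop requesting from that domain when the action is 'stop'.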