Query your database or CMS for all indexable URLs and their last-modified timestamps; filter out noindex, redirected, and 404 URLs before adding them to the sitemap
Split URLs into sitemap files of no more than 50,000 URLs each and no more than 50 MB per file (uncompressed) — use a sitemap index file to reference all individual sitemap files
Generate the sitemap index file listing each sitemap with its <loc> and <lastmod> using the W3C datetime format (e.g., '2026-06-12T00:00:00+00:00')
Gzip-compress each sitemap file to reduce bandwidth; the 50 MB limit applies to the uncompressed size, so check each file's size before compressing it
Store the generated files on a CDN or object storage (e.g., S3 + CloudFront) and update atomically — write new files before updating the index to avoid serving a broken index
Schedule generation with a cron job or event-driven trigger on content publishes; submit the updated sitemap index to Search Console and Bing Webmaster Tools after each generation
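The splitting, index generation, compression, and publish-order steps above can be sketched roughly as follows. This is a minimal illustration and not a definitive implementation: the function names (w3c, chunk, render_sitemap, render_index, build, publish), the file naming scheme, and the use of a local directory in place of object storage are all assumptions made for the example. It expects a list of (url, lastmod) pairs with timezone-aware datetimes, already filtered down to indexable URLs.

```python
import gzip
import os
from datetime import datetime, timezone
from xml.sax.saxutils import escape

MAX_URLS = 50_000             # per-file URL limit
MAX_BYTES = 50 * 1024 * 1024  # per-file size limit, measured uncompressed
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def w3c(dt: datetime) -> str:
    """Format an aware datetime as W3C datetime, e.g. 2026-06-12T00:00:00+00:00."""
    return dt.astimezone(timezone.utc).isoformat(timespec="seconds")


def chunk(entries, size=MAX_URLS):
    """Yield successive lists of at most `size` (url, lastmod) pairs."""
    for i in range(0, len(entries), size):
        yield entries[i:i + size]


def render_sitemap(entries) -> bytes:
    """Render one <urlset> document from (url, lastmod) pairs."""
    rows = "".join(
        f"<url><loc>{escape(url)}</loc><lastmod>{w3c(mod)}</lastmod></url>"
        for url, mod in entries
    )
    xml = f'<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="{NS}">{rows}</urlset>'
    data = xml.encode("utf-8")
    if len(data) > MAX_BYTES:
        raise ValueError("sitemap exceeds 50 MB uncompressed; use a smaller chunk size")
    return data


def render_index(files) -> bytes:
    """Render the <sitemapindex> document from (sitemap_url, lastmod) pairs."""
    rows = "".join(
        f"<sitemap><loc>{escape(loc)}</loc><lastmod>{w3c(mod)}</lastmod></sitemap>"
        for loc, mod in files
    )
    xml = f'<?xml version="1.0" encoding="UTF-8"?><sitemapindex xmlns="{NS}">{rows}</sitemapindex>'
    return xml.encode("utf-8")


def build(entries, base_url, size=MAX_URLS):
    """Return {filename: bytes}: gzipped sitemaps plus an uncompressed index."""
    out, index_rows = {}, []
    for n, part in enumerate(chunk(entries, size), start=1):
        name = f"sitemap-{n}.xml.gz"  # hypothetical naming scheme
        out[name] = gzip.compress(render_sitemap(part))
        # The index <lastmod> is the newest content change inside that file.
        index_rows.append((f"{base_url}/{name}", max(mod for _, mod in part)))
    out["sitemap-index.xml"] = render_index(index_rows)
    return out


def publish(files, directory, index_name="sitemap-index.xml"):
    """Write the sitemap files first, then swap the index into place last."""
    os.makedirs(directory, exist_ok=True)
    for name, data in files.items():
        if name != index_name:
            with open(os.path.join(directory, name), "wb") as fh:
                fh.write(data)
    tmp = os.path.join(directory, index_name + ".tmp")
    with open(tmp, "wb") as fh:
        fh.write(files[index_name])
    # os.replace swaps the index in one step, so readers never see a partial file.
    os.replace(tmp, os.path.join(directory, index_name))
```

On S3 or similar object storage the same ordering would apply: upload every sitemap file, then overwrite the index object as the final step. The size check runs before compression because the 50 MB limit is defined on the uncompressed file.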
Known gotchas
Setting <lastmod> to the current timestamp on every regeneration (rather than the true last-modified date of each URL's content) teaches Google to distrust your lastmod values, which strips the field of its value as a recrawl signal
The 50,000 URL limit applies per individual sitemap file, not the entire sitemap index; the sitemap index itself can reference up to 50,000 sitemap files, giving a theoretical ceiling of 2.5 billion URLs
Including URLs that return 4xx or 5xx responses in your sitemap wastes crawl budget and generates coverage errors in Search Console; always validate that each URL is accessible before including it in the pipeline output
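As a rough illustration of the gotchas above, the filter below keeps only pages that are safe to list and carries each page's own content timestamp through as lastmod instead of the generation time. The Page record and its field names are hypothetical; in practice they would map onto whatever your database or CMS exposes, and the status value would come from your own accessibility check.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

MAX_URLS_PER_FILE = 50_000
MAX_FILES_PER_INDEX = 50_000
# Theoretical ceiling for one sitemap index: 2.5 billion URLs.
MAX_URLS_PER_INDEX = MAX_URLS_PER_FILE * MAX_FILES_PER_INDEX


@dataclass
class Page:
    """Hypothetical record shape; adapt the fields to your own schema."""
    url: str
    content_modified: datetime          # when the content last changed
    status: int                         # HTTP status from a recent check
    noindex: bool = False
    redirects_to: Optional[str] = None


def sitemap_entries(pages):
    """Return (url, lastmod) pairs for indexable pages only.

    lastmod is the page's own content timestamp, never the current time.
    """
    return [
        (p.url, p.content_modified)
        for p in pages
        if p.status == 200 and not p.noindex and p.redirects_to is None
    ]
```

Keeping the filter in one place makes it easier to confirm that 4xx, 5xx, redirected, and noindex URLs never reach the generated files.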