Stand up a full sandbox environment that mirrors production: PSP sandbox account (Stripe test mode or equivalent), a test wallet service with test budgets, a mock merchant or test e-commerce store, and a test audit log.
Use PSP-provided test card numbers and test scenarios: simulate successful charges, declines by reason code (insufficient funds, do-not-honor, card expired, 3DS required), network errors, and delayed responses; your agent must handle each scenario correctly.
Test the idempotency layer explicitly: submit the same payment request twice with the same idempotency key and verify only one charge appears; then submit the same request twice with different keys, confirm that the PSP on its own would accept both as separate charges, and verify that your own deduplication layer catches the second one.
Test approval gate flows end-to-end: trigger the above-threshold path, simulate a human approving via the approval link, and verify the agent proceeds correctly; also simulate timeout and rejection paths.
Test the 3DS required scenario: use a test card that triggers 3DS, verify the agent suspends and notifies the human, simulate human completion, and verify the agent resumes correctly.
Before promoting to production, run a synthetic load test against the sandbox that mimics expected peak agent concurrency; validate that the wallet service's concurrency controls prevent overdrafts under load.
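The idempotency and overdraft checks in the steps above can be sketched against an in-memory stand-in for the wallet service. This is a minimal illustration, not a real wallet or PSP client: `SandboxWallet`, `InsufficientBudget`, and `run_concurrent_charges` are hypothetical names, and a single process-local lock stands in for whatever concurrency control the real wallet service uses.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class InsufficientBudget(Exception):
    """Raised when a debit would take the budget below zero."""


class SandboxWallet:
    """Hypothetical in-memory stand-in for the test wallet service."""

    def __init__(self, budget_cents: int) -> None:
        self._lock = threading.Lock()
        self.balance_cents = budget_cents
        self.charges: dict[str, int] = {}  # idempotency key -> amount charged

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # The lock makes key lookup, balance check, and debit one atomic step.
        with self._lock:
            if idempotency_key in self.charges:
                return "replayed"  # same key seen before: no second debit
            if amount_cents > self.balance_cents:
                raise InsufficientBudget(idempotency_key)
            self.balance_cents -= amount_cents
            self.charges[idempotency_key] = amount_cents
            return "charged"


def run_concurrent_charges(wallet: SandboxWallet, n_requests: int, amount_cents: int) -> int:
    """Fire n_requests distinct charges in parallel; return how many succeeded."""

    def attempt(i: int) -> bool:
        try:
            return wallet.charge(f"load-{i}", amount_cents) == "charged"
        except InsufficientBudget:
            return False

    with ThreadPoolExecutor(max_workers=16) as pool:
        return sum(pool.map(attempt, range(n_requests)))
```

A test built on this would replay one key twice and assert a single entry in `charges`, then call `run_concurrent_charges` with more requests than the budget allows and assert that `balance_cents` never goes negative. Against a real sandbox the same assertions would be made through the wallet service's API and the PSP dashboard rather than in-process state.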
Known gotchas
PSP sandbox environments do not perfectly replicate production network behavior: decline rates, 3DS trigger rates, and settlement timing all differ. Passing in the sandbox is necessary but not sufficient, and you should expect a shakeout period in production.
Test data hygiene matters: if your sandbox shares a database schema with staging or has a path to production infrastructure, a misconfigured environment variable can route real money through a 'test' flow. Enforce environment tagging at the wallet service level, with a hard block on real PSP credentials in non-production environments.
Agents under test may behave differently than in production if sandbox responses are unrealistically fast or always successful; inject artificial latency matching production p95/p99 figures, and failure rates matching production error rates, to stress-test retry and timeout logic.
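Two of the gotchas above lend themselves to small guards in test code. The sketch below assumes Stripe as the PSP: Stripe live-mode secret and restricted keys begin with `sk_live_` and `rk_live_`, so other PSPs would need their own check. The function and class names, the `environment` string, and the injected latency and failure values are all hypothetical, chosen for illustration.

```python
import random
import time


class UnsafeCredentialError(RuntimeError):
    """Raised when a live PSP key is configured outside production."""


class SimulatedPspError(Exception):
    """Raised by the fault injector to mimic a PSP or network failure."""


def assert_safe_credentials(environment: str, psp_api_key: str) -> None:
    """Hard block: refuse live-mode Stripe keys in any non-production environment."""
    if environment != "production" and psp_api_key.startswith(("sk_live_", "rk_live_")):
        raise UnsafeCredentialError(f"live PSP key configured in {environment!r}")


def with_injected_faults(call, latency_s, failure_rate, rng=None, sleep=time.sleep):
    """Wrap a sandbox call so it is delayed and fails at a chosen rate.

    latency_s and failure_rate are assumed to come from your own
    production measurements; nothing here derives them.
    """
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        sleep(latency_s)  # fixed delay before every call
        if rng.random() < failure_rate:
            raise SimulatedPspError("injected failure")
        return call(*args, **kwargs)

    return wrapped
```

Calling `assert_safe_credentials` once at wallet service startup makes a misconfigured environment fail loudly before any payment is attempted. Wrapping the sandbox PSP client with `with_injected_faults` lets the agent's retry and timeout paths run against slower, less reliable responses than the sandbox produces on its own.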