Enable fault-tolerant execution at the cluster level by setting retry-policy in config.properties. retry-policy=QUERY retries the entire query when a worker fails; retry-policy=TASK retries individual tasks, which is more granular. TASK retry is preferred for large ETL workloads.
Configure an exchange manager to spool intermediate exchange data to durable storage. In exchange-manager.properties, set exchange-manager.name=filesystem and exchange.base-directories=<path to shared storage, e.g., an S3 URI>; HDFS storage uses exchange-manager.name=hdfs instead. Place the file on the coordinator and every worker, and make sure all nodes can reach the path.
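Cap how many task failures are tolerated before the query is aborted. This guide uses max-failed-tasks with a starting value such as 100, adjusted to cluster stability. Confirm the property name against the documentation for your Trino version, which lists task-retry-attempts-per-task (TASK policy) and query-retry-attempts (QUERY policy) as the retry limits.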
For an S3-backed exchange, add exchange.s3.region, exchange.s3.aws-access-key, and exchange.s3.aws-secret-key to exchange-manager.properties, or omit the keys and use IAM role-based auth. Give the exchange bucket a lifecycle policy that auto-deletes temporary exchange data after a short retention period.
Test with a heavy query (large hash join or sort-heavy aggregation) and simulate a worker failure by killing a worker mid-query; verify Trino retries and completes the query rather than failing it
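The configuration steps above can be sketched as two small files. This is a minimal example rather than a definitive setup: the bucket name, region, and credential placeholders are hypothetical, and the retry limit uses task-retry-attempts-per-task, which is the TASK-policy limit listed in the Trino documentation; check property names against your Trino version before deploying.

```properties
# etc/config.properties (coordinator and every worker)
# Retry individual tasks instead of the whole query
retry-policy=TASK
# Attempts allowed for a single task before the query fails (documented default: 4)
task-retry-attempts-per-task=4
```

```properties
# etc/exchange-manager.properties (coordinator and every worker)
exchange-manager.name=filesystem
# Hypothetical bucket; must be reachable from all nodes
exchange.base-directories=s3://example-exchange-bucket
# Hypothetical region
exchange.s3.region=us-east-1
# Static credentials; leave both lines out when using IAM role-based auth
exchange.s3.aws-access-key=<access key>
exchange.s3.aws-secret-key=<secret key>
```

Comments sit on their own lines because a trailing # on a value line in a properties file is read as part of the value.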
Known gotchas
Fault-tolerant execution with TASK retry increases query latency because exchange data must be written to durable storage and read back. Enable it selectively for long-running ETL queries, not for short interactive queries where the overhead outweighs the benefit.
The exchange manager's storage must be highly available and accessible from all worker nodes simultaneously; a misconfigured or unavailable exchange store causes all fault-tolerant queries to fail immediately
Not all Trino connectors support fault-tolerant execution equally. Reads can be retried with any connector, but retried writes are supported only by a subset, so verify that the connector you are using (e.g., Iceberg, Hive, Delta) is compatible with the retry policy you choose. Some connectors require additional coordinator-side split caching configuration.