Measure the maximum processing time for your transactional loop (from beginTransaction to commitTransaction) under load, including downstream calls and retries
Set transaction.timeout.ms on the producer to a value comfortably above that measured maximum — the default is 60000 ms (60 seconds)
Verify the broker's transaction.max.timeout.ms (default 900000 ms / 15 minutes); if transaction.timeout.ms exceeds this value, the initTransactions() call will fail with InvalidTxnTimeoutException
Set enable.idempotence=true (required for transactions) and configure transactional.id to a unique stable string per producer instance to allow the coordinator to fence zombie producers
Monitor the kafka_server_transaction_coordinator_metrics_transaction_failure_rate JMX metric; spikes indicate transactions are timing out or being fenced
On the consumer side, set isolation.level=read_committed so consumers only see records from committed transactions and are not exposed to aborted transaction data
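The settings named in the steps above can be sketched as plain java.util.Properties objects, using only the JDK so the example runs without a broker. This is a minimal sketch, not a complete client configuration: the class and helper names, the 2x headroom factor, the 20-second measured maximum, and the transactional.id value orders-tx-0 are all assumptions for illustration, and required settings such as bootstrap servers and serializers are left out.

```java
import java.util.Properties;

public class TxnConfigSketch {

    /** Broker-side default for transaction.max.timeout.ms (15 minutes). */
    public static final long DEFAULT_BROKER_MAX_TIMEOUT_MS = 900_000L;

    /**
     * Picks transaction.timeout.ms as measuredMaxMs * headroom and rejects values
     * above the broker limit, since such a value would fail at initTransactions().
     * The headroom factor is an illustrative assumption, not a Kafka setting.
     */
    public static long chooseTimeoutMs(long measuredMaxMs, double headroom, long brokerMaxMs) {
        if (measuredMaxMs <= 0 || headroom < 1.0) {
            throw new IllegalArgumentException("measuredMaxMs must be > 0 and headroom >= 1.0");
        }
        long timeoutMs = (long) Math.ceil(measuredMaxMs * headroom);
        if (timeoutMs > brokerMaxMs) {
            throw new IllegalArgumentException(
                "transaction.timeout.ms " + timeoutMs
                    + " exceeds transaction.max.timeout.ms " + brokerMaxMs);
        }
        return timeoutMs;
    }

    /** Producer settings named in the steps; other required settings are omitted. */
    public static Properties producerProps(String transactionalId, long timeoutMs) {
        Properties p = new Properties();
        p.setProperty("enable.idempotence", "true");
        p.setProperty("transactional.id", transactionalId);
        p.setProperty("transaction.timeout.ms", Long.toString(timeoutMs));
        return p;
    }

    /** Consumer setting named in the steps; other required settings are omitted. */
    public static Properties consumerProps() {
        Properties p = new Properties();
        p.setProperty("isolation.level", "read_committed");
        return p;
    }

    public static void main(String[] args) {
        // Assumed measurement: the slowest transactional loop took 20 s under load.
        long timeoutMs = chooseTimeoutMs(20_000L, 2.0, DEFAULT_BROKER_MAX_TIMEOUT_MS);
        System.out.println(producerProps("orders-tx-0", timeoutMs));
        System.out.println(consumerProps());
    }
}
```

Checking the value against the broker limit before the producer starts turns a runtime InvalidTxnTimeoutException into an earlier and more readable configuration error. If the broker overrides transaction.max.timeout.ms, pass that value instead of the default.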
Known gotchas
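A transaction aborted by the coordinator due to timeout does not notify the producer; the producer discovers the abort only on its next transactional call, such as send() or commitTransaction(), which fails with an epoch or fencing error (InvalidProducerEpochException, or ProducerFencedException on older client and broker versions). Because send() is asynchronous, the error may surface through the returned Future or the callback rather than from the send() call itself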
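Increasing transaction.timeout.ms lengthens the time a crashed producer's transaction can stay open: a replacement that calls initTransactions() with the same transactional.id fences the old producer and aborts its transaction immediately, but if no such replacement starts, the coordinator waits the full timeout before aborting, and read_committed consumers cannot read past the open transaction on the affected partitions until then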
Consumers with isolation.level=read_uncommitted (the default) will read records from aborted transactions, breaking exactly-once guarantees; always set read_committed on consuming applications in an EOS pipeline
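The first gotcha, where a timeout abort is only discovered on the producer's next call, can be illustrated with a toy in-memory model. This is a sketch under stated assumptions and not Kafka's implementation: TxnTimeoutModel, StaleEpochException, and expireIfOverdue are hypothetical names, time is passed in explicitly instead of read from a clock, and the only real behavior it mirrors is that the coordinator aborts an overdue transaction and bumps the producer epoch without calling the producer.

```java
public class TxnTimeoutModel {

    /** Hypothetical stand-in for the epoch/fencing error a real client would raise. */
    public static class StaleEpochException extends RuntimeException {
        public StaleEpochException(String message) {
            super(message);
        }
    }

    private final long timeoutMs;
    private int coordinatorEpoch = 0; // epoch the coordinator currently accepts
    private long openSinceMs = -1;    // -1 means no transaction is open

    public TxnTimeoutModel(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    public int currentEpoch() {
        return coordinatorEpoch;
    }

    /** Producer side: open a transaction at the given (simulated) time. */
    public void begin(int producerEpoch, long nowMs) {
        checkEpoch(producerEpoch);
        openSinceMs = nowMs;
    }

    /**
     * Coordinator side: abort the open transaction if it is overdue and bump the epoch.
     * Nothing here calls back into the producer, which is the point of the gotcha.
     */
    public boolean expireIfOverdue(long nowMs) {
        if (openSinceMs >= 0 && nowMs - openSinceMs > timeoutMs) {
            openSinceMs = -1;
            coordinatorEpoch++;
            return true;
        }
        return false;
    }

    /** Producer side: commit fails if the coordinator has moved to a newer epoch. */
    public void commit(int producerEpoch) {
        checkEpoch(producerEpoch);
        if (openSinceMs < 0) {
            throw new IllegalStateException("no open transaction");
        }
        openSinceMs = -1;
    }

    private void checkEpoch(int producerEpoch) {
        if (producerEpoch != coordinatorEpoch) {
            throw new StaleEpochException("producer epoch " + producerEpoch
                + " is stale, coordinator epoch is " + coordinatorEpoch);
        }
    }

    public static void main(String[] args) {
        TxnTimeoutModel coordinator = new TxnTimeoutModel(60_000L);
        int epoch = coordinator.currentEpoch();
        coordinator.begin(epoch, 0L);
        coordinator.expireIfOverdue(61_000L); // the producer is not told about this
        try {
            coordinator.commit(epoch);
            System.out.println("committed");
        } catch (StaleEpochException e) {
            System.out.println("commit rejected: " + e.getMessage());
        }
    }
}
```

In a real application the practical consequence is the same as in the model: treat a failure on commit or send as the first signal that the transaction is already gone, and do not assume that work done after the timeout elapsed was written atomically.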