Enable checkpointing in the StreamExecutionEnvironment: set a checkpoint interval appropriate for your latency/durability tradeoff, set CheckpointingMode.EXACTLY_ONCE, and configure a state backend (RocksDB for state that may outgrow memory, the heap-based backend for small state).
Point checkpoint storage to a durable remote store (HDFS, S3, GCS) by configuring the checkpoint directory; local storage is lost on task manager failure.
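Set a minimum pause between checkpoints and a checkpoint timeout so that slow checkpoints cannot run back to back and starve normal processing. If a checkpoint takes longer than the timeout, Flink aborts it and counts it as failed; the next checkpoint is triggered on the regular schedule, and whether the job itself fails depends on the configured tolerable number of checkpoint failures.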
Use a sink that extends TwoPhaseCommitSinkFunction (or implements the newer Sink API with a Committer) to integrate exactly-once guarantees with transactional targets such as Kafka, JDBC, or Iceberg.
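Keep max concurrent checkpoints at 1 (the default) during normal operation to reduce state backend contention; raise it only if individual checkpoints routinely take longer than the checkpoint interval and you still need new checkpoints to start on schedule. More than one concurrent checkpoint is not compatible with a minimum pause between checkpoints.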
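Enable unaligned checkpoints if checkpoint barriers take a long time to reach the sinks because of backpressure. In-flight records are then stored as part of the checkpoint, which increases its size; also verify that your sink's pre-commit phase can tolerate the resulting ordering semantics.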
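The checkpoint settings from the steps above can also be written as Flink configuration options. The following is a minimal sketch, not a definitive setup: it assumes the option names used by Flink 1.x releases (some were renamed in later versions, so check the configuration reference for your version), the interval, pause, timeout, and failure-tolerance values are illustrative only, and the class name CheckpointSettings and the bucket path are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: collects checkpoint-related options as key/value pairs.
// Key names are assumed to match Flink 1.x; verify them for your Flink version.
class CheckpointSettings {

    static Map<String, String> exactlyOnce(String checkpointDir) {
        Map<String, String> conf = new LinkedHashMap<>();
        // Step 1: interval, mode, and state backend (illustrative values).
        conf.put("execution.checkpointing.interval", "60s");
        conf.put("execution.checkpointing.mode", "EXACTLY_ONCE");
        conf.put("state.backend", "rocksdb");
        conf.put("state.backend.incremental", "true");
        // Step 2: durable remote checkpoint storage.
        conf.put("state.checkpoints.dir", checkpointDir);
        // Step 3: pause and timeout; the failure tolerance is an added assumption.
        conf.put("execution.checkpointing.min-pause", "30s");
        conf.put("execution.checkpointing.timeout", "10min");
        conf.put("execution.checkpointing.tolerable-failed-checkpoints", "3");
        // Step 5: one checkpoint in flight at a time.
        conf.put("execution.checkpointing.max-concurrent-checkpoints", "1");
        // Step 6: only enable if barrier alignment is slow under backpressure.
        conf.put("execution.checkpointing.unaligned", "true");
        return conf;
    }

    // Renders the options as "key: value" lines, the flat form used in flink-conf.yaml.
    static String render(Map<String, String> conf) {
        return conf.entrySet().stream()
                .map(e -> e.getKey() + ": " + e.getValue())
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // Hypothetical bucket path; substitute your own HDFS, S3, or GCS location.
        System.out.println(render(exactlyOnce("s3://example-bucket/flink/checkpoints")));
    }
}
```

In Flink 1.x a map like this can typically be converted with Configuration.fromMap and passed to StreamExecutionEnvironment.getExecutionEnvironment, which should have the same effect as calling the corresponding setters on the environment and its CheckpointConfig; treat that as an assumption to confirm against your version's API.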
Known gotchas
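Exactly-once with a two-phase commit sink does not mean records are processed only once. When the job fails (for example because more checkpoints failed than the job tolerates), Flink rolls back to the last completed checkpoint and replays the records processed since then. Transactions opened after that checkpoint are aborted, so the replayed records are not duplicated for consumers that read only committed data (for Kafka, isolation.level=read_committed). A sink without transactional support must instead handle idempotent re-delivery.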
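RocksDB incremental checkpoints reduce checkpoint size, but each one references state files written by earlier checkpoints instead of copying them. Flink tracks and cleans up these shared files itself; manually deleting older checkpoint directories or shared files invalidates the restore path.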
Savepoints are not automatic; you must trigger them manually or via the REST API before upgrades. Regular checkpoints alone do not provide a stable restore point for application-level changes.
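Triggering a savepoint through the REST API before an upgrade could look roughly like the sketch below. It assumes the JobManager REST endpoint is reachable at http://localhost:8081 and uses Flink's POST /jobs/:jobid/savepoints endpoint with a JSON body naming the target directory; the class name SavepointRequest, the job ID, and the bucket path are hypothetical, and the hand-built JSON is only safe for simple paths without quotes or backslashes.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper that builds (but does not send) a savepoint trigger request.
class SavepointRequest {

    // JSON body for the trigger call; no escaping, so pass simple paths only.
    static String body(String targetDirectory, boolean cancelJob) {
        return "{\"target-directory\":\"" + targetDirectory + "\","
                + "\"cancel-job\":" + cancelJob + "}";
    }

    // Builds POST <restBase>/jobs/<jobId>/savepoints, leaving the job running.
    static HttpRequest build(String restBase, String jobId, String targetDirectory) {
        return HttpRequest.newBuilder()
                .uri(URI.create(restBase + "/jobs/" + jobId + "/savepoints"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(targetDirectory, false)))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder job ID and path; use the values of your own job.
        HttpRequest request = build(
                "http://localhost:8081",
                "a1b2c3d4e5f60718293a4b5c6d7e8f90",
                "s3://example-bucket/flink/savepoints");
        System.out.println(request.method() + " " + request.uri());
        // To actually send it against a running cluster:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

The trigger call is asynchronous: the response carries a request ID, and the savepoint's status and final location can then be polled at /jobs/:jobid/savepoints/:triggerid. Wait for the savepoint to complete before stopping or redeploying the job.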