Set the state backend to EmbeddedRocksDBStateBackend, either in the job code or with state.backend: rocksdb in flink-conf.yaml
Enable incremental checkpointing by setting state.backend.incremental to true so that only changed SST files are uploaded to the checkpoint store on each checkpoint
Configure the checkpoint interval and timeout to balance recovery point objective against checkpoint overhead
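Set the number of retained checkpoints (state.checkpoints.num-retained) and keep state.backend.rocksdb.memory.managed enabled so that RocksDB takes its native memory from the TaskManager's managed (off-heap) memory budget; RocksDB does not allocate from the JVM heap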
After a job failure, verify that Flink restores from the latest completed incremental checkpoint and that the restored state matches the expected key count
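The configuration steps above can be sketched as a flink-conf.yaml fragment. This is a minimal sketch, not a recommended production setup: the key names follow the Flink 1.x style used in this guide (newer releases rename some of them), the checkpoint directory is a hypothetical placeholder, and the interval, timeout, and retention values are illustrative assumptions to adjust for your own recovery point objective.

```yaml
# Use the RocksDB state backend (EmbeddedRocksDBStateBackend)
state.backend: rocksdb

# Upload only new or changed SST files on each checkpoint
state.backend.incremental: true

# Hypothetical checkpoint location; replace with your own durable storage
state.checkpoints.dir: s3://example-bucket/flink/checkpoints

# Illustrative values: shorter intervals reduce data replayed on recovery
# but increase checkpoint overhead
execution.checkpointing.interval: 60s
execution.checkpointing.timeout: 10min

# Keep several completed checkpoints available for recovery
state.checkpoints.num-retained: 3

# Let RocksDB draw its native memory from Flink managed (off-heap) memory
state.backend.rocksdb.memory.managed: true
```

Settings in the job code override the cluster configuration, so check both places if the effective values differ from what you expect.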
Known gotchas
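Incremental checkpoints are not self-contained: each one can reference SST files that were first uploaded by earlier checkpoints and are kept in a shared location. Flink tracks these references and its own retention cleanup removes a file only when no retained checkpoint still needs it, but manually deleting older checkpoint directories or shared files can cause recovery failures because a later checkpoint may still depend on them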
RocksDB compaction runs in background threads and can cause spikes in I/O and CPU on TaskManagers; tune the level compaction sizes (for example state.backend.rocksdb.compaction.level.max-size-level-base) and the background thread count (state.backend.rocksdb.thread.num) to prevent compaction stalls from delaying checkpoints
Restoring from a savepoint (not a checkpoint) always performs a full state transfer regardless of the incremental setting; savepoints in the default format are full, self-contained snapshots rather than incremental ones, and can be very large for jobs with large state
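For the compaction gotcha above, a tuning fragment might look like the following. The option keys are existing RocksDB backend options in Flink 1.x, but the values are illustrative assumptions only; suitable numbers depend on disk throughput, core count, and state size, so measure before and after changing them.

```yaml
# More background threads for flushes and compactions (illustrative value)
state.backend.rocksdb.thread.num: 4

# Larger base level size means fewer, larger compactions (illustrative value)
state.backend.rocksdb.compaction.level.max-size-level-base: 320mb

# Let RocksDB size the levels dynamically based on the data volume
state.backend.rocksdb.compaction.level.use-dynamic-size: true
```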