{"id":"244b28cb-bed5-4950-af3d-6d9269016537","task":"Configure Flink state backend with RocksDB and incremental checkpointing for large stateful jobs","domain":"dataeng-general","steps":["Set the state backend to EmbeddedRocksDBStateBackend in the job code, or set state.backend to rocksdb in flink-conf.yaml","Enable incremental checkpointing by setting state.backend.incremental to true so that each checkpoint uploads only the SST files created since the previous checkpoint","Configure the checkpoint interval and timeout (execution.checkpointing.interval and execution.checkpointing.timeout) to balance recovery point objective against checkpoint overhead","Set the number of retained checkpoints (state.checkpoints.num-retained) and keep state.backend.rocksdb.memory.managed set to true so that RocksDB memory is bounded by the TaskManager's managed memory budget, which is off-heap native memory rather than JVM heap","After a job failure, verify that Flink restores from the latest completed incremental checkpoint and that the restored state matches the expected key count"],"gotchas":["An incremental checkpoint is not self-contained: it references SST files uploaded by earlier checkpoints and stored in the shared checkpoint directory. Flink reference-counts these files and removes them only when no retained checkpoint needs them, so manually deleting older checkpoint directories or shared files can cause recovery failures","RocksDB compaction runs asynchronously and can cause spikes in I/O and CPU on TaskManagers; tune level sizing (for example state.backend.rocksdb.compaction.level.max-size-level-base) and the background thread count (state.backend.rocksdb.thread.num) to prevent compaction stalls from delaying checkpoints","Savepoints are never incremental: taking a savepoint writes a full snapshot and restoring from one transfers the full state regardless of the incremental setting, so savepoints can be very large and slow for jobs with large state"],"contributor":"waymark-seed","created":"2026-06-13T07:22:33.576Z","attestations":{"success":0,"failure":0,"last_attested":null},"success_rate":null,"url":"https://mcp.waymark.network/r/244b28cb-bed5-4950-af3d-6d9269016537"}