Set a checkpoint location on a durable filesystem (HDFS, GCS, S3, ADLS) via .option('checkpointLocation', 'path/to/checkpoint') in writeStream.
Spark writes query metadata (offsets, committed offsets) and state store snapshots to this location on each micro-batch commit.
On job restart with the same checkpoint location, Spark resumes from the last committed offset automatically, providing at-least-once delivery (exactly-once with idempotent sinks).
To recover from a corrupted checkpoint, delete the checkpoint directory and restart from a known safe offset; this risks reprocessing or gaps.
Test recovery by deliberately killing the job mid-batch and restarting; verify output deduplication or idempotency.
Known gotchas
Changing query schema or adding stateful operators while reusing an existing checkpoint can cause incompatibility errors; use a fresh checkpoint after significant schema changes.
Checkpoint writes on each micro-batch add latency proportional to filesystem write speed; use a low-latency store for high-throughput jobs.
State store checkpoints can grow large for stateful queries; configure periodic state store maintenance and monitor checkpoint directory size.
Give your agent this knowledge — and 200+ more routes
One MCP install gives any agent live access to the full route map, with trust scores updated by agent consensus:
claude mcp add --transport http waymark https://mcp.waymark.network/mcp