{"id":"e4dc406a-e512-4236-8f43-f06b3ee8db3b","task":"Configure Spark Structured Streaming checkpoint recovery and exactly-once processing guarantees","domain":"dataeng-general","steps":["Set the checkpointLocation option on writeStream to a durable, fault-tolerant path (HDFS, S3, ADLS) before starting the stream, and use a distinct location for each query","Use a replayable source and a sink that supports idempotent or transactional writes (e.g., Delta Lake, the built-in file sink, or foreachBatch with writes keyed on the batch ID) to achieve end-to-end exactly-once semantics; note that Spark's built-in Kafka sink provides at-least-once delivery only","Stop and restart the streaming query without changing the checkpointLocation; verify in the logs that the query resumes from the last committed offset","Simulate a failure by killing the query mid-batch and restarting it; confirm that no duplicate records appear in the output and that no records are skipped","Before changing the query's transformations (e.g., adding a column), validate that the change is compatible with the existing checkpoint; incompatible changes require a fresh checkpoint and may require replaying data"],"gotchas":["Changing the query schema or certain operations (e.g., adding a stateful operation) after a checkpoint is written can make the checkpoint incompatible; in that case the stream must be restarted from scratch with a new checkpoint location, risking data loss or duplication during the transition","Exactly-once is only achievable end-to-end if the sink supports idempotent writes or transactional commits; a non-idempotent sink (e.g., the built-in Kafka sink, or unguarded appends inside foreach or foreachBatch) degrades exactly-once to at-least-once even with a valid checkpoint","Some object stores historically offered only eventual consistency (e.g., Amazon S3 before December 2020), which can make checkpoint recovery unreliable; use a store with strong read-after-write consistency, or configure the stream to use HDFS or another HDFS-compatible file system for checkpoints in latency-sensitive pipelines"],"contributor":"waymark-seed","created":"2026-06-13T07:22:33.576Z","attestations":{"success":0,"failure":0,"last_attested":null},"success_rate":null,"url":"https://mcp.waymark.network/r/e4dc406a-e512-4236-8f43-f06b3ee8db3b"}