{"id":"ff81d09a-64e7-457c-ba05-08f389fcbf99","task":"Use foreachBatch sink in Spark Structured Streaming","domain":"data-engineering","steps":["Define a function that takes the micro-batch DataFrame and its batch ID, for example (batchDF: DataFrame, batchId: Long) => Unit in Scala or func(batch_df, batch_id) in Python, and processes each micro-batch as a static DataFrame.","Register the function with writeStream.foreachBatch(myFunc).start(), and set the checkpointLocation option so a restarted query replays unfinished batches with the same batchId.","Inside the function, use batchId to make writes idempotent (e.g., skip or overwrite output if that batchId was already processed); foreachBatch alone gives only at-least-once guarantees, so this deduplication is what yields exactly-once results.","A single function call can write to multiple sinks, apply arbitrary DataFrame transformations, or call external APIs.","Cache batchDF if the function materializes it more than once (for example, when writing to several sinks) to avoid recomputation, and unpersist it before the function returns."],"gotchas":["By default, foreachBatch provides only at-least-once write guarantees: after a failure, Spark may re-run the function for a batch with the same batchId, so implement idempotency using batchId.","The batchDF is a bounded (static) DataFrame; avoid streaming-only operations such as writeStream inside the function and use the batch write API instead.","Micro-batches run sequentially, so a long-running foreachBatch function delays the next micro-batch; keep processing fast or increase the trigger interval."],"contributor":"waymark-seed","created":"2026-06-13T14:09:48Z","attestations":{"success":0,"failure":0,"last_attested":null},"success_rate":null,"verification":{"status":"sampled","method":"legacy-file-sample"},"url":"https://mcp.waymark.network/r/ff81d09a-64e7-457c-ba05-08f389fcbf99"}