Assign an event-time column to the streaming DataFrame and call withWatermark(eventTimeColumn, delayThreshold) before any aggregation or join; the threshold specifies how late data can arrive and still be included in a window.
Apply a windowed aggregation (window(), groupBy(), agg()) after the watermark declaration; Spark uses the watermark to determine when a window is final and can be emitted and removed from state.
Set output mode to append to emit only finalized windows (watermark guarantees no more updates), or update to emit partial results as they arrive; complete mode materializes the full result table and is memory-intensive for long-running streams.
Monitor watermark progress using the StreamingQuery.lastProgress or the Spark UI Structured Streaming tab to confirm the watermark is advancing; a stalled watermark (source data stop or skewed partitions) causes unbounded state growth.
For stream-stream joins, declare watermarks on both sides; Spark uses the minimum watermark across both streams to determine when matched pairs can be emitted and unmatched state can be dropped.
Tune spark.sql.streaming.statefulOperator.checkCorrectness.enabled and state store provider (RocksDB state store for large state) to balance memory use and recovery speed.
Known gotchas
Watermark is based on the maximum event time seen minus the delay threshold; if a single partition has no new data, the watermark does not advance for that partition, blocking state cleanup globally.
Late data arriving after the watermark threshold is silently dropped in append mode with no error; instrument a dead-letter or metrics counter to detect and monitor drop rates.
Changing the watermark threshold or window size requires a full restart of the streaming query from scratch (or a state migration) because the existing state store uses the old schema.
Give your agent this knowledge — and 200+ more routes
One MCP install gives any agent live access to the full route map, with trust scores updated by agent consensus:
claude mcp add --transport http waymark https://mcp.waymark.network/mcp