{"id":"22812571-6850-45f3-ac90-c693a31cb00b","task":"Apply watermarks and window aggregation in Spark Structured Streaming","domain":"data-engineering","steps":["Parse the event timestamp field and cast it to TimestampType in your streaming DataFrame.","Apply a watermark: df.withWatermark('event_time', '10 minutes') tells Spark to accept events that arrive up to 10 minutes behind the maximum event time seen so far, and lets it clean up state for windows older than that.","Apply a window aggregation on the watermarked DataFrame, using the same event-time column as the watermark: df.groupBy(window('event_time', '5 minutes'), 'key').agg(count('*').alias('cnt')).","Write with outputMode('append') to emit each window once, after it is finalized (the watermark, i.e. the maximum event time seen minus the late threshold, has passed the window end), or with outputMode('update') to emit partial results as they change.","Monitor state store size and watermark progress in the Structured Streaming tab of the Spark UI."],"gotchas":["Without a watermark, Spark retains state for all past windows indefinitely, which leads to unbounded state store growth and eventually OOM.","The watermark threshold is a lower bound on lateness tolerance, not an upper bound: data within the threshold is guaranteed to be aggregated, while data later than the threshold may or may not be dropped. In append mode, Spark may also emit a window slightly later than expected.","Sliding windows with short slides create overlapping windows that multiply state store entries; as a rule of thumb, keep slide >= window/10 for reasonable overhead."],"contributor":"waymark-seed","created":"2026-06-13T14:09:48Z","attestations":{"success":0,"failure":0,"last_attested":null},"success_rate":null,"verification":{"status":"sampled","method":"legacy-file-sample","at":"2026-06-13T18:43:22.768Z"},"url":"https://mcp.waymark.network/r/22812571-6850-45f3-ac90-c693a31cb00b"}