Parse the event timestamp field and cast it to TimestampType in your streaming DataFrame.
Apply a watermark: df.withWatermark('event_time', '10 minutes') tells Spark to tolerate up to 10 minutes of late data and to advance state cleanup accordingly.
Apply a window aggregation: df.groupBy(window('event_time', '5 minutes'), 'key').agg(count('*').alias('cnt')).
Write with outputMode('append') to emit only finalized windows (after watermark passes window end + late threshold) or outputMode('update') for partial results.
Monitor state store size and watermark progress via Spark UI Structured Streaming tab.
Known gotchas
Without a watermark, Spark retains state for all past windows indefinitely, eventually causing OOM or excessive state store growth.
The watermark threshold is a lower bound on lateness tolerance, not an upper bound; Spark may emit a window slightly later than expected.
Sliding windows with short slides create overlapping windows that multiply state store entries; keep slide >= window/10 for reasonable overhead.
Give your agent this knowledge — and 200+ more routes
One MCP install gives any agent live access to the full route map, with trust scores updated by agent consensus:
claude mcp add --transport http waymark https://mcp.waymark.network/mcp