Steps

Call withWatermark on each streaming DataFrame, naming that side's event-time column and a lateness threshold (e.g., left.withWatermark('event_time', '2 hours')).
Perform the join using standard join syntax: left.join(right, joinCondition, joinType).
Add an event-time range constraint to the join condition, alongside any key equality (e.g., right.event_time.between(left.event_time - expr('INTERVAL 1 HOUR'), left.event_time + expr('INTERVAL 1 HOUR'))), so Spark knows how far apart in event time two rows can be and still match.
Spark uses the watermarks and time range to determine when it is safe to expire state for rows that can no longer find a match.
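Choose the join type: inner, left outer, and right outer joins are supported; full outer and left semi joins were added in Spark 3.1 and, like the other outer joins, depend on the watermark and time constraint, so verify support in your Spark version against current docs.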
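The steps above can be sketched in PySpark roughly as follows. This is a minimal illustration, not a definitive implementation: the column names (ad_id, impression_time, click_time), the aliases l and r, the watermark delays, and both helper functions are assumptions made for the example. The time range is written with explicit >= and <= comparisons, which express the same bound as the between call in the step above.

```python
# Sketch only: column names, aliases, and delay thresholds are hypothetical.

def time_bounded_condition(key, left_time, right_time, bound="1 HOUR"):
    """Build a SQL join condition: key equality plus a symmetric event-time range."""
    return (
        f"l.{key} = r.{key} "
        f"AND r.{right_time} >= l.{left_time} - INTERVAL {bound} "
        f"AND r.{right_time} <= l.{left_time} + INTERVAL {bound}"
    )


def build_join(left, right, how="inner"):
    """Join two streaming DataFrames assumed to carry (ad_id, impression_time)
    on the left and (ad_id, click_time) on the right."""
    # Imported lazily so the string helper above can be used without Spark installed.
    from pyspark.sql.functions import expr

    # Step 1: watermark each side on its own event-time column.
    l = left.withWatermark("impression_time", "2 hours").alias("l")
    r = right.withWatermark("click_time", "3 hours").alias("r")

    # Steps 2-3: standard join syntax, with the time range inside the condition.
    condition = expr(time_bounded_condition("ad_id", "impression_time", "click_time"))
    return l.join(r, condition, how)
```

Given two streaming DataFrames with those columns, build_join(impressions, clicks, "left_outer") would return the joined stream; the input names here are likewise placeholders.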

Known gotchas

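Without a time range constraint, Spark cannot tell when a buffered row can no longer find a match, so it cannot bound state size and the join state grows without bound.
For outer joins, the watermark must advance sufficiently for Spark to emit null-padded rows for unmatched records; this introduces output latency, and the rows can be held back further if one input stops receiving data and the watermark stalls.
The watermark on each side must be set independently; by default the effective global watermark is the minimum of the two, which can slow state cleanup if one stream lags. The spark.sql.streaming.multipleWatermarkPolicy setting can switch this to max, at the cost of dropping data from the slower stream.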
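To make the lagging-stream gotcha concrete, here is a small pure-Python model of how a combined watermark decides whether a buffered left-side row can be dropped. It is a simplified illustration under stated assumptions, not Spark's actual implementation: the function names, timestamps, 10-minute delay, and 1-hour bound are all made up for the example.

```python
from datetime import datetime, timedelta


def stream_watermark(max_event_time, delay):
    """Per-stream watermark: latest event time seen minus the allowed lateness."""
    return max_event_time - delay


def global_watermark(watermarks, policy="min"):
    """Combine per-stream watermarks; 'min' mirrors Spark's default policy."""
    return min(watermarks) if policy == "min" else max(watermarks)


def left_row_expired(left_event_time, upper_bound, watermark):
    """Simplified rule: a buffered left row can be dropped once the watermark
    has passed the latest right-side event time that could still match it."""
    return watermark > left_event_time + upper_bound


delay = timedelta(minutes=10)
# The left stream is current; the right stream lags by three hours.
left_wm = stream_watermark(datetime(2024, 1, 1, 12, 0), delay)
right_wm = stream_watermark(datetime(2024, 1, 1, 9, 0), delay)

row_time = datetime(2024, 1, 1, 9, 0)  # event time of a buffered left row
bound = timedelta(hours=1)             # join allows matches up to 1 hour later

wm_min = global_watermark([left_wm, right_wm], "min")
wm_max = global_watermark([left_wm, right_wm], "max")

print(left_row_expired(row_time, bound, wm_min))  # False: lagging stream holds state
print(left_row_expired(row_time, bound, wm_max))  # True: state freed sooner
```

Under the min policy the lagging stream keeps the row in state; under max the row is released earlier, but late rows from the slower stream would then be treated as too late to join.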

Implement stream-stream join with watermark in Spark Structured Streaming
