Choose a deduplication key (e.g., event_id, idempotency_key) that uniquely identifies a logical event.
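In Flink SQL, compute ROW_NUMBER() OVER (PARTITION BY <dedup key> ORDER BY <time attribute> ASC) AS row_num in a subquery, then filter WHERE row_num = 1 in the outer query or a downstream view. Ordering by a time attribute such as event_time in ascending order keeps the first row per key.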
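In the Flink DataStream API, use a KeyedProcessFunction keyed on the dedup key; store a flag in ValueState<Boolean>, emit an event only when the flag is not yet set, and expire the flag once the dedup window has passed, either with a registered timer that clears it or with state TTL.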
Configure state TTL via StateTtlConfig.newBuilder(Time.hours(<n>)).setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite).build() and attach it to the state descriptor with enableTimeToLive(...), so that state for a key expires <n> hours after its last write.
In Kafka Streams, use a persistent KeyValueStore to track seen IDs along with the time each was stored, and expire old entries with a scheduled punctuator that deletes them or with TTL tombstones.
Test dedup effectiveness by replaying duplicate events and verifying exactly one output per logical event.
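The keyed-state pattern and the replay test from the steps above can be sketched without any streaming framework. The following is a minimal, dependency-free model in plain Java, not Flink or Kafka Streams code. The class TtlDeduplicator and its methods are hypothetical names introduced for illustration, and the model assumes the TTL clock starts when a key is stored and is not refreshed by duplicates.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of keyed dedup with a TTL (illustration only). */
final class TtlDeduplicator {
    private final long ttlMillis;
    // dedup key -> time at which the key was stored
    private final Map<String, Long> seenAt = new HashMap<>();

    TtlDeduplicator(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Returns true if the event should be emitted, false if it is a duplicate. */
    boolean firstSeen(String dedupKey, long nowMillis) {
        Long storedAt = seenAt.get(dedupKey);
        if (storedAt != null && nowMillis - storedAt < ttlMillis) {
            return false; // still inside the dedup window: treat as duplicate
        }
        seenAt.put(dedupKey, nowMillis); // new key, or the old entry has expired
        return true;
    }

    /** Periodic sweep, playing the role of a timer or punctuator. */
    int purgeExpired(long nowMillis) {
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = seenAt.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() >= ttlMillis) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    int trackedKeys() {
        return seenAt.size();
    }

    /** Replays (key, timestamp) pairs in order and returns the keys that were emitted. */
    static List<String> replay(TtlDeduplicator dedup, String[] keys, long[] timestamps) {
        List<String> emitted = new ArrayList<>();
        for (int i = 0; i < keys.length; i++) {
            if (dedup.firstSeen(keys[i], timestamps[i])) {
                emitted.add(keys[i]);
            }
        }
        return emitted;
    }
}
```

Replaying keys a, b, a, a at timestamps 0, 10, 500, 1500 ms with a 10,000 ms TTL emits a and b once each. With a 1,000 ms TTL the last a is emitted again, because it arrives after the stored entry has expired.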
Known gotchas
State TTL must be longer than the maximum expected duplicate arrival window; setting it too short causes deduplication to fail for late duplicates.
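ROW_NUMBER dedup in Flink SQL keeps state for every key it has seen, so on an unbounded stream that state grows indefinitely unless a state retention time (for example table.exec.state.ttl) is configured; for purely streaming unbounded dedup that needs explicit control over expiry, the DataStream KeyedProcessFunction approach with explicit TTL is more reliable.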
Dedup state size scales with the number of unique keys seen within the TTL window; profile state store memory usage under peak cardinality before deploying.
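A rough upper bound on dedup state can be worked out before profiling: live keys are at most the peak rate of unique keys multiplied by the TTL, and state size is that count multiplied by the bytes stored per entry. The sketch below does only this arithmetic. The class name and the figures in the usage note are hypothetical, and real per-entry overhead depends on the state backend and serialization, so treat the result as a starting estimate rather than a measurement.

```java
/** Hypothetical back-of-the-envelope estimator for dedup state size. */
final class DedupStateEstimate {
    /** Upper bound on live keys: every unique key in the TTL window is retained. */
    static long maxLiveKeys(long peakUniqueKeysPerSecond, long ttlSeconds) {
        return Math.multiplyExact(peakUniqueKeysPerSecond, ttlSeconds);
    }

    /** Estimated bytes, given an assumed average size per stored entry. */
    static long estimatedBytes(long liveKeys, long bytesPerEntry) {
        return Math.multiplyExact(liveKeys, bytesPerEntry);
    }
}
```

For example, an assumed peak of 5,000 unique keys per second with a one-hour TTL (3,600 s) gives up to 18,000,000 live keys; at an assumed 100 bytes per entry that is about 1.8 GB of state.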