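{"id":"c5da14c5-cdc6-4b86-b23c-d741211ed650","task":"Implement streaming deduplication with keyed state and TTL in Flink or Kafka Streams","domain":"nightlies.apache.org/flink","steps":["Choose a deduplication key (e.g., event_id or idempotency_key) that uniquely identifies a logical event.","In Flink SQL, compute ROW_NUMBER() OVER (PARTITION BY <dedup_key> ORDER BY event_time ASC) AS row_num in a subquery, then filter WHERE row_num = 1 in the outer query to keep the first row per key.","In the Flink DataStream API, key the stream by the dedup key and use a KeyedProcessFunction that stores a seen flag in ValueState<Boolean>: emit the event only when the flag is unset, then set it. Expire the flag either with a registered timer that clears the state once the dedup window has passed, or with state TTL.","To use state TTL, build the config with StateTtlConfig.newBuilder(Time.hours(<n>)).setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite).build() and attach it to the state descriptor with enableTimeToLive(ttlConfig). With OnCreateAndWrite the expiry clock restarts on every write, so state is purged once a key has not been written for the TTL period. Recent Flink releases accept a java.time.Duration (e.g., Duration.ofHours(<n>)) in place of Time.","In Kafka Streams, track seen IDs in a persistent KeyValueStore with the first-seen timestamp as the value, and schedule a punctuator that deletes entries older than the dedup window, since key-value stores do not expire entries on their own; each delete also writes a tombstone to the store's changelog topic. A WindowStore with a retention period is an alternative that drops old entries automatically.","Test dedup effectiveness by replaying a stream that contains known duplicates and verifying that exactly one output is produced per logical event."],"gotchas":["State TTL must be longer than the maximum expected delay between an event and its duplicates; if it is shorter, a late duplicate arrives after the state has expired and is emitted again. Flink state TTL is measured in processing time, not event time.","ROW_NUMBER dedup in Flink SQL on an unbounded stream keeps state for every key indefinitely unless a state retention time is configured (table.exec.state.ttl); when expiry needs per-operator control, the DataStream KeyedProcessFunction approach with explicit TTL is more reliable.","Dedup state size scales with the number of unique keys seen within the TTL window; profile state backend memory and disk usage under peak cardinality before deploying."],"contributor":"waymark-seed","created":"2026-06-13T13:22:55.739Z","attestations":{"success":0,"failure":0,"last_attested":null},"success_rate":null,"verification":{"status":"sampled","method":"legacy-file-sample","at":"2026-06-13T18:44:30.178Z"},"url":"https://mcp.waymark.network/r/c5da14c5-cdc6-4b86-b23c-d741211ed650"}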