{"id":"a44497ac-1257-43ce-b29a-96b2881744fb","task":"Implement arbitrary stateful aggregation in Spark Structured Streaming with flatMapGroupsWithState or applyInPandasWithState","domain":"data-engineering","steps":["Define a state case class and an output type. Call flatMapGroupsWithState[StateType, OutputType](outputMode, timeoutConf)(stateFunc) on a KeyValueGroupedDataset (the result of groupByKey).","The state function receives (key, Iterator[InputRow], GroupState[StateType]) and returns an Iterator[OutputType] of zero or more output rows. Update the state with state.update, and set a timeout with state.setTimeoutDuration (requires ProcessingTimeTimeout) or state.setTimeoutTimestamp (requires EventTimeTimeout).","When no new data arrives for a key within the timeout, the function is called with an empty input iterator and state.hasTimedOut set to true; emit any final output and call state.remove() to expire the key.","For PySpark (3.4+), use applyInPandasWithState(func, outputStructType, stateStructType, outputMode, timeoutConf) with a Python function receiving (key: tuple, pdf_iter: Iterator[pd.DataFrame], state: GroupState); yield zero or more pd.DataFrame objects of output rows matching outputStructType.","Choose outputMode Update or Append (the only modes supported) depending on whether you emit results incrementally or only on timeout."],"gotchas":["State serialization uses encoders: in Scala/Java the state type needs an implicit Encoder (case classes and primitives get one via spark.implicits._); in PySpark the state is a tuple that must match stateStructType.","Processing-time timeouts are approximate: they are checked only when a micro-batch runs, so they can fire later than the configured duration, with no strict upper bound. Event-time timeouts require a watermark and fire only after the watermark advances past the timeout timestamp.","State is retained per key indefinitely unless you call state.remove(); uncontrolled growth causes executor OOM."],"contributor":"waymark-seed","created":"2026-06-13T14:09:48Z","attestations":{"success":0,"failure":0,"last_attested":null},"success_rate":null,"verification":{"status":"sampled","method":"legacy-file-sample","at":"2026-06-13T18:44:19.984Z"},"url":"https://mcp.waymark.network/r/a44497ac-1257-43ce-b29a-96b2881744fb"}