Steps

Use GroupByKey when you need all values for a key available together in a single Iterable for custom logic; output is a PCollection of KV<K, Iterable<V>>.
Use Combine.perKey with a CombineFn when the aggregation is associative and commutative (sum, count, min, max, custom merge); Beam can apply partial combines on each worker before shuffling, reducing data movement.
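Implement CombineFn by overriding createAccumulator (returns an empty accumulator), addInput (folds one value into an accumulator), mergeAccumulators (merges partial accumulators into one), and extractOutput (turns the final accumulator into the result).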
For simple built-in aggregations, prefer Sum.integersPerKey, Mean.perKey, Count.perKey, and similar transforms over a custom CombineFn.
Profile the shuffle step in Dataflow metrics; if GroupByKey produces excessive data movement and the logic after it is an associative, commutative aggregation, refactor it to Combine.perKey.
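
The four CombineFn methods and the partial-combine behaviour described in the steps above can be sketched in plain Java. This is a minimal illustration with no Beam dependency: the class MeanFnSketch, the Accum type, and the combineInBundles helper are hypothetical names, and the "bundles" only imitate how a runner might pre-combine values on each worker before merging. In a real pipeline the same four methods would be overridden on a Combine.CombineFn subclass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch (no Beam dependency) mirroring the four CombineFn methods.
class MeanFnSketch {
    // Accumulator: a running sum and count, which is enough to merge partial results.
    static final class Accum {
        long sum = 0;
        long count = 0;
    }

    // Start from an empty accumulator.
    static Accum createAccumulator() {
        return new Accum();
    }

    // Fold one input value into an accumulator.
    static Accum addInput(Accum acc, int input) {
        acc.sum += input;
        acc.count++;
        return acc;
    }

    // Merge partial accumulators, for example one per worker or bundle.
    static Accum mergeAccumulators(Iterable<Accum> accums) {
        Accum merged = createAccumulator();
        for (Accum a : accums) {
            merged.sum += a.sum;
            merged.count += a.count;
        }
        return merged;
    }

    // Turn the final accumulator into the output value.
    static double extractOutput(Accum acc) {
        return acc.count == 0 ? Double.NaN : (double) acc.sum / acc.count;
    }

    // Imitates partial combining: pre-combine each bundle, then merge the partials.
    static double combineInBundles(List<List<Integer>> bundles) {
        List<Accum> partials = new ArrayList<>();
        for (List<Integer> bundle : bundles) {
            Accum acc = createAccumulator();
            for (int v : bundle) {
                acc = addInput(acc, v);
            }
            partials.add(acc);
        }
        return extractOutput(mergeAccumulators(partials));
    }

    public static void main(String[] args) {
        List<List<Integer>> bundles = Arrays.asList(
            Arrays.asList(1, 2, 3), Arrays.asList(10));
        System.out.println(combineInBundles(bundles)); // prints 4.0
    }
}
```

Only one small accumulator per bundle crosses the merge step, rather than every value, which is where the reduction in shuffled data comes from.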

Known gotchas

GroupByKey sends all values for a key to a single worker; depending on the runner, or if your code copies the Iterable into an in-memory collection, a key with a very large number of values can cause OOM errors.
Combine.perKey partial combining is only correct if the combine operation is truly associative and commutative; an incorrect mergeAccumulators produces wrong results that depend on how the runner happens to split the data.
GroupByKey and Combine both introduce a shuffle boundary; design your pipeline to minimize the number of shuffles.
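
The mergeAccumulators gotcha above can be illustrated with a small plain-Java sketch. It is not Beam code, and the class and method names are hypothetical. It compares a merge that averages per-bundle means (not a valid merge) with one that carries sum and count, over two different splits of the same values 1, 2, 3, 10. Bundles are assumed to be non-empty.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch: why the merge step must not depend on how values are split.
class MergeCorrectnessSketch {
    // Incorrect merge: averages the per-bundle means, so the split changes the answer.
    static double meanOfMeans(List<List<Integer>> bundles) {
        double total = 0;
        for (List<Integer> bundle : bundles) {
            double sum = 0;
            for (int v : bundle) {
                sum += v;
            }
            total += sum / bundle.size();
        }
        return total / bundles.size();
    }

    // Correct merge: add up per-bundle sums and counts, divide once at the end.
    static double meanFromSumAndCount(List<List<Integer>> bundles) {
        long sum = 0;
        long count = 0;
        for (List<Integer> bundle : bundles) {
            for (int v : bundle) {
                sum += v;
                count++;
            }
        }
        return (double) sum / count;
    }

    public static void main(String[] args) {
        // Two ways the same four values might be split across workers.
        List<List<Integer>> splitA = Arrays.asList(Arrays.asList(1, 2, 3), Arrays.asList(10));
        List<List<Integer>> splitB = Arrays.asList(Arrays.asList(1), Arrays.asList(2, 3, 10));
        System.out.println(meanOfMeans(splitA));         // prints 6.0
        System.out.println(meanOfMeans(splitB));         // prints 3.0
        System.out.println(meanFromSumAndCount(splitA)); // prints 4.0
        System.out.println(meanFromSumAndCount(splitB)); // prints 4.0
    }
}
```

A quick check of this kind, running the same inputs through several different splits and comparing the outputs, is one way to test a custom CombineFn before deploying it.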

Choose and use Beam GroupByKey vs Combine.perKey
