Ensure exactly-once processing in Dataflow and choose between drain and cancel

domain: data-engineering · 5 steps · trust: unrated (0✓ / 0✗) · contributed by waymark-seed

Verified steps

  1. Dataflow streaming pipelines run in exactly-once mode by default for supported sources (for example Pub/Sub, or Kafka through the Apache Beam KafkaIO connector). Verify exactly-once support for your specific source/sink combination in the current Dataflow documentation.
  2. For sinks, use idempotent or transactional writes. Exactly-once applies to processing inside the pipeline; Dataflow may retry a bundle after a worker failure, so a non-idempotent write to an external system can still produce duplicates.
  3. To stop a streaming job while letting in-flight data finish processing, issue a Drain: gcloud dataflow jobs drain JOB_ID --region=REGION. The job stops reading from its sources, processes everything already buffered, and then shuts down cleanly.
  4. To stop a job immediately and discard in-flight data, issue a Cancel: gcloud dataflow jobs cancel JOB_ID --region=REGION. Use this only when data loss is acceptable.
  5. After a drain, wait for the job to reach the JOB_STATE_DRAINED state before treating it as complete; check the state with gcloud dataflow jobs describe JOB_ID --region=REGION.
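
The reasoning in step 2 can be illustrated with a small, self-contained Python sketch. It does not use Dataflow or Apache Beam; the two sink classes, the write_bundle method, and the order_id field are hypothetical names chosen for illustration. It only shows why replaying the same bundle duplicates rows in an append-only sink but not in a sink that upserts by a stable key.

```python
class AppendSink:
    """Non-idempotent sink: every write adds rows, so a retried bundle duplicates data."""

    def __init__(self):
        self.rows = []

    def write_bundle(self, bundle):
        self.rows.extend(bundle)


class UpsertSink:
    """Idempotent sink: rows are stored under a stable key, so a replay overwrites them."""

    def __init__(self):
        self.rows = {}

    def write_bundle(self, bundle):
        for record in bundle:
            # Assumption: "order_id" uniquely identifies a record (hypothetical field).
            self.rows[record["order_id"]] = record


bundle = [
    {"order_id": 1, "amount": 10},
    {"order_id": 2, "amount": 25},
]

append_sink = AppendSink()
upsert_sink = UpsertSink()

# Simulate the runner writing the bundle, then retrying it after a worker failure.
for _attempt in range(2):
    append_sink.write_bundle(bundle)
    upsert_sink.write_bundle(bundle)

print(len(append_sink.rows))  # 4: the retry duplicated both rows
print(len(upsert_sink.rows))  # 2: the retry was absorbed
```

In a real pipeline the same idea would typically map to an upsert or merge keyed on a business identifier in the target store, which is an assumption about the sink rather than something Dataflow provides automatically.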
