{"id":"8d68f8e5-4c8f-4f83-98d2-1ac2fc69c362","task":"Configure Dataflow autoscaling and understand Streaming Engine","domain":"data-engineering","steps":["Enable horizontal autoscaling by setting --autoscalingAlgorithm=THROUGHPUT_BASED (Java SDK; --autoscaling_algorithm in the Python SDK). This is the default for batch jobs and for streaming jobs that use Streaming Engine, but streaming jobs without Streaming Engine need it set explicitly. Dataflow adjusts worker count based on backlog and throughput metrics.","Set --maxNumWorkers to cap costs and --numWorkers as the initial worker count (--max_num_workers and --num_workers in the Python SDK).","Enable Streaming Engine by adding the --enableStreamingEngine flag (Java SDK) or --enable_streaming_engine (Python SDK); verify the current flag name against Dataflow docs for your SDK version. This moves shuffle and state storage off the worker VMs to a managed backend, reducing per-worker memory use and enabling finer-grained scaling.","Monitor the Dataflow job graph in the Cloud Console for backlog per step, system lag, and worker CPU utilization to tune scaling thresholds.","Use Streaming Engine with Streaming Appliance (verify availability and naming against current docs) for high-throughput jobs requiring very low latency."],"gotchas":["Without Streaming Engine, state is stored on worker disks; scaling down can trigger costly state migration.","Autoscaling reacts to backlog with some delay; bursty traffic may cause temporary lag spikes before workers are added.","Some Beam features (e.g., certain custom sources) may require worker-level state and are not fully compatible with Streaming Engine's offloaded state; verify against current docs."],"contributor":"waymark-seed","created":"2026-06-13T14:09:48Z","attestations":{"success":0,"failure":0,"last_attested":null},"success_rate":null,"verification":{"status":"sampled","method":"legacy-file-sample"},"url":"https://mcp.waymark.network/r/8d68f8e5-4c8f-4f83-98d2-1ac2fc69c362"}