Install KEDA in your Kubernetes cluster and install the NVIDIA DCGM Exporter DaemonSet to expose GPU utilization metrics to Prometheus
Create a KEDA ScaledObject with a Prometheus trigger that points at the Prometheus server address and scales on a DCGM metric query (e.g., avg(DCGM_FI_DEV_GPU_UTIL) to average utilization across GPUs)
Set threshold in the trigger metadata to the GPU utilization percentage above which KEDA should scale out (e.g., 70), and set minReplicaCount and maxReplicaCount to bound scaling
Set minReplicaCount to 0 if you want scale-to-zero during idle periods; KEDA will scale the deployment back up from zero when the query result exceeds the trigger's activationThreshold
Deploy your inference workload as a Kubernetes Deployment or StatefulSet that the ScaledObject targets; confirm each pod declares a GPU resource (e.g., nvidia.com/gpu under resource limits) so the scheduler places pods on GPU nodes
Verify autoscaling behavior by generating inference load, then check KEDA events with kubectl describe scaledobject and watch replica count changes with kubectl get hpa or kubectl get pods -w
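The steps above can be sketched as a small Python script that builds the ScaledObject manifest and prints it as JSON, which kubectl accepts alongside YAML. This is a minimal sketch under stated assumptions, not a definitive configuration: the names gpu-inference-scaler and inference, the Prometheus address, and the numeric values are hypothetical placeholders, and build_scaled_object is a helper invented for illustration.

```python
import json


def build_scaled_object(name, target, prometheus_url, query, threshold,
                        activation_threshold, min_replicas, max_replicas,
                        namespace="default"):
    """Return a KEDA ScaledObject manifest as a plain dict."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # With no kind given, KEDA treats the target as a Deployment.
            "scaleTargetRef": {"name": target},
            "minReplicaCount": min_replicas,
            "maxReplicaCount": max_replicas,
            "triggers": [{
                "type": "prometheus",
                # Trigger metadata values are passed as strings.
                "metadata": {
                    "serverAddress": prometheus_url,
                    "query": query,
                    "threshold": str(threshold),
                    "activationThreshold": str(activation_threshold),
                },
            }],
        },
    }


# All values below are illustrative assumptions, not recommendations.
manifest = build_scaled_object(
    name="gpu-inference-scaler",        # hypothetical ScaledObject name
    target="inference",                 # hypothetical Deployment name
    prometheus_url="http://prometheus.monitoring.svc:9090",  # assumed address
    query="avg(DCGM_FI_DEV_GPU_UTIL)",
    threshold=70,
    activation_threshold=5,
    min_replicas=0,
    max_replicas=8,
)

print(json.dumps(manifest, indent=2))
```

If the script is saved as, say, build_scaledobject.py (a hypothetical filename), its output can be piped to kubectl apply -f - to create the object.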
Known gotchas
KEDA is built with CGO_ENABLED=0 and cannot read GPU metrics via NVML directly; all GPU telemetry must flow through an external exporter such as the DCGM Exporter, so do not plan on NVML-based metrics natively in KEDA
Scale-to-zero with GPU pods incurs longer cold start times than CPU pods because GPU driver initialization and model loading add significant startup overhead; set activation thresholds conservatively
DCGM metrics reflect per-GPU utilization, not per-pod utilization; if multiple inference pods share a node, an aggregated metric may trigger scaling even when the bottleneck is a single saturated pod rather than all pods
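To make the per-GPU aggregation gotcha concrete, here is a small arithmetic sketch of how the choice of aggregation changes what the trigger sees. The utilization readings and GPU names are made-up values, and the two aggregations only mirror what avg(DCGM_FI_DEV_GPU_UTIL) and max(DCGM_FI_DEV_GPU_UTIL) would return in PromQL; it does not model KEDA's replica calculation.

```python
from statistics import mean

THRESHOLD = 70  # same illustrative threshold as in the steps

# Hypothetical per-GPU utilization readings (percent): one saturated GPU,
# two mostly idle ones.
gpu_util = {"gpu-0": 98.0, "gpu-1": 35.0, "gpu-2": 32.0}

avg_util = mean(gpu_util.values())  # what an avg(...) query would report
max_util = max(gpu_util.values())   # what a max(...) query would report

avg_crosses = avg_util > THRESHOLD
max_crosses = max_util > THRESHOLD

print(f"avg={avg_util:.1f} crosses threshold: {avg_crosses}")
print(f"max={max_util:.1f} crosses threshold: {max_crosses}")
```

With these readings the average stays below the threshold while the maximum exceeds it, so an averaged query can mask one saturated GPU and a max-based query can react to a single hot GPU. Which behavior is preferable depends on the workload.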