Deploy the NVIDIA DCGM Exporter DaemonSet to expose GPU metrics (DCGM_FI_DEV_GPU_UTIL) as Prometheus metrics on each GPU node
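Bridge the DCGM metrics to the autoscaler: either install the Prometheus Adapter, which serves them through the Kubernetes custom metrics API for a standard HPA, or use KEDA's prometheus scaler, which queries Prometheus directly (the remaining steps assume KEDA)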
Define a ScaledObject targeting the inference Deployment with a prometheus trigger pointing to the DCGM GPU utilization metric query
Set minReplicaCount, maxReplicaCount, and a target GPU utilization threshold (e.g., 70 for 70%) so that KEDA adds replicas when average utilization rises above the target, rather than waiting until the GPUs are saturated
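Add the annotation cluster-autoscaler.kubernetes.io/safe-to-evict: 'false' to the Deployment's pod template (spec.template.metadata.annotations), not to the Deployment's own metadata, so the Cluster Autoscaler does not evict GPU pods prematurely during node scale-down; note that a node cannot be removed while it hosts a pod carrying this annotation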
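The steps above can be sketched as a KEDA ScaledObject. This is a minimal sketch under stated assumptions, not a definitive configuration: the names inference-server and inference, the Prometheus address, and the replica bounds are hypothetical, and the query assumes dcgm-exporter attaches namespace and pod labels (depending on the scrape configuration they may appear as exported_namespace and exported_pod). Because the query returns an average, the trigger sets metricType: Value; with the default AverageValue the HPA divides the query result by the threshold, which suits a sum() query rather than an avg() one.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-gpu-scaler        # hypothetical name
  namespace: inference              # hypothetical namespace
spec:
  scaleTargetRef:
    name: inference-server          # hypothetical Deployment name
  minReplicaCount: 1
  maxReplicaCount: 8                # assumption: size to your GPU quota
  pollingInterval: 30               # seconds between Prometheus queries
  triggers:
    - type: prometheus
      metricType: Value             # compare the averaged value directly against the threshold
      metadata:
        # assumption: in-cluster Prometheus service address
        serverAddress: http://prometheus-server.monitoring.svc:9090
        # average utilization across the GPUs used by the target pods
        query: avg(DCGM_FI_DEV_GPU_UTIL{namespace="inference", pod=~"inference-server-.*"})
        threshold: "70"             # target GPU utilization in percent
```

The safe-to-evict annotation belongs on the pod template. A fragment of the same hypothetical Deployment, showing only the relevant fields:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server            # hypothetical Deployment name
  namespace: inference
spec:
  template:
    metadata:
      annotations:
        # read by the Cluster Autoscaler from the pod, not from the Deployment
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```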
Known gotchas
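GPU node scale-down is slow: node pool scale-down has a cooldown period (typically 10 minutes, which is the default for the Cluster Autoscaler's scale-down-unneeded-time) and GPU nodes are expensive to keep idle, so tune KEDA's scale-down timing accordingly. Note that cooldownPeriod only governs scaling to zero; scale-down between maxReplicaCount and one replica follows the scale-down behavior of the HPA that KEDA manages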
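DCGM_FI_DEV_GPU_UTIL reports per-GPU utilization, not per-pod: in a multi-tenant cluster where multiple pods share a node, you need to aggregate over the GPUs assigned to the target pods (dcgm-exporter can attach pod and namespace labels for this when its Kubernetes mode is enabled) or use pod-level GPU metrics from the device plugin instead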
KEDA's prometheus scaler requires the Prometheus server to be reachable from the KEDA operator pod; network policy misconfiguration is a common cause of scaler failures that manifest as replicas stuck at minReplicaCount
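Two of the gotchas above lend themselves to short configuration sketches. For scale-down timing, a ScaledObject can carry both a cooldownPeriod and an HPA scale-down behavior. The fragment below shows only the relevant spec fields; the values are illustrative assumptions chosen to roughly match a 10-minute node cooldown, not recommendations.

```yaml
spec:
  cooldownPeriod: 600                    # seconds to wait before scaling to zero (only relevant when minReplicaCount is 0)
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 600  # act on the highest recommendation of the last 10 minutes
          policies:
            - type: Pods
              value: 1                     # remove at most one replica...
              periodSeconds: 120           # ...every two minutes
```

For reachability, a NetworkPolicy along the following lines would admit the KEDA operator to Prometheus. The namespaces, pod labels, and port are assumptions about a typical Helm-based install and should be checked against the labels actually present in the cluster.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-keda-to-prometheus         # hypothetical name
  namespace: monitoring                  # assumption: namespace where Prometheus runs
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus # assumption: label on the Prometheus pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: keda  # assumption: KEDA installed in the keda namespace
          podSelector:
            matchLabels:
              app: keda-operator         # assumption: label on the KEDA operator pod
      ports:
        - protocol: TCP
          port: 9090                     # assumption: Prometheus listening port
```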