Deploy the production model version to a Vertex AI Endpoint using the gcloud ai endpoints deploy-model command or the SDK, setting an initial traffic split of 100% to the production deployment
Deploy the new candidate model version to the same endpoint, specifying a traffic split that allocates the desired canary percentage to the new deployment and the remainder to the production deployment
Confirm that the traffic split percentages across all deployed models on the endpoint sum to exactly 100; Vertex AI rejects any split that does not
Send prediction requests to the endpoint URL; Vertex AI routes each request to one of the deployed models according to the traffic split percentages
Monitor prediction latency, error rates, and business metrics for each deployment ID using Cloud Monitoring to compare canary versus production performance
Promote the canary by updating the endpoint traffic split to 100% for the new deployment, then undeploy the old deployment to release its compute resources
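The deploy and promote steps above can be sketched with the Vertex AI Python SDK. This is a minimal sketch, not a definitive implementation: it assumes the google-cloud-aiplatform package is installed, that aiplatform.init() has already been called with your project and region, and that exactly one production deployment exists on the endpoint. The helper names (canary_split, promotion_split, deploy_canary, promote_canary) and the machine type are illustrative assumptions. In the deploy request, the key "0" stands for the model being deployed, because its deployed model ID does not exist yet.

```python
from typing import Dict


def canary_split(production_id: str, canary_percent: int) -> Dict[str, int]:
    """Build the split for adding a canary next to one production deployment.

    "0" is the placeholder key for the model being deployed.
    """
    if not 0 < canary_percent < 100:
        raise ValueError("canary_percent must be between 1 and 99")
    return {"0": canary_percent, production_id: 100 - canary_percent}


def promotion_split(canary_id: str) -> Dict[str, int]:
    """Build the split that sends all traffic to the promoted canary."""
    return {canary_id: 100}


def deploy_canary(endpoint_name: str, model_name: str,
                  production_id: str, canary_percent: int = 10) -> None:
    """Deploy the candidate with a canary share (requires GCP credentials)."""
    from google.cloud import aiplatform  # lazy import: optional dependency

    endpoint = aiplatform.Endpoint(endpoint_name)
    endpoint.deploy(
        model=aiplatform.Model(model_name),
        traffic_split=canary_split(production_id, canary_percent),
        machine_type="n1-standard-4",  # assumption: choose your own
        min_replica_count=1,
    )


def promote_canary(endpoint_name: str, canary_id: str,
                   production_id: str) -> None:
    """Route 100% to the canary, then undeploy the old deployment."""
    from google.cloud import aiplatform  # lazy import: optional dependency

    endpoint = aiplatform.Endpoint(endpoint_name)
    # Assumption: your installed SDK version supports traffic_split on update().
    endpoint.update(traffic_split=promotion_split(canary_id))
    endpoint.undeploy(deployed_model_id=production_id)
```

For example, canary_split("123", 10) yields {"0": 10, "123": 90}, where "123" is a hypothetical production deployed model ID. Keeping the split arithmetic in small pure functions makes it easy to check before any request reaches the endpoint.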
Known gotchas
Traffic split percentages must sum to exactly 100 across all deployments on the endpoint; a split supplied when adding a new deployment must also restate the reduced shares of the existing deployments, otherwise the request fails validation
Changing the traffic split requires an endpoint update operation that is not instantaneous; during the update window some requests may still be routed by the old split configuration
Each deployed model on an endpoint consumes dedicated compute resources even when receiving 0% traffic after a traffic split update — explicitly undeploy unused model versions to stop incurring costs
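Two of the gotchas above, the sum-to-100 rule and idle deployments that keep billing, can be checked locally before calling the API. The following is a small sketch under stated assumptions: validate_split and idle_deployments are hypothetical helper names, and the check is stricter than the service in one respect, since it treats an empty split as invalid.

```python
from typing import Dict, Iterable, List


def validate_split(split: Dict[str, int]) -> None:
    """Raise ValueError unless every share is an int in 0..100 and they total 100."""
    for deployed_id, percent in split.items():
        if not isinstance(percent, int) or not 0 <= percent <= 100:
            raise ValueError(
                f"invalid percentage for {deployed_id}: {percent!r}")
    total = sum(split.values())
    if total != 100:
        raise ValueError(f"traffic split sums to {total}, expected 100")


def idle_deployments(deployed_ids: Iterable[str],
                     split: Dict[str, int]) -> List[str]:
    """Return deployed model IDs that receive no traffic under the split.

    An ID missing from the split counts as 0%.
    """
    return [d for d in deployed_ids if split.get(d, 0) == 0]
```

With the Python SDK, the inputs would typically come from endpoint.traffic_split and from the id of each entry returned by endpoint.list_models(); the IDs returned by idle_deployments are the candidates to pass to endpoint.undeploy.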