Upload your trained model artifact to GCS and register it with Vertex AI using aiplatform.Model.upload(), specifying the serving container image URI.
Create an endpoint with aiplatform.Endpoint.create(), giving it a display name and the target project and location.
Deploy the model to the endpoint using endpoint.deploy(), specifying the model, machine type, min and max replica counts, and optionally traffic split.
Wait for the deployment to complete (the SDK call is synchronous by default but can take several minutes).
Send a prediction request using endpoint.predict(instances=[...]), where instances is a list of inputs (typically dicts) matching your model's expected schema; the results come back in the response's predictions field.
Undeploy the model and delete the endpoint when finished to stop incurring costs.
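The steps above can be sketched roughly as follows with the google-cloud-aiplatform SDK. This is a minimal sketch, not a definitive implementation: the function names, the display names "demo-model" and "demo-endpoint", and the machine type "n1-standard-2" are illustrative assumptions, and running it requires valid credentials and a billable project.

```python
from typing import Any, Dict, List, Optional


def validate_traffic_split(traffic_split: Dict[str, int]) -> None:
    """Raise ValueError unless the percentages sum to 100 (an empty map is allowed)."""
    if traffic_split and sum(traffic_split.values()) != 100:
        raise ValueError(
            f"traffic split must sum to 100, got {sum(traffic_split.values())}"
        )


def deploy_and_predict(
    project: str,
    location: str,
    artifact_uri: str,
    serving_image_uri: str,
    instances: List[Any],
    traffic_split: Optional[Dict[str, int]] = None,
) -> List[Any]:
    """Hypothetical helper: upload, deploy, predict once, then clean up."""
    # Check the split locally before creating any billable resources.
    if traffic_split is not None:
        validate_traffic_split(traffic_split)

    # Imported lazily so validate_traffic_split works without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)

    # Step 1: register the artifact in GCS together with its serving image.
    model = aiplatform.Model.upload(
        display_name="demo-model",  # assumed name
        artifact_uri=artifact_uri,
        serving_container_image_uri=serving_image_uri,
    )

    # Step 2: create the endpoint in the project/location set by init().
    endpoint = aiplatform.Endpoint.create(display_name="demo-endpoint")

    try:
        deploy_kwargs: Dict[str, Any] = dict(
            model=model,
            machine_type="n1-standard-2",  # assumed machine type
            min_replica_count=1,
            max_replica_count=2,
        )
        if traffic_split is not None:
            # Key "0" refers to the model being deployed in this call.
            deploy_kwargs["traffic_split"] = traffic_split
        else:
            deploy_kwargs["traffic_percentage"] = 100

        # Steps 3 and 4: deploy() blocks until the deployment finishes (sync=True).
        endpoint.deploy(**deploy_kwargs)

        # Step 5: send one online prediction request.
        response = endpoint.predict(instances=instances)
        return list(response.predictions)
    finally:
        # Step 6: stop incurring serving costs.
        endpoint.undeploy_all()
        endpoint.delete()
```

Validating the split before any upload keeps a simple arithmetic mistake from surfacing only after resources have already been created.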
Known gotchas
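A custom serving container must serve predictions on the HTTP route configured for the model (for example /predict, set with serving_container_predict_route in Model.upload()) and answer health checks on its health route; Vertex AI routes requests to the prediction route and expects a JSON response in the {"predictions": [...]} format.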
Traffic split values across all deployed models on an endpoint must sum to 100; mismatched splits cause a validation error.
Quotas for online prediction QPS and node hours are enforced at the project level; once a quota is exhausted, requests fail with resource exhaustion errors instead of triggering further scaling.
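To make the container contract from the first gotcha concrete, here is a minimal sketch of a serving process built only on the Python standard library. It reads the AIP_HTTP_PORT, AIP_PREDICT_ROUTE, and AIP_HEALTH_ROUTE environment variables that Vertex AI sets for custom containers; the fallback values, the run_model placeholder, and the make_server helper are assumptions for local testing, not part of any official template.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Vertex AI sets these in custom containers; the fallbacks are for local runs only.
PREDICT_ROUTE = os.environ.get("AIP_PREDICT_ROUTE", "/predict")
HEALTH_ROUTE = os.environ.get("AIP_HEALTH_ROUTE", "/health")
PORT = int(os.environ.get("AIP_HTTP_PORT", "8080"))


def run_model(instances):
    """Placeholder model (hypothetical): returns the length of each instance."""
    return [len(instance) for instance in instances]


class PredictionHandler(BaseHTTPRequestHandler):
    def _send_json(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Health checks must succeed before traffic is routed to the container.
        if self.path == HEALTH_ROUTE:
            self._send_json(200, {"status": "ok"})
        else:
            self._send_json(404, {"error": "not found"})

    def do_POST(self):
        if self.path != PREDICT_ROUTE:
            self._send_json(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")
        # The response wraps results in a top-level "predictions" list.
        self._send_json(
            200, {"predictions": run_model(request.get("instances", []))}
        )

    def log_message(self, format, *args):
        pass  # keep the example quiet; real containers should log requests


def make_server(port=PORT):
    """Build the HTTP server; call serve_forever() on it in the entrypoint."""
    return HTTPServer(("0.0.0.0", port), PredictionHandler)
```

A container entrypoint would call make_server().serve_forever(). Posting {"instances": [...]} to the prediction route locally is a quick way to check the response shape before uploading the image.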