Register the model in Unity Catalog via mlflow.register_model(model_uri, 'catalog.schema.model_name') or the MLflow UI
In the Databricks UI, navigate to Serving, click Create serving endpoint, then select the Unity Catalog registered model and the desired model version
Configure compute: choose a CPU or GPU instance size and set the scale-to-zero option if intermittent traffic is expected
Click Create. The endpoint moves from a pending state to Ready, which can take several minutes
Query the endpoint via its REST URL, passing a Databricks personal access token in the Authorization header (Authorization: Bearer <token>): POST https://<workspace-url>/serving-endpoints/<endpoint-name>/invocations with a JSON payload in the dataframe_records or dataframe_split format
Monitor latency and throughput in the Serving tab and set up alerts via Databricks Lakehouse Monitoring or CloudWatch if on AWS
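Steps 1 and 5 above can be sketched in Python. This is a minimal illustration under stated assumptions, not the only way to do it: the function names are hypothetical, the workspace URL, endpoint name, and token are placeholders you supply, and the `mlflow.set_registry_uri("databricks-uc")` call assumes the registry is not already pointed at Unity Catalog (newer MLflow versions on Databricks may default to it). The request helper uses only the standard library and builds the request without sending it, so it can be inspected first.

```python
import json
import urllib.request


def register_in_unity_catalog(model_uri, full_name):
    """Step 1: register a logged model under a catalog.schema.model_name name."""
    # Imported lazily so the request helpers below work without MLflow installed.
    import mlflow

    # Assumption: the registry must be pointed at Unity Catalog explicitly.
    mlflow.set_registry_uri("databricks-uc")
    return mlflow.register_model(model_uri, full_name)


def build_invocation_request(workspace_url, endpoint_name, token, records):
    """Step 5: build (but do not send) the POST request for the endpoint."""
    url = f"{workspace_url.rstrip('/')}/serving-endpoints/{endpoint_name}/invocations"
    # Wrap the rows in the dataframe_records key expected by the endpoint.
    body = json.dumps({"dataframe_records": records}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def query_endpoint(request, timeout=120.0):
    """Send the prepared request and decode the JSON response body."""
    # A generous timeout leaves room for a cold start on scale-to-zero endpoints.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call would look like `query_endpoint(build_invocation_request(url, name, token, [{"feature_a": 1.0}]))`, where `feature_a` is a made-up column name; the real columns must match the model's input signature.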
Known gotchas
Scale-to-zero endpoints have a cold-start latency of roughly 30–90 seconds on the first request after an idle period; for latency-sensitive applications, disable scale-to-zero on the served model so that at least one replica stays warm
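The serving endpoint expects input in one of the MLflow serving input formats (dataframe_records, dataframe_split, or the TF-Serving-style tensor keys instances and inputs); sending raw JSON without one of these wrapper keys is rejected with a 4xx client error (typically 400 Bad Request)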
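Permissions are checked in two places: the identity that creates the endpoint needs EXECUTE on the Unity Catalog model (plus USE CATALOG and USE SCHEMA on its parents), and the service principal or user making the inference request needs Can Query permission on the serving endpoint. A missing grant results in a 403 even if the endpoint is healthy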
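The payload-format gotcha can be guarded against on the client side. The sketch below is one possible approach, not part of any Databricks or MLflow API: `to_serving_payload` and `WRAPPER_KEYS` are hypothetical names, and the helper assumes the accepted top-level keys are `dataframe_records`, `dataframe_split`, `instances`, and `inputs`. It wraps bare rows before they are sent and fails fast locally instead of waiting for the endpoint to reject the request.

```python
# Assumed set of top-level keys the serving endpoint accepts.
WRAPPER_KEYS = ("dataframe_records", "dataframe_split", "instances", "inputs")


def to_serving_payload(data):
    """Return data wrapped in a serving input key, or raise ValueError."""
    # Already wrapped: pass through unchanged.
    if isinstance(data, dict) and any(key in data for key in WRAPPER_KEYS):
        return data
    # A non-empty list of row dicts maps onto dataframe_records.
    if isinstance(data, list) and data and all(isinstance(row, dict) for row in data):
        return {"dataframe_records": data}
    # A dict with "columns" and "data" maps onto dataframe_split.
    if isinstance(data, dict) and {"columns", "data"} <= data.keys():
        return {"dataframe_split": data}
    raise ValueError(
        "Payload has no serving wrapper key and cannot be wrapped automatically"
    )
```

For example, `to_serving_payload([{"x": 1}])` yields a `dataframe_records` payload, while a bare `{"x": 1}` raises `ValueError` before any network call is made.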