Steps

Load or log both models as MLflow pyfunc flavors so evaluate() can call predict() uniformly
Prepare a pandas DataFrame or mlflow.data.Dataset with features and a targets column
Call mlflow.evaluate(model=model_uri, data=eval_data, targets='label', model_type='classifier') for each candidate inside a parent run
Access per-model EvaluationResult.metrics dict and compare accuracy, F1, and custom metrics defined via mlflow.models.make_metric()
Log the comparison artifact with mlflow.log_artifact(), register the winner with mlflow.register_model(), and point an alias at the new version using client.set_registered_model_alias()
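
The steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes an MLflow 2.x-style API, that the candidates are addressable by model URIs, and that the default classifier evaluator reports a metric named f1_score. The helper names (evaluate_candidates, pick_winner, promote), the run names, and the "champion" alias are hypothetical choices made for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Mapping


def evaluate_candidates(model_uris: Mapping[str, str], eval_data, targets: str = "label"):
    """Evaluate each candidate in a nested run under one parent run."""
    import mlflow  # imported lazily so pick_winner() works without MLflow installed

    metrics_by_model: Dict[str, Dict[str, float]] = {}
    with mlflow.start_run(run_name="candidate-comparison"):  # hypothetical run name
        for name, uri in model_uris.items():
            with mlflow.start_run(run_name=name, nested=True):
                result = mlflow.evaluate(
                    model=uri,
                    data=eval_data,
                    targets=targets,
                    model_type="classifier",
                )
                metrics_by_model[name] = dict(result.metrics)

        # Write the side-by-side metrics to a file and attach it to the parent run.
        with tempfile.TemporaryDirectory() as tmp:
            path = Path(tmp) / "comparison.json"
            path.write_text(json.dumps(metrics_by_model, indent=2, default=float))
            mlflow.log_artifact(str(path))
    return metrics_by_model


def pick_winner(metrics_by_model: Mapping[str, Mapping[str, float]], key: str = "f1_score") -> str:
    """Return the name of the model with the highest value for `key`."""
    scored = {name: m[key] for name, m in metrics_by_model.items() if key in m}
    if not scored:
        raise KeyError(f"no candidate reported metric {key!r}")
    return max(scored, key=scored.get)


def promote(model_uri: str, registered_name: str, alias: str = "champion") -> None:
    """Register the winning model and point an alias at the new version."""
    import mlflow
    from mlflow import MlflowClient

    version = mlflow.register_model(model_uri, registered_name)
    MlflowClient().set_registered_model_alias(registered_name, alias, version.version)
```

A caller would pass something like {"model_a": uri_a, "model_b": uri_b} to evaluate_candidates(), feed the returned dict to pick_winner(), and hand the winning URI to promote(). Keeping the comparison in a plain function makes it easy to switch the deciding metric.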

Known gotchas

mlflow.evaluate() chooses its built-in metrics from model_type, so model_type must match the task: using 'regressor' for a classifier silently skips classification metrics
Custom metrics defined with make_metric() should return a MetricValue with aggregate_results set; whether a plain float is also accepted depends on the MLflow version, so check the make_metric() documentation for the installed release
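For LLM judge metrics, credentials for the judge model must be configured before calling evaluate(), for example by setting OPENAI_API_KEY for an OpenAI judge; mlflow.openai.autolog() enables logging of OpenAI calls and does not by itself supply credentials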
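
As an illustration of the custom-metric gotcha, here is a minimal sketch of a metric whose eval_fn returns a MetricValue with aggregate_results. The metric itself (false negative rate), the helper names, and the "rate" key are hypothetical; the eval_fn signature and the extra_metrics argument mentioned afterwards are assumed from MLflow 2.x and may differ in other versions.

```python
from typing import Sequence


def false_negative_rate(predictions: Sequence[int], targets: Sequence[int],
                        positive_label: int = 1) -> float:
    """Share of truly positive rows that the model failed to predict as positive."""
    preds_for_positives = [p for p, t in zip(predictions, targets) if t == positive_label]
    if not preds_for_positives:
        return 0.0  # no positive rows, so nothing can be missed
    missed = sum(1 for p in preds_for_positives if p != positive_label)
    return missed / len(preds_for_positives)


def build_fnr_metric():
    """Wrap false_negative_rate() as an MLflow evaluation metric."""
    # Imported lazily so the pure-Python helper above works without MLflow installed.
    from mlflow.metrics import MetricValue
    from mlflow.models import make_metric

    def eval_fn(predictions, targets, metrics):
        score = false_negative_rate(list(predictions), list(targets))
        # Return a MetricValue with aggregate_results; the "rate" key is an arbitrary choice.
        return MetricValue(aggregate_results={"rate": score})

    return make_metric(
        eval_fn=eval_fn,
        greater_is_better=False,
        name="false_negative_rate",
    )
```

Under these assumptions the metric would be passed to the evaluation call as extra_metrics=[build_fnr_metric()], alongside model_type='classifier'.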

MLflow model registry: register a model and transition stage

mlflow.org/docs · 6 steps · unrated

Manage model versions with MLflow registry aliases (post-stages)

mlflow.org · 6 steps · unrated

Run MLflow evaluate() to compare two candidate models on a shared validation dataset
