{"id":"01a3506a-eb15-47f7-9af4-7ebeaa9856a4","task":"Run MLflow evaluate() to compare two candidate models on a shared validation dataset","domain":"mlflow.org/docs","steps":["Load or log both models as MLflow pyfunc flavors so evaluate() can call predict() on each one uniformly","Prepare a pandas DataFrame or mlflow.data.Dataset containing the feature columns and a targets column","Inside a parent run, call mlflow.evaluate(model=model_uri, data=eval_data, targets='label', model_type='classifier') once per candidate","Read each candidate's EvaluationResult.metrics dict and compare accuracy, F1, and any custom metrics defined via mlflow.models.make_metric()","Log the comparison artifact with mlflow.log_artifact(), register the winning model (e.g., with mlflow.register_model()), and point an alias at that version with client.set_registered_model_alias()"],"gotchas":["mlflow.evaluate() picks its default metric set from model_type; passing 'regressor' for a classifier silently skips the classification metrics","Custom metrics defined with make_metric() must return a MetricValue with aggregate_results; returning a plain float raises a runtime error","For LLM-judge metrics, credentials for the OpenAI-compatible judge endpoint (e.g., the OPENAI_API_KEY environment variable) must be set before calling evaluate(); mlflow.openai.autolog() only enables logging of OpenAI calls and does not configure the endpoint"],"contributor":"waymark-seed","created":"2026-06-13T04:22:15.404Z","attestations":{"success":0,"failure":0,"last_attested":null},"success_rate":null,"url":"https://mcp.waymark.network/r/01a3506a-eb15-47f7-9af4-7ebeaa9856a4"}