Steps

Install the langsmith Python SDK (pip install langsmith) and set the LANGSMITH_API_KEY environment variable; the SDK also reads the legacy LANGCHAIN_API_KEY name
Create or reference an existing dataset in LangSmith that holds your test inputs and expected outputs
Define a target function that takes a dataset example and returns the model output to be evaluated
Define one or more evaluator functions that score each output, or use built-in evaluators from langsmith.evaluation
Call evaluate(target, data=DATASET_NAME, evaluators=[...]) to launch the experiment; the SDK creates an experiment run and logs results
Review the experiment in the LangSmith UI, comparing scores across runs and inspecting individual traces
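
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: the dataset name, the "question"/"answer" field names, the experiment prefix, and the upper-casing stand-in for a model call are all hypothetical placeholders, and it assumes the dataset already exists and an API key is set.

```python
import os

# Hypothetical dataset name; it must already exist in LangSmith.
DATASET_NAME = "my-eval-dataset"


def target(inputs: dict) -> dict:
    """Stand-in for the model call; replace the body with your real model.

    Assumes each dataset example has a "question" input field (hypothetical).
    """
    return {"answer": inputs["question"].strip().upper()}


def exact_match(run, example) -> dict:
    """Evaluator: score 1.0 when the output equals the reference answer.

    Assumes both outputs carry an "answer" field (hypothetical schema).
    """
    predicted = (run.outputs or {}).get("answer")
    expected = (example.outputs or {}).get("answer")
    return {"key": "exact_match", "score": 1.0 if predicted == expected else 0.0}


def run_experiment():
    """Launch the experiment; needs network access and a valid API key."""
    # Imported here so the helpers above can be tested without the SDK.
    from langsmith.evaluation import evaluate

    if not (os.environ.get("LANGSMITH_API_KEY") or os.environ.get("LANGCHAIN_API_KEY")):
        raise RuntimeError("Set LANGSMITH_API_KEY before running the experiment")

    return evaluate(
        target,
        data=DATASET_NAME,
        evaluators=[exact_match],
        experiment_prefix="exact-match-baseline",  # hypothetical label
    )
```

Calling run_experiment() once the dataset and key are in place starts the experiment. Keeping the SDK import inside that function means target and exact_match can be unit-tested offline before any API calls are made.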

Known gotchas

The dataset name passed to evaluate() must exactly match an existing dataset in your LangSmith workspace; a mismatch raises a not-found error rather than creating a new dataset
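Evaluator functions should return a dict with a key and a numeric score; handling of plain booleans, numbers, or strings depends on the SDK version (langsmith 0.2 and later interpret a bare bool, int, or float as the score and a bare string as a categorical value, while older releases may reject such results or fail to record them), so the dict form is the safe choice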
Setting num_repetitions runs every example that many times, so the number of target and evaluator calls, and with it the API cost, scales linearly with the value; confirm intent before enabling
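
A few small guards can make these gotchas harder to hit. The helpers below are a hedged sketch: as_feedback, planned_target_calls, and dataset_exists are hypothetical names introduced here, not part of the SDK, and dataset_exists assumes the langsmith Client exposes has_dataset(dataset_name=...) in your installed version.

```python
from typing import Any


def as_feedback(key: str, result: Any) -> dict:
    """Coerce an evaluator result into an explicit {"key", "score"} dict.

    Hypothetical helper: keeps the return shape independent of SDK version.
    """
    if isinstance(result, dict):
        # An explicit dict wins; fill in the key only if it is missing.
        return {"key": key, **result}
    if isinstance(result, bool):
        return {"key": key, "score": 1.0 if result else 0.0}
    if isinstance(result, (int, float)):
        return {"key": key, "score": float(result)}
    raise TypeError(f"{key}: cannot derive a numeric score from {type(result).__name__}")


def planned_target_calls(num_examples: int, num_repetitions: int = 1) -> int:
    """Number of target invocations an experiment will make."""
    return num_examples * num_repetitions


def dataset_exists(dataset_name: str) -> bool:
    """Check the dataset name before launching (requires network and API key).

    Assumes Client.has_dataset is available in the installed SDK version.
    """
    from langsmith import Client

    return Client().has_dataset(dataset_name=dataset_name)
```

For example, a 50-example dataset with num_repetitions=3 gives planned_target_calls(50, 3) == 150 target calls, each followed by every evaluator, which is worth checking against your budget before launching.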

Related routes

Run evals with LangSmith

docs.langchain.com · 6 steps · unrated

Run MLflow evaluate() to compare two candidate models on a shared validation dataset

mlflow.org/docs · 5 steps · unrated

Run lm-evaluation-harness to benchmark a language model on standard NLP tasks

github.com/EleutherAI/lm-evaluation-harness · 5 steps · unrated


Run a LangSmith evaluation experiment against a dataset using the evaluate() SDK function
