Steps

Install the SDK with pip install langsmith, then set the environment variables LANGCHAIN_API_KEY (your LangSmith API key) and LANGCHAIN_TRACING_V2=true
Create a dataset: import langsmith; client = langsmith.Client(); dataset = client.create_dataset('my-dataset'); client.create_examples(inputs=[{'question': '...'}], outputs=[{'answer': '...'}], dataset_id=dataset.id)
Define a target function that takes a dict of inputs and returns a dict of outputs; it wraps the LLM call or chain being evaluated
Define one or more evaluator functions that take arguments named inputs, outputs, and reference_outputs (each a dict) and return an EvaluationResult, or a dict, carrying a score or label; older SDK versions use a (run, example) signature instead
Run the evaluation: results = langsmith.evaluate(target, data='my-dataset', evaluators=[my_evaluator], experiment_prefix='run-1')
Inspect results in the LangSmith UI under the Datasets & Testing tab, or read results.to_pandas() programmatically
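
The steps above can be sketched roughly as follows. This is a minimal illustration, not the official quickstart: the echo logic in target, the exact_match evaluator, the sample question, and the run_experiment helper are all hypothetical, and the evaluator signature assumes a recent SDK version that passes inputs, outputs, and reference_outputs by name. The SDK import is deferred so the two pure functions can be tried without network access.

```python
def target(inputs: dict) -> dict:
    # Stand-in for the LLM call or chain under evaluation (hypothetical logic).
    question = inputs["question"]
    return {"answer": f"Echo: {question}"}


def exact_match(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    # Score 1 when the produced answer equals the reference answer, else 0.
    matched = outputs["answer"] == reference_outputs["answer"]
    return {"key": "exact_match", "score": int(matched)}


def run_experiment(dataset_name: str = "my-dataset"):
    # Hypothetical helper; imports the SDK lazily and needs LANGCHAIN_API_KEY set.
    import langsmith

    client = langsmith.Client()
    # Assumption: the dataset does not exist yet; creation may fail otherwise.
    dataset = client.create_dataset(dataset_name)
    client.create_examples(
        inputs=[{"question": "What is 2 + 2?"}],
        outputs=[{"answer": "4"}],
        dataset_id=dataset.id,
    )
    return langsmith.evaluate(
        target,
        data=dataset_name,
        evaluators=[exact_match],
        experiment_prefix="run-1",
    )
```

Calling run_experiment() with valid credentials should create the dataset and start an experiment; the returned object can then be inspected with to_pandas() as described in the last step.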

Known gotchas

The evaluate() function is synchronous by default; use aevaluate() with an async target and evaluators for faster evaluation of large datasets
When LANGCHAIN_TRACING_V2 is set, LangSmith traces every call made inside the target function. On large datasets this generates significant trace volume and cost, so enable tracing only for eval runs if needed
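Evaluator functions should return an EvaluationResult object or a dict with at least a 'score' key; returning a plain number can cause a deserialization error in the SDK. Newer SDK releases may accept bare numbers or booleans, so check the behavior of the version you have installed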
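
Two of the gotchas above can be illustrated together: an async target run through aevaluate(), and an evaluator that always returns a dict with a 'score' key. This is a sketch under assumptions: async_target, as_result, length_ok, run_async_experiment, the 200-character limit, and the concurrency value are hypothetical, and a synchronous evaluator is used here for brevity.

```python
import asyncio


async def async_target(inputs: dict) -> dict:
    # Placeholder for an awaitable LLM call (hypothetical logic).
    await asyncio.sleep(0)
    return {"answer": inputs["question"].upper()}


def as_result(key: str, raw) -> dict:
    # Wrap a bare number or bool in the dict shape that carries a 'score' key.
    if isinstance(raw, dict):
        return raw
    return {"key": key, "score": float(raw)}


def length_ok(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    # Hypothetical check: the answer stays within 200 characters.
    return as_result("length_ok", len(outputs["answer"]) <= 200)


async def run_async_experiment(dataset_name: str = "my-dataset"):
    # Imported lazily; requires the SDK, credentials, and an existing dataset.
    from langsmith.evaluation import aevaluate

    return await aevaluate(
        async_target,
        data=dataset_name,
        evaluators=[length_ok],
        experiment_prefix="run-1-async",
        max_concurrency=4,  # assumed value; tune to your rate limits
    )
```

With credentials configured, asyncio.run(run_async_experiment()) would start the experiment; the wrapper keeps evaluator output in the dict form regardless of SDK version.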

docs.smith.langchain.com · 6 steps · unrated

Run evals with Braintrust

braintrust.dev · 6 steps · unrated

