Define a target function that takes a dict of inputs and returns a dict of outputs — this wraps the LLM call or chain being evaluated
Define one or more evaluator functions that take arguments named inputs, outputs, and reference_outputs (each a dict) and return an EvaluationResult, or a dict, carrying a score or label
Run the evaluation: results = langsmith.evaluate(target, data='my-dataset', evaluators=[my_evaluator], experiment_prefix='run-1')
Inspect results in the LangSmith UI under the Datasets & Testing tab, or read results.to_pandas() programmatically
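The steps above can be sketched roughly as follows. This is a minimal illustration, not the canonical setup: the body of target, the 'question' and 'answer' keys, and the exact_match evaluator are hypothetical placeholders, and run_experiment assumes the langsmith SDK is installed, an API key is configured, and a dataset named 'my-dataset' already exists.

```python
def target(inputs: dict) -> dict:
    """Stand-in for the LLM call or chain under evaluation."""
    # Hypothetical placeholder: swap in the real model or chain call here.
    return {"answer": f"echo: {inputs['question']}"}


def exact_match(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Score 1.0 when the target's answer equals the reference answer."""
    matched = outputs["answer"] == reference_outputs["answer"]
    return {"key": "exact_match", "score": 1.0 if matched else 0.0}


def run_experiment():
    """Run the evaluation; needs the SDK, credentials, and the dataset."""
    # Imported lazily so the two functions above can be tested offline.
    from langsmith import evaluate

    results = evaluate(
        target,
        data="my-dataset",
        evaluators=[exact_match],
        experiment_prefix="run-1",
    )
    # One row per example, for programmatic inspection.
    return results.to_pandas()
```

Keeping the target and evaluator as plain functions means they can be unit-tested without any network access before an experiment is launched.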
Known gotchas
The evaluate() function is synchronous; for large datasets, aevaluate() with an async target (and, optionally, async evaluators) is usually faster
LangSmith traces every call made inside the target function when LANGCHAIN_TRACING_V2 is set (newer SDK versions also read LANGSMITH_TRACING), which generates significant trace volume and cost for large datasets; if that is a concern, enable tracing only for the evaluation runs that need it
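Evaluator functions should return an EvaluationResult object or a dict with at least a 'score' key. Returning a plain number causes a deserialization error in some SDK versions (newer releases also accept bare bool and numeric returns), so a dict is the safest choice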
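As a sketch of the first and last gotchas, an async variant might look like the following. The names atarget and answer_length_ok, the 'question' and 'answer' keys, and the length heuristic are hypothetical; run_async_experiment assumes the langsmith SDK, credentials, and a 'my-dataset' dataset are available, and the max_concurrency value is an arbitrary example.

```python
import asyncio


async def atarget(inputs: dict) -> dict:
    """Async stand-in for an awaited model or chain call."""
    await asyncio.sleep(0)  # hypothetical placeholder for real async I/O
    return {"answer": f"echo: {inputs['question']}"}


async def answer_length_ok(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Return a dict with a 'score' key rather than a bare number."""
    within_limit = len(outputs["answer"]) <= 2 * len(reference_outputs["answer"])
    return {"key": "answer_length_ok", "score": int(within_limit)}


async def run_async_experiment():
    """Run the async evaluation; needs the SDK, credentials, and the dataset."""
    # Imported lazily so the coroutines above can be tested offline.
    from langsmith import aevaluate

    return await aevaluate(
        atarget,
        data="my-dataset",
        evaluators=[answer_length_ok],
        experiment_prefix="run-1-async",
        max_concurrency=4,  # assumed value; tune to the provider's rate limits
    )
```

Calling asyncio.run(run_async_experiment()) would start the experiment from a script.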