Install the langsmith Python SDK (pip install langsmith) and set the LANGSMITH_API_KEY environment variable; the older LANGCHAIN_API_KEY name is also accepted
Create or reference an existing dataset in LangSmith that holds your test inputs and expected outputs
Define a target function that takes the inputs dict of one dataset example and returns a dict containing the model output to be evaluated
Define one or more evaluator functions that score each output, or use built-in evaluators from langsmith.evaluation
Call evaluate(target, data=DATASET_NAME, evaluators=[...]) to launch the experiment; the SDK runs the target on every example, creates an experiment, and logs the outputs and evaluator scores to it
Review the experiment in the LangSmith UI, comparing scores across runs and inspecting individual traces
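The steps above can be sketched end to end as follows. This is a minimal illustration, not a definitive setup: the dataset name, the "question" and "answer" field names, the experiment prefix, and the placeholder "model" inside target are all assumptions made for the example. The evaluate call is wrapped in a function so the target and evaluator can be checked offline first.

```python
import os
from types import SimpleNamespace

DATASET_NAME = "my-eval-dataset"  # hypothetical name; the dataset must already exist


def target(inputs: dict) -> dict:
    """Receive one example's inputs and return the output to be scored."""
    # Placeholder "model": replace with your real LLM or chain call.
    return {"answer": inputs["question"].strip().lower()}


def exact_match(run, example) -> dict:
    """Score 1.0 when the target's answer equals the example's expected answer."""
    predicted = (run.outputs or {}).get("answer")
    expected = (example.outputs or {}).get("answer")
    return {"key": "exact_match", "score": 1.0 if predicted == expected else 0.0}


def run_experiment():
    """Launch the experiment; needs the langsmith package and an API key."""
    if not (os.environ.get("LANGSMITH_API_KEY") or os.environ.get("LANGCHAIN_API_KEY")):
        raise RuntimeError("Set an API key before running the experiment")
    # Imported lazily so the helpers above can be tested without the SDK.
    from langsmith.evaluation import evaluate

    return evaluate(
        target,
        data=DATASET_NAME,
        evaluators=[exact_match],
        experiment_prefix="baseline",  # hypothetical prefix for the experiment name
    )


# Offline sanity check using stand-ins for the run and example objects.
fake_run = SimpleNamespace(outputs=target({"question": "  Paris "}))
fake_example = SimpleNamespace(outputs={"answer": "paris"})
print(exact_match(fake_run, fake_example))
```

Calling run_experiment() once the dataset exists and the key is set launches the experiment; the offline check at the bottom only confirms that the evaluator returns the expected shape.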
Known gotchas
The dataset name passed to evaluate() must exactly match an existing dataset in your LangSmith workspace; a mismatch raises a not-found error rather than creating a new dataset
Evaluator functions should return a dict with a key and a numeric score, for example {"key": "exact_match", "score": 1.0}; depending on the SDK version, plain booleans or strings may not be accepted, and the results can be silently dropped
Setting num_repetitions runs each example that many times, which multiplies target and evaluator calls (and any model API cost) by the same factor; confirm intent before enabling
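Two of these gotchas can be guarded against in plain Python before anything is sent to LangSmith. The sketch below is illustrative only: as_feedback and estimated_calls are hypothetical helpers written for this example, not part of the langsmith SDK, and the "answer" field name is an assumption.

```python
from functools import wraps
from types import SimpleNamespace


def as_feedback(key: str):
    """Hypothetical decorator: coerce an evaluator's result into a key/score dict."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(run, example):
            result = fn(run, example)
            if isinstance(result, dict):
                return result
            if isinstance(result, (bool, int, float)):
                return {"key": key, "score": float(result)}
            # Fail loudly instead of letting an unusable result vanish.
            raise TypeError(
                f"{fn.__name__} returned {type(result).__name__}; expected a dict or a number"
            )
        return wrapper
    return decorator


@as_feedback("non_empty")
def non_empty(run, example):
    # Returns a bare boolean; the decorator converts it to a scored dict.
    return bool((run.outputs or {}).get("answer"))


def estimated_calls(n_examples: int, num_repetitions: int = 1, n_evaluators: int = 1) -> dict:
    """Hypothetical helper: rough count of invocations an experiment triggers."""
    target_calls = n_examples * num_repetitions
    return {"target_calls": target_calls, "evaluator_calls": target_calls * n_evaluators}


print(non_empty(SimpleNamespace(outputs={"answer": "hi"}), None))
print(estimated_calls(n_examples=50, num_repetitions=3, n_evaluators=2))
```

Wrapping evaluators this way turns a silently dropped score into an immediate TypeError, and the call estimate gives a quick sense of how num_repetitions scales the work before it is enabled.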