Steps

Install the SDK: pip install braintrust autoevals (autoevals provides ready-made scorers)
Set the BRAINTRUST_API_KEY environment variable to your API key
Define an evaluation in a Python file: import Eval from braintrust and call Eval('my-project', data=lambda: [{'input': ..., 'expected': ...}], task=lambda input: my_llm_function(input), scores=[autoevals.Levenshtein])
Run the evaluation: braintrust eval eval_file.py — results are uploaded to Braintrust and a summary is printed to the terminal
Compare experiment runs in the Braintrust UI to see score regressions across versions
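Gate CI by passing --fail-on-score-decrease to the CLI command (confirm with braintrust eval --help that your installed version supports this flag), or by checking the scores in the returned experiment summary against your own thresholds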
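
The threshold check in the last step can be sketched in plain Python. This is a minimal sketch and not part of the Braintrust API: check_thresholds, gate, the scores dict, and the threshold values are all hypothetical, and how you pull per-scorer averages out of the experiment summary depends on your SDK version.

```python
import sys


def check_thresholds(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one message per scorer whose average falls below its minimum."""
    failures = []
    for name, minimum in thresholds.items():
        actual = scores.get(name)
        if actual is None:
            # A scorer we expected to see did not report anything.
            failures.append(f"{name}: no score reported")
        elif actual < minimum:
            failures.append(f"{name}: {actual:.3f} is below the minimum {minimum:.3f}")
    return failures


def gate(scores: dict[str, float], thresholds: dict[str, float]) -> int:
    """Print any failures to stderr and return an exit code (0 = pass, 1 = fail)."""
    failures = check_thresholds(scores, thresholds)
    for message in failures:
        print(message, file=sys.stderr)
    return 1 if failures else 0
```

In a CI script, something like sys.exit(gate({"Levenshtein": 0.82}, {"Levenshtein": 0.75})) would fail the job whenever a scorer drops below its floor. The scores dict here is a hand-written stand-in for values read from the experiment summary.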

Known gotchas

The data parameter accepts a callable that returns an iterable. Returning a plain list works but holds the entire dataset in memory at once; for large datasets, return a generator that yields one case at a time
braintrust eval reruns the entire dataset on every invocation; use the --filter flag to target a subset of test cases during development to avoid unnecessary LLM calls
Scores must be in the range [0, 1]; scorers that return values outside this range cause display anomalies in the UI and incorrect regression detection
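
Two of the gotchas above can be illustrated with small helpers. This is a sketch under stated assumptions: load_cases and normalize_score are hypothetical names, the JSONL layout (one object per line with input and expected keys) is assumed, and the raw metric range passed to normalize_score is something you choose yourself.

```python
import json
from typing import Iterator


def load_cases(path: str) -> Iterator[dict]:
    """Lazily yield one {'input': ..., 'expected': ...} case per non-empty JSONL line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                record = json.loads(line)
                yield {"input": record["input"], "expected": record["expected"]}


def normalize_score(raw: float, low: float = 0.0, high: float = 1.0) -> float:
    """Rescale a raw metric from [low, high] to [0, 1], clipping values outside the range."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return min(1.0, max(0.0, (raw - low) / (high - low)))
```

Passing data=lambda: load_cases("cases.jsonl") keeps the callable shape described above while reading the file one line at a time. A custom scorer that produces, say, a 0 to 10 rating could return normalize_score(rating, 0, 10) so the value stays within [0, 1].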


Run evals with Braintrust
