Set the BRAINTRUST_API_KEY environment variable to your API key
Define an Eval in a Python file: import Eval from braintrust and call Eval('my-project', data=lambda: [{'input': ..., 'expected': ...}], task=lambda input: my_llm_function(input), scores=[autoevals.Levenshtein])
Run the evaluation with braintrust eval eval_file.py; results are uploaded to Braintrust and a summary is printed to the terminal
Compare experiment runs in the Braintrust UI to see score regressions across versions
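Gate CI by passing --fail-on-score-decrease to the CLI command (confirm with braintrust eval --help that your CLI version supports this flag), or by checking the scores in the returned experiment summary against minimum thresholds and exiting non-zero when one falls short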
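The steps above can be sketched as a single eval file. This is a minimal sketch, not a definitive setup: my_llm_function is a hypothetical stand-in for the model call, scores_below is a hypothetical helper for threshold gating that works on a plain mapping of score name to average (how you pull those averages out of the experiment summary depends on your SDK version), and the environment-variable guard is an assumption of this sketch so the file can be imported in unit tests without credentials.

```python
# eval_file.py
import os


def my_llm_function(text: str) -> str:
    """Hypothetical stand-in for the model call under test."""
    return "Hi " + text


def load_cases():
    """Return the eval cases; each case carries 'input' and 'expected'."""
    return [
        {"input": "Foo", "expected": "Hi Foo"},
        {"input": "Bar", "expected": "Hello Bar"},
    ]


def task(input):
    # Each case's 'input' value is handed to the task; the return value
    # is what the scorers compare against 'expected'.
    return my_llm_function(input)


def scores_below(averages, minimums):
    """Return the names of scores whose average is under its minimum.

    Hypothetical CI helper: a missing score counts as 0.0, so it fails.
    """
    return sorted(
        name for name, floor in minimums.items()
        if averages.get(name, 0.0) < floor
    )


# Assumption of this sketch: only register the eval when the API key is
# set, so importing this module without credentials has no side effects.
if os.environ.get("BRAINTRUST_API_KEY"):
    from braintrust import Eval
    from autoevals import Levenshtein

    Eval("my-project", data=load_cases, task=task, scores=[Levenshtein])
```

In a CI script, a non-empty result from scores_below(averages, {"Levenshtein": 0.8}) would be the signal to exit with a non-zero status; the 0.8 floor is illustrative only.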
Known gotchas
The data parameter accepts a callable that returns an iterable. Returning a plain list works but loads the whole dataset into memory; for large datasets, return a generator instead
braintrust eval reruns the entire dataset on every invocation; use the --filter flag to target a subset of test cases during development to avoid unnecessary LLM calls
Scores must be in the range [0, 1]; scorers that return values outside this range cause display anomalies in the UI and incorrect regression detection
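Two of the gotchas above lend themselves to a short sketch: a generator-backed data source and a custom scorer clamped to [0, 1]. Everything here is illustrative. The JSONL layout, the file name cases.jsonl, and the names parse_cases, iter_cases, clamp01 and length_ratio are assumptions, and the scorer signature (input, output, expected) follows the common custom-scorer convention, which is worth checking against your SDK version. Whether the SDK consumes the generator lazily end to end is also worth verifying.

```python
import json
from typing import Iterable, Iterator


def parse_cases(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one case per non-blank JSONL line, without building a list."""
    for line in lines:
        if line.strip():
            row = json.loads(line)
            yield {"input": row["input"], "expected": row["expected"]}


def iter_cases(path: str) -> Iterator[dict]:
    """Stream cases from a JSONL file (hypothetical layout: one object per line)."""
    with open(path, encoding="utf-8") as fh:
        yield from parse_cases(fh)


def clamp01(value: float) -> float:
    """Force a raw metric into the [0, 1] range."""
    return max(0.0, min(1.0, value))


def length_ratio(input, output, expected):
    """Hypothetical scorer: output length relative to expected length.

    The raw ratio can exceed 1 when the output is longer than expected,
    so it is clamped before being returned.
    """
    if not expected:
        return 1.0 if not output else 0.0
    return clamp01(len(output) / len(expected))
```

With these in place, the eval could be given data=lambda: iter_cases("cases.jsonl") and scores=[length_ratio].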