Install the promptfoo CLI (npm install -g promptfoo) and create a promptfooconfig.yaml in your repository
Define providers (e.g., openai:gpt-4o), prompts, and test cases with assert blocks specifying pass/fail criteria such as contains, llm-rubric, or regex
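Set the minimum pass rate required for the run (e.g., 90%); runs below this threshold exit with a non-zero code. Check where your promptfoo version expects the value: the documented run-level control is the PROMPTFOO_PASS_RATE_THRESHOLD environment variable (given as a percentage), whereas a threshold field on an individual test case sets the minimum weighted assertion score for that case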
Add a promptfoo eval --ci step to your CI workflow (GitHub Actions, GitLab CI, etc.); the non-zero exit code blocks merges on failure
Use promptfoo eval --output results.json to capture detailed per-test results as an artifact for review
Use the GitHub Action integration to automatically post evaluation result summaries as pull request comments
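As an illustration of the configuration described in the steps above, a minimal promptfooconfig.yaml might look like the sketch below. The provider and the three assertion types come from the steps; the prompt text, the ticket variable, and the expected values are made up for the example. The pass-rate threshold is left out because the steps do not show its exact placement.

```yaml
# promptfooconfig.yaml (illustrative sketch; prompt text and test values are hypothetical)
description: Prompt regression checks for CI

prompts:
  - "Summarize the following support ticket in one sentence: {{ticket}}"

providers:
  - openai:gpt-4o

tests:
  - description: Summary mentions the device and the firmware update
    vars:
      ticket: "The Acme router drops Wi-Fi every few minutes after the latest firmware update."
    assert:
      # Deterministic check: the output must contain this substring (case-sensitive)
      - type: contains
        value: "router"
      # Deterministic check: the output must match this regular expression
      - type: regex
        value: "[Ww]i-?[Ff]i"
      # Model-graded check: a judge LLM scores the output against this rubric
      - type: llm-rubric
        value: "Is a single sentence and mentions the firmware update"
```

Keeping the deterministic assertions (contains, regex) alongside a single llm-rubric check keeps most of the suite cheap and repeatable, with the judge model reserved for criteria that string matching cannot express.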
Known gotchas
llm-rubric assertions call a configured LLM judge, which adds latency and cost to every CI run; cache results where possible and scope test cases tightly
The threshold applies to the aggregate pass rate across all test cases, so a single catastrophic failure on a high-weight prompt can pull the aggregate below the threshold even when most tests pass
API keys for providers must be available as CI secrets; missing keys cause provider calls to fail with authentication errors that can be confused with assertion failures
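One way to keep missing credentials from being mistaken for assertion failures is to check for the key in its own CI step before the evaluation runs. The GitHub Actions workflow below is a sketch under stated assumptions: the file name, job name, Node version, artifact name, and the OPENAI_API_KEY secret name are all choices made for this example, and the eval command is shown without the --ci flag mentioned in the steps, which you can append if your promptfoo version accepts it.

```yaml
# .github/workflows/prompt-eval.yml (hypothetical file name)
name: prompt-eval
on:
  pull_request:

jobs:
  eval:
    runs-on: ubuntu-latest
    env:
      # Assumes a repository secret with this name has been created
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22   # assumption; use a version your promptfoo release supports
      - run: npm install -g promptfoo
      # Fail early with a clear message so a missing key is not read as a test failure
      - name: Check provider credentials
        run: |
          if [ -z "$OPENAI_API_KEY" ]; then
            echo "OPENAI_API_KEY secret is not set" >&2
            exit 1
          fi
      # A non-zero exit code from this step fails the job and blocks the merge
      - name: Run evaluation
        run: promptfoo eval -c promptfooconfig.yaml --output results.json
      # Upload results even when the evaluation step fails, so failures can be reviewed
      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: promptfoo-results
          path: results.json
```

Because the credential check is a separate named step, a red build points directly at either "Check provider credentials" or "Run evaluation", which makes the two failure modes easy to tell apart in the CI log.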