Authenticate with your OpenAI API key and confirm your organization has access to the Evals API
Define a data_source_config object that specifies the schema of your test data (fields for prompt and expected output)
Define a testing_criteria array specifying one or more grader objects, such as a model-graded criterion with a scoring rubric
POST to the /v1/evals endpoint to create the eval configuration and capture the id of the returned eval object for use as eval_id
POST to /v1/evals/{eval_id}/runs to launch a run against your data source, passing the run configuration
Poll the run status and retrieve per-sample results once the run reaches a terminal state
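The steps above can be sketched with the Python standard library alone. This is a minimal sketch, not a definitive client: the payload field names (type "custom", item_schema, include_sample_schema, the string_check grader and its {{ item.* }} / {{ sample.* }} templates) and the terminal status names are assumptions based on the public Evals API reference and should be checked against the current documentation. A string_check grader stands in for the model-graded criterion to keep the example short. The helper names (build_eval_payload, build_request, send, wait_for_run) are hypothetical.

```python
import json
import time
import urllib.request

API_BASE = "https://api.openai.com/v1"

# Assumed terminal run states; confirm against the API reference.
TERMINAL_STATES = {"completed", "failed", "canceled"}


def build_eval_payload():
    """Steps 2-3: describe the test-data schema and one grader."""
    return {
        "name": "example-eval",  # hypothetical name
        "data_source_config": {
            "type": "custom",
            "item_schema": {
                "type": "object",
                "properties": {
                    "prompt": {"type": "string"},
                    "expected_output": {"type": "string"},
                },
                "required": ["prompt", "expected_output"],
            },
            "include_sample_schema": True,
        },
        "testing_criteria": [
            {
                "type": "string_check",
                "name": "exact match",
                "input": "{{ sample.output_text }}",
                "operation": "eq",
                # Must name a field declared in item_schema above.
                "reference": "{{ item.expected_output }}",
            }
        ],
    }


def build_request(path, api_key, payload=None):
    """Step 1: attach the API key as a bearer token. POST if a payload is given, else GET."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    return urllib.request.Request(
        f"{API_BASE}{path}",
        data=data,
        method="POST" if payload is not None else "GET",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def send(request):
    """Steps 4-5: perform the HTTP call and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_run(fetch_run, interval_s=5.0, max_polls=120, sleep=time.sleep):
    """Step 6: call fetch_run() until the run reports a terminal status."""
    for _ in range(max_polls):
        run = fetch_run()
        if run.get("status") in TERMINAL_STATES:
            return run
        sleep(interval_s)
    raise TimeoutError("run did not reach a terminal state")
```

In use, the eval would be created with send(build_request("/evals", api_key, build_eval_payload())), the run launched by posting a run configuration to /evals/{eval_id}/runs, and wait_for_run given a function that GETs the run object. Passing the fetch function in keeps the polling loop testable without network access.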
Known gotchas
The OpenAI Evals platform is scheduled to become read-only for existing users in late 2026 and shut down thereafter. Factor this timeline into any new pipeline you build
The data_source_config schema must match the field names referenced in your testing_criteria exactly; schema mismatches cause run failures with opaque error messages
Model-graded criteria incur additional token costs for the grader model, on top of the inference cost of generating outputs for the test data; budget accordingly for large eval sets
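Because schema mismatches surface only as opaque run failures, it can help to check field names locally before creating the eval. The sketch below assumes graders reference test-data fields with {{ item.<field> }} templates and that the schema lives under item_schema.properties; both are assumptions about the payload shape, and the helper names (find_item_refs, missing_fields) are hypothetical.

```python
import re

# Matches the field name in templates such as "{{ item.expected_output }}".
ITEM_REF = re.compile(r"\{\{\s*item\.([A-Za-z_][A-Za-z0-9_]*)")


def find_item_refs(value):
    """Recursively collect every item.<field> name referenced in template strings."""
    if isinstance(value, str):
        return set(ITEM_REF.findall(value))
    if isinstance(value, dict):
        return set().union(*(find_item_refs(v) for v in value.values()))
    if isinstance(value, list):
        return set().union(*(find_item_refs(v) for v in value))
    return set()


def missing_fields(data_source_config, testing_criteria):
    """Return referenced field names that the schema does not declare, sorted."""
    declared = set(data_source_config.get("item_schema", {}).get("properties", {}))
    return sorted(find_item_refs(testing_criteria) - declared)
```

An empty list from missing_fields means every field the graders reference is declared in the schema; any names returned are candidates for the mismatch described above.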