Install lm-eval via pip and select a backend: HuggingFace transformers (hf), a local vLLM server (local-completions), or the OpenAI API (openai-completions)
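Inspect the results JSON written under the --output_path directory (e.g., results/results.json) for per-task accuracy (acc), normalized accuracy (acc_norm), and standard error (stderr); compare against published leaderboard numbers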
Add custom tasks by writing a YAML task config in a local directory and passing --include_path <dir> to register it without modifying the package
Use --limit 100 for quick sanity checks during development to avoid running full task suites on every model iteration
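The steps above can be sketched in Python as a minimal, non-authoritative example. It assembles an lm_eval command line using the flags named above (--limit, --include_path) together with --model, --model_args, --tasks, and --output_path, and it reads accuracy fields out of a results dictionary. The helper names build_command and summarize are hypothetical, as are the model and task names in the usage lines. The sketch also assumes metric keys look like "acc" or "acc,none", which may differ between lm-eval versions.

```python
import json
from pathlib import Path


def build_command(model_args, tasks, limit=None, include_path=None,
                  backend="hf", output_path="results"):
    """Assemble an lm_eval argv list. Nothing is executed here."""
    cmd = [
        "lm_eval",
        "--model", backend,
        "--model_args", model_args,
        "--tasks", ",".join(tasks),
        "--output_path", output_path,
    ]
    if limit is not None:
        # Quick sanity check: evaluate only the first `limit` documents per task.
        cmd += ["--limit", str(limit)]
    if include_path is not None:
        # Directory holding custom YAML task configs.
        cmd += ["--include_path", str(include_path)]
    return cmd


def summarize(results):
    """Return {task: {metric: value}} for acc, acc_norm, and their stderr."""
    wanted = ("acc", "acc_stderr", "acc_norm", "acc_norm_stderr")
    summary = {}
    for task, metrics in results.get("results", {}).items():
        row = {}
        for key, value in metrics.items():
            name = key.split(",")[0]  # assumed key shape: "acc,none" -> "acc"
            # Skip non-numeric entries such as aliases or "N/A" placeholders.
            if name in wanted and isinstance(value, (int, float)):
                row[name] = value
        summary[task] = row
    return summary


def load_results(path):
    """Read a results JSON file from disk and summarize it."""
    return summarize(json.loads(Path(path).read_text()))


# Illustrative usage; the model and task names are placeholders.
cmd = build_command("pretrained=EleutherAI/pythia-160m", ["hellaswag"], limit=100)
print(" ".join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would launch the evaluation. Numbers obtained with --limit are only a smoke test and should not be compared against leaderboard figures.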
Known gotchas
MMLU requires the model to handle 4-way multiple-choice formatting correctly; models that ignore the A/B/C/D prompt structure will score near random chance (about 25%) regardless of capability
Batch size affects throughput but not correctness for most tasks; however, loglikelihood-scored tasks (e.g., HellaSwag) can show slight numerical differences across batch sizes due to padding
The --trust_remote_code flag is required for some Hugging Face models that use custom modeling code; omitting it will raise a trust error that looks like a missing dependency
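One way to catch the MMLU formatting gotcha early is to test whether a reported accuracy is statistically indistinguishable from guessing. The sketch below is a heuristic and not part of lm-eval: the function name looks_like_chance and the default threshold of two standard errors are assumptions made for illustration. It uses the binomial standard error of a uniform random guesser over n_choices options.

```python
import math


def looks_like_chance(acc, n_questions, n_choices=4, z=2.0):
    """True if `acc` lies within z standard errors of random guessing."""
    chance = 1.0 / n_choices
    # Binomial standard error of a uniform random guesser.
    stderr = math.sqrt(chance * (1.0 - chance) / n_questions)
    return abs(acc - chance) <= z * stderr


# Illustrative values only, not real benchmark results.
print(looks_like_chance(0.26, 1000))
print(looks_like_chance(0.45, 1000))
```

A True result on a model expected to do well suggests checking the prompt format before drawing conclusions about capability. With small --limit runs the standard error is large, so the check becomes less discriminating.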