Steps

Install lm-eval via pip and select a backend: HuggingFace transformers (hf), an OpenAI-compatible local server such as vLLM (local-completions), or the OpenAI API (openai-completions)
Run lm_eval --model hf --model_args pretrained=<model_id> --tasks hellaswag,arc_easy,mmlu --device cuda:0 --output_path results/
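Inspect the results JSON written under results/ for per-task accuracy, normalized accuracy, and stderr (depending on the lm-eval version, the file may sit in a model-named subdirectory with a timestamped name rather than at results/results.json); compare against published leaderboard numbers only when the few-shot count (--num_fewshot) and prompt format match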
Add custom tasks by writing a YAML task config in a local directory and passing --include_path <dir> to register it without modifying the package
Use --limit 100 for quick sanity checks during development to avoid running full task suites on every model iteration
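The steps above can be sketched as a small Python helper that assembles the lm_eval invocation from the flags already named in this route. This is a minimal sketch, not part of lm-eval itself: the function name build_lm_eval_command, the model id my-org/my-model, and the directory ./my_tasks are hypothetical placeholders.

```python
import shlex
from typing import List, Optional


def build_lm_eval_command(
    model_id: str,
    tasks: List[str],
    output_path: str = "results/",
    device: str = "cuda:0",
    limit: Optional[int] = None,
    include_path: Optional[str] = None,
) -> List[str]:
    """Assemble an lm_eval argument list using only the flags from the steps."""
    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_id}",
        "--tasks", ",".join(tasks),
        "--device", device,
        "--output_path", output_path,
    ]
    # Register a directory of custom YAML task configs, if one is given.
    if include_path is not None:
        cmd += ["--include_path", include_path]
    # Cap the number of examples per task for quick sanity checks.
    if limit is not None:
        cmd += ["--limit", str(limit)]
    return cmd


# Placeholder model id and task directory (assumptions, not real identifiers).
smoke_test = build_lm_eval_command(
    "my-org/my-model",
    ["hellaswag", "arc_easy"],
    limit=100,
    include_path="./my_tasks",
)
print(shlex.join(smoke_test))
```

To launch the run rather than print it, the same list can be passed to subprocess.run(smoke_test, check=True). Dropping the limit argument produces the full-suite command for a final evaluation.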

Known gotchas

MMLU requires the model to correctly handle 4-way multiple choice formatting: models that ignore the A/B/C/D prompt structure will score near random chance (about 25%) regardless of capability
Batch size affects throughput but not correctness for most tasks; however, loglikelihood-scored tasks (e.g., HellaSwag) can show slight numerical differences across batch sizes due to padding
The --trust_remote_code flag is required for some Hugging Face models that use custom modeling code; omitting it will raise a trust error that looks like a missing dependency
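One way to spot the MMLU formatting gotcha is to check whether a reported accuracy is statistically indistinguishable from uniform guessing. The sketch below assumes you have already read an accuracy and its stderr from the results file. The helper names, the two-standard-error threshold, and the numbers are illustrative choices, not lm-eval conventions or measured results.

```python
def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing over num_choices options."""
    return 1.0 / num_choices


def is_near_chance(
    acc: float, stderr: float, num_choices: int = 4, z: float = 2.0
) -> bool:
    """True when acc lies within z standard errors of random guessing."""
    return abs(acc - chance_accuracy(num_choices)) <= z * stderr


# Made-up placeholder values, not measurements from any model.
print(is_near_chance(0.26, 0.01))  # within 2 stderr of 0.25: likely a formatting problem
print(is_near_chance(0.45, 0.01))  # far above 0.25: the model is using the choices
```

A score flagged as near chance on a model that performs well elsewhere is a hint to inspect the prompt format before drawing conclusions about capability.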

github.com/EleutherAI/lm-evaluation-harness · 5 steps · unrated



Run lm-evaluation-harness to benchmark a language model on standard NLP tasks
