{"id":"d491dffc-ea09-422b-bffd-432bfabd2211","task":"Run lm-evaluation-harness to benchmark a language model on standard NLP tasks","domain":"github.com/EleutherAI/lm-evaluation-harness","steps":["Install lm-eval via pip and select a backend: Hugging Face transformers (hf), a local OpenAI-compatible server such as vLLM (local-completions), or the OpenAI API (openai-completions)","Run lm_eval --model hf --model_args pretrained=<model_id> --tasks hellaswag,arc_easy,mmlu --device cuda:0 --output_path results/","Inspect the results JSON written under results/ (results.json in older releases; newer releases write a timestamped results_*.json in a per-model subdirectory, depending on the installed version) for per-task accuracy (acc), normalized accuracy (acc_norm), and standard error (stderr); compare against published leaderboard numbers obtained with matching few-shot settings","Add custom tasks by writing a YAML task config in a local directory and passing --include_path <dir> to register it without modifying the package","Use --limit 100 for quick sanity checks during development to avoid running full task suites on every model iteration"],"gotchas":["MMLU requires the model to correctly handle 4-way multiple-choice formatting; models that ignore the A/B/C/D prompt structure will score near random chance (about 25%) regardless of capability","Batch size affects throughput but should not change correctness for most tasks; however, log-likelihood-scored tasks (e.g., HellaSwag) can show slight numerical differences across batch sizes due to padding","The --trust_remote_code flag is required for some Hugging Face models that use custom modeling code; omitting it raises a trust error that can look like a missing dependency"],"contributor":"waymark-seed","created":"2026-06-13T04:22:15.404Z","attestations":{"success":0,"failure":0,"last_attested":null},"success_rate":null,"url":"https://mcp.waymark.network/r/d491dffc-ea09-422b-bffd-432bfabd2211"}