LM Evaluation Harness (lm-eval)
Overview
A comprehensive skill for evaluating language models with EleutherAI's LM Evaluation Harness, the most widely used open-source framework for benchmarking LLMs. It powers the HuggingFace Open LLM Leaderboard and supports 200+ benchmarks, including MMLU, ARC, HellaSwag, GSM8K, TruthfulQA, and Winogrande, providing standardized, reproducible evaluation of any model accessible via HuggingFace Transformers, vLLM, or an API.
When to Use
- Benchmarking LLMs on standard academic benchmarks
- Comparing model quality before and after fine-tuning
- Reproducing Open LLM Leaderboard results
- Evaluating across reasoning, knowledge, and language understanding
- Running custom evaluation tasks
- Automated quality testing in CI/CD
Quick Start
```bash
# Install
pip install lm-eval

# Run an evaluation
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3-8B \
  --tasks mmlu,hellaswag,arc_challenge \
  --batch_size auto \
  --output_path ./results

# List available tasks
lm_eval --tasks list
```
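After a run, `--output_path` writes a JSON file whose `results` key maps each task to its metrics, with keys shaped like `metric,filter` (e.g. `acc,none`). A minimal sketch of flattening such a file into a readable summary; the sample dict below is illustrative and the exact fields vary by harness version:

```python
def summarize(results: dict) -> list[tuple[str, str, float]]:
    """Flatten an lm-eval results dict into (task, metric, value) rows,
    skipping stderr entries and the 'alias' field."""
    rows = []
    for task, metrics in results["results"].items():
        for key, value in metrics.items():
            if key == "alias" or "stderr" in key:
                continue
            # Keys look like "acc,none"; keep only the metric name.
            rows.append((task, key.split(",")[0], value))
    return sorted(rows)

# Shaped like a (truncated) file written by --output_path;
# in practice: results = json.load(open(path))
sample = {
    "results": {
        "hellaswag": {"alias": "hellaswag", "acc,none": 0.571,
                      "acc_stderr,none": 0.005, "acc_norm,none": 0.753},
        "gsm8k": {"alias": "gsm8k", "exact_match,strict-match": 0.492},
    }
}
for task, metric, value in summarize(sample):
    print(f"{task:12s} {metric:12s} {value:.3f}")
```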
Key Benchmarks
| Benchmark | Tasks | Metric | Category |
|---|---|---|---|
| MMLU | 57 subjects | Accuracy | Knowledge |
| ARC-Challenge | 1172 | Accuracy (norm) | Science reasoning |
| HellaSwag | 10042 | Accuracy (norm) | Commonsense |
| TruthfulQA | 817 | MC1/MC2 accuracy | Truthfulness |
| GSM8K | 1319 | Exact match | Math |
| Winogrande | 1267 | Accuracy | Coreference |
| GPQA | 448 | Accuracy | PhD-level reasoning |
| IFEval | 541 | Strict/loose accuracy | Instruction following |
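Leaderboard-style headline numbers are typically just an unweighted macro-average of the per-benchmark scores above. A trivial sketch (the scores shown are illustrative, not real model results):

```python
def leaderboard_average(scores: dict[str, float]) -> float:
    """Unweighted mean across benchmarks, as leaderboards typically report."""
    return sum(scores.values()) / len(scores)

scores = {"mmlu": 0.68, "arc_challenge": 0.60, "hellaswag": 0.82,
          "truthfulqa_mc2": 0.45, "winogrande": 0.77, "gsm8k": 0.56}
print(f"Average: {leaderboard_average(scores):.4f}")
```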
Evaluation Commands
Open LLM Leaderboard Setup
```bash
# Reproduce the original (V1) HuggingFace Open LLM Leaderboard tasks.
# --num_fewshot takes a single value, so run each task with its own
# shot count: ARC 25, HellaSwag 10, MMLU 5, TruthfulQA 0,
# Winogrande 5, GSM8K 5. For example:
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3-8B-Instruct \
  --tasks arc_challenge \
  --num_fewshot 25 \
  --batch_size auto:4 \
  --output_path ./leaderboard_results

# For Leaderboard V2, the `leaderboard` task group runs all V2 tasks
# with their standard settings:
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3-8B-Instruct \
  --tasks leaderboard \
  --batch_size auto:4 \
  --output_path ./leaderboard_results
```
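Because `--num_fewshot` accepts only one value per invocation, tasks with different shot counts are easiest to run from a small driver loop. A sketch; the task/shot pairs follow the original V1 leaderboard and the model name is illustrative:

```python
import shlex
import subprocess

# (task, n_shot) pairs for the original Open LLM Leaderboard
LEADERBOARD_V1 = [("arc_challenge", 25), ("hellaswag", 10), ("mmlu", 5),
                  ("truthfulqa_mc2", 0), ("winogrande", 5), ("gsm8k", 5)]

def build_cmd(model: str, task: str, n_shot: int, out_dir: str) -> list[str]:
    """Assemble one lm_eval invocation as an argv list."""
    return ["lm_eval", "--model", "hf",
            "--model_args", f"pretrained={model}",
            "--tasks", task, "--num_fewshot", str(n_shot),
            "--batch_size", "auto", "--output_path", out_dir]

for task, n in LEADERBOARD_V1:
    cmd = build_cmd("meta-llama/Llama-3-8B-Instruct", task, n,
                    "./leaderboard_results")
    print(shlex.join(cmd))           # inspect before running
    # subprocess.run(cmd, check=True)  # uncomment to execute
```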
vLLM Backend (Faster)
```bash
# Use vLLM for much faster inference
lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-3-8B,tensor_parallel_size=2,gpu_memory_utilization=0.9 \
  --tasks mmlu \
  --batch_size auto
```
API-Based Evaluation
```bash
# Evaluate OpenAI chat models (requires OPENAI_API_KEY)
lm_eval --model openai-chat-completions \
  --model_args model=gpt-4 \
  --tasks mmlu \
  --num_fewshot 5

# Evaluate Anthropic models (requires ANTHROPIC_API_KEY)
lm_eval --model anthropic-chat-completions \
  --model_args model=claude-3-sonnet-20240229 \
  --tasks mmlu,gsm8k
```
Custom Task
```yaml
# tasks/my_task/my_task.yaml
task: my_custom_eval
dataset_path: my_org/my_dataset
dataset_name: default
output_type: multiple_choice
training_split: train
test_split: test
num_fewshot: 5
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
doc_to_choice: ["A", "B", "C", "D"]
```
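The `doc_to_text` and `doc_to_target` fields are templates rendered against each dataset row. A rough sketch of that rendering, simplified to plain `{{field}}` substitution (the harness actually uses full Jinja2, so this is only an approximation):

```python
import re

def render(template: str, doc: dict) -> str:
    """Replace {{field}} placeholders with values from the doc.
    (lm-eval uses Jinja2; this sketch handles only bare field names.)"""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(doc[m.group(1)]), template)

doc = {"question": "What is 2 + 2?", "answer": "B"}
prompt = render("Question: {{question}}\nAnswer:", doc)
target = render("{{answer}}", doc)
print(prompt)
print(target)
```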
Configuration Reference
| Parameter | Default | Description |
|---|---|---|
| `--model` | Required | Backend: `hf`, `vllm`, `openai-completions`, etc. |
| `--tasks` | Required | Comma-separated task names |
| `--num_fewshot` | Task default | Number of few-shot examples |
| `--batch_size` | 1 | Batch size (use `auto` for automatic) |
| `--device` | `cuda:0` | Device to run on |
| `--output_path` | None | Directory for results JSON |
| `--limit` | None | Limit samples per task (for testing) |
| `--log_samples` | False | Save per-sample predictions |
| `--gen_kwargs` | None | Generation kwargs (temperature, etc.) |
Best Practices
- Use `--batch_size auto`: automatically finds the largest batch size that fits in memory
- Match few-shot counts to the leaderboard: different n-shot settings give very different results
- Use the vLLM backend for speed: 2-5x faster than the standard HuggingFace backend
- Log samples for error analysis: `--log_samples` saves each prediction for debugging
- Run with `--limit 10` first: a quick test before the full evaluation
- Use normalized accuracy where available: `acc_norm` adjusts for answer-length bias
- Compare at the same precision: bf16 vs. fp32 can cause small differences
- Report the full config: include model revision, few-shot count, and batch size
- Use task groups: `--tasks leaderboard` runs all leaderboard tasks
- Cache model outputs: `--use_cache` avoids redundant inference when re-evaluating
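The `acc_norm` metric divides each choice's log-likelihood by the length of its continuation before taking the argmax, so longer answers are not penalized simply for having more tokens to get wrong. A sketch of the two selection rules (the log-likelihood values are made up for illustration):

```python
def pick(loglikelihoods: list[float], choices: list[str],
         normalize: bool) -> int:
    """Argmax over answer choices; acc_norm-style selection divides each
    log-likelihood by the byte length of its continuation."""
    scores = [
        ll / len(c.encode("utf-8")) if normalize else ll
        for ll, c in zip(loglikelihoods, choices)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

choices = ["yes", "absolutely, without question"]
lls = [-4.0, -9.0]  # raw log-likelihoods favor the short answer
print(pick(lls, choices, normalize=False))  # picks index 0 ("yes")
print(pick(lls, choices, normalize=True))   # picks index 1 (longer answer)
```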
Troubleshooting
CUDA OOM
```bash
# Reduce batch size
--batch_size 1

# Or use quantization
--model_args pretrained=model,load_in_4bit=True

# Or use model parallelism with vLLM
--model vllm --model_args pretrained=model,tensor_parallel_size=4
```
Results differ from leaderboard
```bash
# Ensure matching configuration
# Check: model revision, num_fewshot, normalization (acc vs. acc_norm)

# Pin the exact model revision
--model_args pretrained=model,revision=main

# And match the exact few-shot count with --num_fewshot
```
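When numbers still do not match, diffing two results files metric-by-metric narrows the cause quickly. A sketch, where the dict shapes mirror the harness's `results` key and the tolerance is an arbitrary choice:

```python
def diff_results(a: dict, b: dict, tol: float = 0.005) -> dict[str, float]:
    """Return {task/metric: delta} for metrics present in both runs whose
    absolute difference exceeds tol; stderr and alias entries are ignored."""
    deltas = {}
    for task in a["results"].keys() & b["results"].keys():
        for key, va in a["results"][task].items():
            vb = b["results"][task].get(key)
            if key == "alias" or "stderr" in key or vb is None:
                continue
            if abs(va - vb) > tol:
                deltas[f"{task}/{key}"] = round(va - vb, 4)
    return deltas

# Illustrative runs: a local result vs. a leaderboard result
run_a = {"results": {"mmlu": {"acc,none": 0.652, "acc_stderr,none": 0.004}}}
run_b = {"results": {"mmlu": {"acc,none": 0.641, "acc_stderr,none": 0.004}}}
print(diff_results(run_a, run_b))  # {'mmlu/acc,none': 0.011}
```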
Task not found
```bash
# List all available tasks
lm_eval --tasks list

# Or search
lm_eval --tasks list | grep mmlu
```