LM Evaluation Harness (lm-eval)
Overview
A comprehensive skill for evaluating language models with EleutherAI's LM Evaluation Harness, the most widely used open-source framework for benchmarking LLMs. It powers the HuggingFace Open LLM Leaderboard and supports 200+ benchmarks, including MMLU, ARC, HellaSwag, GSM8K, TruthfulQA, and Winogrande, providing standardized, reproducible evaluation of any model accessible via HuggingFace Transformers, vLLM, or an API.
When to Use
- Benchmarking LLMs on standard academic benchmarks
- Comparing model quality before and after fine-tuning
- Reproducing Open LLM Leaderboard results
- Evaluating across reasoning, knowledge, and language understanding
- Running custom evaluation tasks
- Automated quality testing in CI/CD
Quick Start
```bash
# Install
pip install lm-eval

# Run an evaluation
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3-8B \
  --tasks mmlu,hellaswag,arc_challenge \
  --batch_size auto \
  --output_path ./results

# List available tasks
lm_eval --tasks list
```
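After a run, `--output_path` writes a JSON file whose `results` key maps each task to its metrics, with keys shaped like `metric,filter` (e.g. `acc,none`). A minimal sketch of flattening such a file into a readable summary; the sample dict below is illustrative and the exact fields vary by harness version:

```python
def summarize(results: dict) -> list[tuple[str, str, float]]:
    """Flatten an lm-eval results dict into (task, metric, value) rows,
    skipping stderr entries and the 'alias' field."""
    rows = []
    for task, metrics in results["results"].items():
        for key, value in metrics.items():
            if key == "alias" or "stderr" in key:
                continue
            # Keys look like "acc,none"; keep only the metric name.
            rows.append((task, key.split(",")[0], value))
    return sorted(rows)

# Shaped like a (truncated) file written by --output_path;
# in practice: results = json.load(open(path))
sample = {
    "results": {
        "hellaswag": {"alias": "hellaswag", "acc,none": 0.571,
                      "acc_stderr,none": 0.005, "acc_norm,none": 0.753},
        "gsm8k": {"alias": "gsm8k", "exact_match,strict-match": 0.492},
    }
}
for task, metric, value in summarize(sample):
    print(f"{task:12s} {metric:12s} {value:.3f}")
```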
Key Benchmarks
| Benchmark | Tasks | Metric | Category |
|---|---|---|---|
| MMLU | 57 subjects | Accuracy | Knowledge |
| ARC-Challenge | 1172 | Accuracy (norm) | Science reasoning |
| HellaSwag | 10042 | Accuracy (norm) | Commonsense |
| TruthfulQA | 817 | MC1/MC2 accuracy | Truthfulness |
| GSM8K | 1319 | Exact match | Math |
| Winogrande | 1267 | Accuracy | Coreference |
| GPQA | 448 | Accuracy | PhD-level reasoning |
| IFEval | 541 | Strict/loose accuracy | Instruction following |
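Leaderboard-style headline numbers are typically just an unweighted macro-average of the per-benchmark scores above. A trivial sketch (the scores shown are illustrative, not real model results):

```python
def leaderboard_average(scores: dict[str, float]) -> float:
    """Unweighted mean across benchmarks, as leaderboards typically report."""
    return sum(scores.values()) / len(scores)

scores = {"mmlu": 0.68, "arc_challenge": 0.60, "hellaswag": 0.82,
          "truthfulqa_mc2": 0.45, "winogrande": 0.77, "gsm8k": 0.56}
print(f"Average: {leaderboard_average(scores):.4f}")
```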
Evaluation Commands
Open LLM Leaderboard Setup
```bash
# Reproduce the original (V1) HuggingFace Open LLM Leaderboard tasks.
# --num_fewshot takes a single value, so run each task with its own
# shot count: ARC 25, HellaSwag 10, MMLU 5, TruthfulQA 0,
# Winogrande 5, GSM8K 5. For example:
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3-8B-Instruct \
  --tasks arc_challenge \
  --num_fewshot 25 \
  --batch_size auto:4 \
  --output_path ./leaderboard_results

# For Leaderboard V2, the `leaderboard` task group runs all V2 tasks
# with their standard settings:
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3-8B-Instruct \
  --tasks leaderboard \
  --batch_size auto:4 \
  --output_path ./leaderboard_results
```
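Because `--num_fewshot` accepts only one value per invocation, tasks with different shot counts are easiest to run from a small driver loop. A sketch; the task/shot pairs follow the original V1 leaderboard and the model name is illustrative:

```python
import shlex
import subprocess

# (task, n_shot) pairs for the original Open LLM Leaderboard
LEADERBOARD_V1 = [("arc_challenge", 25), ("hellaswag", 10), ("mmlu", 5),
                  ("truthfulqa_mc2", 0), ("winogrande", 5), ("gsm8k", 5)]

def build_cmd(model: str, task: str, n_shot: int, out_dir: str) -> list[str]:
    """Assemble one lm_eval invocation as an argv list."""
    return ["lm_eval", "--model", "hf",
            "--model_args", f"pretrained={model}",
            "--tasks", task, "--num_fewshot", str(n_shot),
            "--batch_size", "auto", "--output_path", out_dir]

for task, n in LEADERBOARD_V1:
    cmd = build_cmd("meta-llama/Llama-3-8B-Instruct", task, n,
                    "./leaderboard_results")
    print(shlex.join(cmd))           # inspect before running
    # subprocess.run(cmd, check=True)  # uncomment to execute
```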
vLLM Backend (Faster)
```bash
# Use vLLM for much faster inference
lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-3-8B,tensor_parallel_size=2,gpu_memory_utilization=0.9 \
  --tasks mmlu \
  --batch_size auto
```
API-Based Evaluation
```bash
# Evaluate OpenAI chat models (requires OPENAI_API_KEY)
lm_eval --model openai-chat-completions \
  --model_args model=gpt-4 \
  --tasks mmlu \
  --num_fewshot 5

# Evaluate Anthropic models (requires ANTHROPIC_API_KEY)
lm_eval --model anthropic-chat-completions \
  --model_args model=claude-3-sonnet-20240229 \
  --tasks mmlu,gsm8k
```
Custom Task
```yaml
# tasks/my_task/my_task.yaml
task: my_custom_eval
dataset_path: my_org/my_dataset
dataset_name: default
output_type: multiple_choice
training_split: train
test_split: test
num_fewshot: 5
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
doc_to_choice: ["A", "B", "C", "D"]
```
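The `doc_to_text` and `doc_to_target` fields are templates rendered against each dataset row. A rough sketch of that rendering, simplified to plain `{{field}}` substitution (the harness actually uses full Jinja2, so this is only an approximation):

```python
import re

def render(template: str, doc: dict) -> str:
    """Replace {{field}} placeholders with values from the doc.
    (lm-eval uses Jinja2; this sketch handles only bare field names.)"""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(doc[m.group(1)]), template)

doc = {"question": "What is 2 + 2?", "answer": "B"}
prompt = render("Question: {{question}}\nAnswer:", doc)
target = render("{{answer}}", doc)
print(prompt)
print(target)
```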
Configuration Reference
| Parameter | Default | Description |
|---|---|---|
| `--model` | Required | Backend: `hf`, `vllm`, `openai-completions`, etc. |
| `--tasks` | Required | Comma-separated task names |
| `--num_fewshot` | Task default | Number of few-shot examples |
| `--batch_size` | 1 | Batch size (use `auto` for automatic) |
| `--device` | `cuda:0` | Device to run on |
| `--output_path` | None | Directory for results JSON |
| `--limit` | None | Limit samples per task (for testing) |
| `--log_samples` | False | Save per-sample predictions |
| `--gen_kwargs` | None | Generation kwargs (temperature, etc.) |
Best Practices
- Use `--batch_size auto`: automatically finds the largest batch size that fits in memory
- Match few-shot counts to the leaderboard: different n-shot settings give very different results
- Use the vLLM backend for speed: 2-5x faster than the standard HuggingFace backend
- Log samples for error analysis: `--log_samples` saves each prediction for debugging
- Run with `--limit 10` first: a quick test before the full evaluation
- Use normalized accuracy where available: `acc_norm` adjusts for answer-length bias
- Compare at the same precision: bf16 vs. fp32 can cause small differences
- Report the full config: include model revision, few-shot count, and batch size
- Use task groups: `--tasks leaderboard` runs all leaderboard tasks
- Cache model outputs: `--use_cache` avoids redundant inference when re-evaluating
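The `acc_norm` metric divides each choice's log-likelihood by the length of its continuation before taking the argmax, so longer answers are not penalized simply for having more tokens to get wrong. A sketch of the two selection rules (the log-likelihood values are made up for illustration):

```python
def pick(loglikelihoods: list[float], choices: list[str],
         normalize: bool) -> int:
    """Argmax over answer choices; acc_norm-style selection divides each
    log-likelihood by the byte length of its continuation."""
    scores = [
        ll / len(c.encode("utf-8")) if normalize else ll
        for ll, c in zip(loglikelihoods, choices)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

choices = ["yes", "absolutely, without question"]
lls = [-4.0, -9.0]  # raw log-likelihoods favor the short answer
print(pick(lls, choices, normalize=False))  # picks index 0 ("yes")
print(pick(lls, choices, normalize=True))   # picks index 1 (longer answer)
```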
Troubleshooting
CUDA OOM
```bash
# Reduce batch size
--batch_size 1

# Or use quantization
--model_args pretrained=model,load_in_4bit=True

# Or use model parallelism with vLLM
--model vllm --model_args pretrained=model,tensor_parallel_size=4
```
Results differ from leaderboard
```bash
# Ensure matching configuration
# Check: model revision, num_fewshot, normalization (acc vs. acc_norm)

# Pin the exact model revision
--model_args pretrained=model,revision=main

# And match the exact few-shot count with --num_fewshot
```
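When numbers still do not match, diffing two results files metric-by-metric narrows the cause quickly. A sketch, where the dict shapes mirror the harness's `results` key and the tolerance is an arbitrary choice:

```python
def diff_results(a: dict, b: dict, tol: float = 0.005) -> dict[str, float]:
    """Return {task/metric: delta} for metrics present in both runs whose
    absolute difference exceeds tol; stderr and alias entries are ignored."""
    deltas = {}
    for task in a["results"].keys() & b["results"].keys():
        for key, va in a["results"][task].items():
            vb = b["results"][task].get(key)
            if key == "alias" or "stderr" in key or vb is None:
                continue
            if abs(va - vb) > tol:
                deltas[f"{task}/{key}"] = round(va - vb, 4)
    return deltas

# Illustrative runs: a local result vs. a leaderboard result
run_a = {"results": {"mmlu": {"acc,none": 0.652, "acc_stderr,none": 0.004}}}
run_b = {"results": {"mmlu": {"acc,none": 0.641, "acc_stderr,none": 0.004}}}
print(diff_results(run_a, run_b))  # {'mmlu/acc,none': 0.011}
```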
Task not found
```bash
# List all available tasks
lm_eval --tasks list

# Or search
lm_eval --tasks list | grep mmlu
```