
LM Evaluation Harness (lm-eval)

Overview

A comprehensive skill for evaluating language models with EleutherAI's LM Evaluation Harness — the most widely used open-source framework for benchmarking LLMs. It powers the HuggingFace Open LLM Leaderboard and supports 200+ benchmarks, including MMLU, ARC, HellaSwag, GSM8K, TruthfulQA, and Winogrande. It provides standardized, reproducible evaluation for any model accessible via HuggingFace, vLLM, or an API.

When to Use

  • Benchmarking LLMs on standard academic benchmarks
  • Comparing model quality before and after fine-tuning
  • Reproducing Open LLM Leaderboard results
  • Evaluating across reasoning, knowledge, and language understanding
  • Running custom evaluation tasks
  • Automated quality testing in CI/CD

Quick Start

```bash
# Install
pip install lm-eval

# Run an evaluation
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3-8B \
  --tasks mmlu,hellaswag,arc_challenge \
  --batch_size auto \
  --output_path ./results

# List available tasks
lm_eval --tasks list
```

Key Benchmarks

| Benchmark | Size | Metric | Category |
|---|---|---|---|
| MMLU | 57 subjects | Accuracy | Knowledge |
| ARC-Challenge | 1,172 items | Accuracy (norm) | Science reasoning |
| HellaSwag | 10,042 items | Accuracy (norm) | Commonsense |
| TruthfulQA | 817 items | MC1/MC2 accuracy | Truthfulness |
| GSM8K | 1,319 items | Exact match | Math |
| Winogrande | 1,267 items | Accuracy | Coreference |
| GPQA | 448 items | Accuracy | PhD-level reasoning |
| IFEval | 541 items | Strict/loose accuracy | Instruction following |
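Several benchmarks above report "Accuracy (norm)" (`acc_norm`). For multiple-choice tasks the harness scores each answer choice by its total log-likelihood; normalized accuracy divides that score by the choice's byte length so longer answers are not penalized. A minimal sketch of the idea, with made-up log-likelihoods:

```python
def pick(loglikelihoods, choices, normalize=False):
    """Return the index of the best-scoring choice.

    With normalize=True, each log-likelihood is divided by the choice's
    byte length (the idea behind lm-eval's acc_norm metric).
    """
    scores = [
        ll / len(c.encode("utf-8")) if normalize else ll
        for ll, c in zip(loglikelihoods, choices)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

choices = ["yes", "yes, because the premise entails it"]
lls = [-4.0, -12.0]  # longer continuations accrue more negative log-likelihood

print(pick(lls, choices))                  # raw score: short answer wins
print(pick(lls, choices, normalize=True))  # normalized: long answer wins
```

This is why `acc` and `acc_norm` can disagree on benchmarks like HellaSwag, where correct completions tend to be longer.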

Evaluation Commands

Open LLM Leaderboard (V1) Setup

```bash
# Reproduce the original (V1) HuggingFace Open LLM Leaderboard tasks.
# --num_fewshot takes a single value, so run each few-shot setting
# separately (5-shot: MMLU, Winogrande, GSM8K; 25-shot: ARC-Challenge;
# 10-shot: HellaSwag; 0-shot: TruthfulQA).
for spec in "mmlu 5" "winogrande 5" "gsm8k 5" \
            "arc_challenge 25" "hellaswag 10" "truthfulqa_mc2 0"; do
  set -- $spec
  lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3-8B-Instruct \
    --tasks "$1" \
    --num_fewshot "$2" \
    --batch_size auto:4 \
    --output_path ./leaderboard_results
done

# The V2 leaderboard is available as a task group: --tasks leaderboard
```
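The V1 leaderboard score is the simple mean of the six per-task metrics. A quick sketch of the aggregation (the scores below are made up for illustration):

```python
# Hypothetical per-task scores (percent); the leaderboard average is
# an unweighted mean across the six tasks.
scores = {
    "arc_challenge": 60.1,
    "hellaswag": 82.3,
    "mmlu": 66.7,
    "truthfulqa_mc2": 51.2,
    "winogrande": 77.4,
    "gsm8k": 68.0,
}

average = sum(scores.values()) / len(scores)
print(round(average, 2))  # 67.62
```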

vLLM Backend (Faster)

```bash
# Use vLLM for much faster inference
lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-3-8B,tensor_parallel_size=2,gpu_memory_utilization=0.9 \
  --tasks mmlu \
  --batch_size auto
```

API-Based Evaluation

```bash
# Evaluate OpenAI chat models (gpt-4 uses the chat API, so use the
# chat-completions backend rather than openai-completions)
lm_eval --model openai-chat-completions \
  --model_args model=gpt-4 \
  --tasks mmlu \
  --num_fewshot 5

# Evaluate Anthropic chat models
lm_eval --model anthropic-chat \
  --model_args model=claude-3-sonnet \
  --tasks mmlu,gsm8k
```

Custom Task

```yaml
# tasks/my_task/my_task.yaml
task: my_custom_eval
dataset_path: my_org/my_dataset
dataset_name: default
output_type: multiple_choice
training_split: train
test_split: test
num_fewshot: 5
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
doc_to_choice: ["A", "B", "C", "D"]
```
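The `doc_to_text` and `doc_to_target` fields are Jinja2 templates that map one dataset row to a prompt and a gold answer. A minimal sketch of that mapping — a toy renderer that only handles bare `{{field}}` substitution (lm-eval uses real Jinja2), applied to a made-up dataset row:

```python
import re

def render(template, doc):
    """Substitute {{field}} placeholders with values from a dataset row.

    Toy stand-in for lm-eval's Jinja2 templating; handles only bare
    {{name}} placeholders.
    """
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(doc[m.group(1)]), template)

doc = {"question": "What is 2 + 2?", "answer": "B"}  # hypothetical row
prompt = render("Question: {{question}}\nAnswer:", doc)
target = render("{{answer}}", doc)

print(prompt)  # Question: What is 2 + 2?\nAnswer:
print(target)  # B
```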

Configuration Reference

| Parameter | Default | Description |
|---|---|---|
| `--model` | Required | Backend: `hf`, `vllm`, `openai-completions`, etc. |
| `--tasks` | Required | Comma-separated task names |
| `--num_fewshot` | Task default | Number of few-shot examples |
| `--batch_size` | 1 | Batch size (use `auto` for automatic) |
| `--device` | `cuda:0` | Device to run on |
| `--output_path` | None | Directory for results JSON |
| `--limit` | None | Limit samples per task (for testing) |
| `--log_samples` | False | Save per-sample predictions |
| `--gen_kwargs` | None | Generation kwargs (temperature, etc.) |
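Runs launched with `--output_path` write a results JSON you can post-process. A minimal sketch of extracting metrics from it — note that the exact schema is version-dependent (recent lm-eval versions key each metric as `"<metric>,<filter>"`, e.g. `"acc,none"`, under `results[<task>]`), so inspect your own output file first; the JSON below is a made-up stand-in:

```python
import json

# Hypothetical stand-in for a file written under --output_path.
raw = json.dumps({
    "results": {
        "mmlu": {"acc,none": 0.667, "acc_stderr,none": 0.004},
        "gsm8k": {"exact_match,strict-match": 0.68},
    }
})

data = json.loads(raw)

# Collect point estimates, dropping stderr entries.
summary = {
    task: {name: val for name, val in metrics.items() if "stderr" not in name}
    for task, metrics in data["results"].items()
}
print(summary)
```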

Best Practices

  1. Use `--batch_size auto` — automatically finds the largest batch size that fits in memory
  2. Match few-shot counts to the leaderboard — different n-shot settings give very different results
  3. Use the vLLM backend for speed — typically 2–5x faster than the standard HuggingFace backend
  4. Log samples for error analysis — `--log_samples` saves each prediction for debugging
  5. Run with `--limit 10` first — a quick test before the full evaluation
  6. Use normalized accuracy where available — `acc_norm` adjusts for answer-length bias
  7. Compare at the same precision — bf16 vs fp32 can cause small differences
  8. Report the full config — include model revision, few-shot count, and batch size
  9. Use task groups — `--tasks leaderboard` runs all leaderboard tasks
  10. Cache model outputs — re-running with `--use_cache` avoids redundant inference
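For the CI/CD use case mentioned earlier, evaluation results can gate merges with a simple regression check against a stored baseline. A minimal sketch — the threshold, metric names, and dictionaries are illustrative, not part of lm-eval itself:

```python
def check_regression(baseline, candidate, tolerance=0.01):
    """Return (task, metric) pairs where the candidate dropped below the
    baseline by more than `tolerance`, or where a metric is missing."""
    failures = []
    for task, metrics in baseline.items():
        for metric, base_value in metrics.items():
            cand_value = candidate.get(task, {}).get(metric)
            if cand_value is None or cand_value < base_value - tolerance:
                failures.append((task, metric))
    return failures

# Hypothetical metric dicts extracted from two lm-eval runs.
baseline = {"mmlu": {"acc": 0.65}, "gsm8k": {"exact_match": 0.62}}
candidate = {"mmlu": {"acc": 0.66}, "gsm8k": {"exact_match": 0.58}}

print(check_regression(baseline, candidate))  # [('gsm8k', 'exact_match')]
```

A non-empty failure list would fail the pipeline step, flagging the fine-tuned candidate's GSM8K regression before release.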

Troubleshooting

CUDA OOM

```bash
# Reduce batch size
--batch_size 1

# Or use quantization
--model_args pretrained=model,load_in_4bit=True

# Or use model parallelism with vLLM
--model vllm --model_args pretrained=model,tensor_parallel_size=4
```

Results differ from leaderboard

```bash
# Ensure a matching configuration.
# Check: model revision, num_fewshot, normalization (acc vs acc_norm).

# Pin the exact model revision
--model_args pretrained=model,revision=main

# And match the exact few-shot count used by the leaderboard
```

Task not found

```bash
# List all available tasks
lm_eval --tasks list

# Or search
lm_eval --tasks list | grep mmlu
```