
BigCode Evaluation Harness

Overview

A comprehensive skill for evaluating code generation models using the BigCode Evaluation Harness — the standard framework for benchmarking LLMs on coding tasks including HumanEval, MBPP, DS-1000, and MultiPL-E. Covers model evaluation, benchmark configuration, custom task creation, and result analysis for comparing code generation capabilities across models.

When to Use

  • Evaluating code generation model quality
  • Benchmarking models on HumanEval, MBPP, or other coding benchmarks
  • Comparing fine-tuned models against baselines
  • Running multi-language code evaluation (Python, JavaScript, Java, etc.)
  • Creating custom code evaluation benchmarks
  • Measuring pass@k metrics for code completion

Quick Start

```bash
# Install
git clone https://github.com/bigcode-project/bigcode-evaluation-harness
cd bigcode-evaluation-harness
pip install -e .

# Run HumanEval evaluation
accelerate launch main.py \
  --model meta-llama/CodeLlama-7b-hf \
  --tasks humaneval \
  --do_sample True \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10 \
  --allow_code_execution
```

Core Benchmarks

| Benchmark | Tasks | Languages | Metric | Description |
| --- | --- | --- | --- | --- |
| HumanEval | 164 | Python | pass@k | Hand-written function completion |
| HumanEval+ | 164 | Python | pass@k | Extended test cases (80x more) |
| MBPP | 500 | Python | pass@k | Crowd-sourced basic programs |
| MBPP+ | 500 | Python | pass@k | Enhanced test coverage |
| MultiPL-E | 164 | 18 langs | pass@k | Multi-language HumanEval |
| DS-1000 | 1000 | Python | pass@1 | Data science libraries |
| APPS | 10000 | Python | pass@k | Competition-level problems |
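All of these benchmarks share the same shape: a prompt, a candidate completion, and unit tests that the completed code must pass. A minimal sketch of how a HumanEval-style task is checked (the field names and the `run_task` helper are illustrative, not the harness's internal API; the real harness sandboxes execution):

```python
# Minimal sketch of HumanEval-style checking: a task is a prompt plus a
# unit-test function run against the generated completion.
# (Field names here are illustrative; the real harness sandboxes execution.)

task = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "test": (
        "def check(candidate):\n"
        "    assert candidate(2, 3) == 5\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    "entry_point": "add",
}

completion = "    return a + b\n"

def run_task(task, completion):
    """Execute prompt + completion, then run the task's tests."""
    namespace = {}
    exec(task["prompt"] + completion, namespace)  # define the function
    exec(task["test"], namespace)                 # define check()
    try:
        namespace["check"](namespace[task["entry_point"]])
        return True
    except AssertionError:
        return False

print(run_task(task, completion))  # True
```

A completion is counted as correct only if every assertion passes, which is why extended test suites like HumanEval+ can flip previously "passing" solutions to failing.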

Evaluation Commands

HumanEval with Multiple Samples

```bash
# Generate 200 samples for pass@1, pass@10, pass@100
accelerate launch main.py \
  --model codellama/CodeLlama-7b-hf \
  --tasks humaneval \
  --do_sample True \
  --temperature 0.8 \
  --top_p 0.95 \
  --n_samples 200 \
  --batch_size 20 \
  --max_length_generation 512 \
  --allow_code_execution \
  --save_generations \
  --save_generations_path ./results/codellama-7b-humaneval.json \
  --metric_output_path ./results/codellama-7b-metrics.json
```

Multi-Language Evaluation

```bash
# Evaluate on Python, JavaScript, Java, C++, Rust
for lang in python javascript java cpp rust; do
  accelerate launch main.py \
    --model codellama/CodeLlama-7b-hf \
    --tasks multiple-$lang \
    --do_sample True \
    --temperature 0.2 \
    --n_samples 20 \
    --allow_code_execution \
    --metric_output_path ./results/codellama-7b-$lang.json
done
```
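The loop leaves one metrics file per language. A small sketch for collecting those files into a single summary; the JSON layout assumed here (`{task_name: {metric: value}}`) and the scores are illustrative, so check your harness version's actual output format:

```python
import json
import pathlib
import tempfile

# Stand-ins for the per-language files the loop above would have written.
# Layout {task: {metric: value}} is an assumption; scores are made up.
results_dir = pathlib.Path(tempfile.mkdtemp())
for lang, score in [("python", 0.31), ("javascript", 0.28), ("rust", 0.19)]:
    payload = {f"multiple-{lang}": {"pass@1": score}}
    (results_dir / f"codellama-7b-{lang}.json").write_text(json.dumps(payload))

# Collect pass@1 per language into one dict.
summary = {}
for path in sorted(results_dir.glob("codellama-7b-*.json")):
    data = json.loads(path.read_text())
    for task, metrics in data.items():
        if task.startswith("multiple-"):
            summary[task.removeprefix("multiple-")] = metrics["pass@1"]

# Print languages sorted by score, best first.
for lang, score in sorted(summary.items(), key=lambda kv: -kv[1]):
    print(f"{lang:<12} pass@1 = {score:.2f}")
```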

Custom Model Evaluation

```bash
# Evaluate a local model or custom model
accelerate launch main.py \
  --model ./my-fine-tuned-model \
  --tasks humaneval mbpp \
  --do_sample True \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10 \
  --allow_code_execution \
  --trust_remote_code
```

Understanding pass@k

```python
import numpy as np

def pass_at_k(n, c, k):
    """Calculate the unbiased pass@k estimator.

    n: total samples generated
    c: number of correct samples
    k: k in pass@k
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples, 140 correct
print(f"pass@1: {pass_at_k(200, 140, 1):.4f}")     # 0.7000
print(f"pass@10: {pass_at_k(200, 140, 10):.4f}")   # ~1.0000 (only 60 failures)
print(f"pass@100: {pass_at_k(200, 140, 100):.4f}") # 1.0000 (n - c < k)
```
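The product form above is numerically stable; for intuition, it is equivalent to the closed form 1 − C(n−c, k) / C(n, k), the probability that a random draw of k samples contains at least one correct one. A quick cross-check of the two forms (the `pass_at_k_exact` name is just for this sketch):

```python
from math import comb

def pass_at_k_exact(n, c, k):
    """pass@k via the closed form 1 - C(n-c, k) / C(n, k).

    This is the probability that at least one of k randomly drawn
    samples (out of n, of which c are correct) passes.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Matches the numerically stable product form for the same inputs:
print(f"{pass_at_k_exact(200, 140, 1):.4f}")   # 0.7000
print(f"{pass_at_k_exact(200, 140, 10):.4f}")  # 1.0000 (only 60 failures)
```

Note that pass@1 reduces to c/n, which is why pass@1 with many samples is just the average per-sample success rate.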

Configuration Reference

| Parameter | Default | Description |
| --- | --- | --- |
| --tasks | Required | Benchmark names (humaneval, mbpp, etc.) |
| --n_samples | 1 | Number of completions per task |
| --temperature | 0.2 | Sampling temperature |
| --top_p | 0.95 | Nucleus sampling threshold |
| --max_length_generation | 512 | Max tokens to generate |
| --batch_size | 1 | Inference batch size |
| --allow_code_execution | False | Enable code execution for testing |
| --save_generations | False | Save generated code to file |
| --precision | fp32 | Model precision (fp16, bf16, fp32) |

Best Practices

  1. Use temperature 0.2 for pass@1 — Low temperature for best single-shot accuracy
  2. Use temperature 0.8 for pass@k (k>1) — Higher temperature for diverse samples
  3. Generate 200 samples for pass@100 — Need n > k for reliable estimation
  4. Run in a sandboxed environment — --allow_code_execution executes untrusted generated code
  5. Use HumanEval+ over HumanEval — More thorough test cases catch edge case failures
  6. Compare with consistent settings — Same temperature, top_p, and n_samples across models
  7. Report confidence intervals — pass@k is an estimate; report standard deviation
  8. Test multi-language — Python-only benchmarks miss language-specific strengths
  9. Use batch evaluation — Higher batch_size speeds up inference significantly
  10. Save all generations — Enables post-hoc analysis and error categorization
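Practice 7 deserves a concrete recipe: since pass@k is an estimate over a finite task set, a percentile bootstrap over per-task scores gives a simple confidence interval. A sketch with made-up per-task results (the `bootstrap_ci` helper and the 110/164 split are illustrative):

```python
import random

random.seed(0)

# Toy per-task pass@1 values: 110 of 164 HumanEval-sized tasks solved.
per_task = [1.0] * 110 + [0.0] * 54

def bootstrap_ci(values, n_resamples=2000, alpha=0.05):
    """Percentile bootstrap CI for the mean of `values`."""
    means = []
    for _ in range(n_resamples):
        sample = random.choices(values, k=len(values))  # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

lo, hi = bootstrap_ci(per_task)
print(f"pass@1 = {sum(per_task) / len(per_task):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

On 164 tasks the interval is a few points wide, which is why small pass@1 differences between two models often are not meaningful.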

Troubleshooting

Code execution timeout

```bash
# Increase timeout for complex problems (default is usually 10 seconds)
--execution_timeout 60
```

CUDA OOM during evaluation

```bash
# Reduce batch size and use lower precision
--batch_size 1 --precision bf16

# Or reduce the generation length
--max_length_generation 256
```

Results not reproducible

```bash
# Set seed for reproducibility
--seed 42

# Note: do_sample=True with temperature > 0 still adds randomness.
# For exact reproducibility, use do_sample=False (greedy decoding).
```