BigCode Evaluation Harness
Overview
A comprehensive skill for evaluating code generation models using the BigCode Evaluation Harness — the standard framework for benchmarking LLMs on coding tasks including HumanEval, MBPP, DS-1000, and MultiPL-E. Covers model evaluation, benchmark configuration, custom task creation, and result analysis for comparing code generation capabilities across models.
When to Use
- Evaluating code generation model quality
- Benchmarking models on HumanEval, MBPP, or other coding benchmarks
- Comparing fine-tuned models against baselines
- Running multi-language code evaluation (Python, JavaScript, Java, etc.)
- Creating custom code evaluation benchmarks
- Measuring pass@k metrics for code completion
Quick Start
```bash
# Install
git clone https://github.com/bigcode-project/bigcode-evaluation-harness
cd bigcode-evaluation-harness
pip install -e .

# Run HumanEval evaluation
accelerate launch main.py \
  --model meta-llama/CodeLlama-7b-hf \
  --tasks humaneval \
  --do_sample True \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10 \
  --allow_code_execution
```
Core Benchmarks
| Benchmark | Tasks | Languages | Metric | Description |
|---|---|---|---|---|
| HumanEval | 164 | Python | pass@k | Hand-written function completion |
| HumanEval+ | 164 | Python | pass@k | Extended test cases (80x more) |
| MBPP | 500 | Python | pass@k | Crowd-sourced basic programs |
| MBPP+ | 500 | Python | pass@k | Enhanced test coverage |
| MultiPL-E | 164 | 18 langs | pass@k | Multi-language HumanEval |
| DS-1000 | 1000 | Python | pass@1 | Data science libraries |
| APPS | 10000 | Python | pass@k | Competition-level problems |
Evaluation Commands
HumanEval with Multiple Samples
```bash
# Generate 200 samples for pass@1, pass@10, pass@100
accelerate launch main.py \
  --model codellama/CodeLlama-7b-hf \
  --tasks humaneval \
  --do_sample True \
  --temperature 0.8 \
  --top_p 0.95 \
  --n_samples 200 \
  --batch_size 20 \
  --max_length_generation 512 \
  --allow_code_execution \
  --save_generations \
  --save_generations_path ./results/codellama-7b-humaneval.json \
  --metric_output_path ./results/codellama-7b-metrics.json
```
Multi-Language Evaluation
```bash
# Evaluate on Python, JavaScript, Java, C++, Rust
# (MultiPL-E tasks use short language codes: multiple-py, multiple-js, ...)
for lang in py js java cpp rs; do
  accelerate launch main.py \
    --model codellama/CodeLlama-7b-hf \
    --tasks multiple-$lang \
    --do_sample True \
    --temperature 0.2 \
    --n_samples 20 \
    --allow_code_execution \
    --metric_output_path ./results/codellama-7b-$lang.json
done
```
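Each run above writes one metrics JSON per language. A small helper can merge them into a single comparison table. This is a sketch that assumes each metrics file is a JSON object keyed by task name, with a dict of pass@k scores per task plus a `config` entry; check one of your own output files to confirm the shape before relying on it:

```python
import glob
import json

def collect_metrics(pattern="./results/codellama-7b-*.json"):
    """Merge per-language metric files into {task_name: {metric: value}}."""
    table = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            data = json.load(f)
        # Keep every task entry; skip the run configuration.
        for task, scores in data.items():
            if task == "config" or not isinstance(scores, dict):
                continue
            table[task] = scores
    return table

if __name__ == "__main__":
    for task, scores in collect_metrics().items():
        row = "  ".join(f"{k}={v:.4f}" for k, v in sorted(scores.items()))
        print(f"{task:20s} {row}")
```

The same pattern works for comparing several models: point `pattern` at each model's result files and print the tables side by side.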
Custom Model Evaluation
```bash
# Evaluate a local or custom model on multiple tasks (comma-separated)
accelerate launch main.py \
  --model ./my-fine-tuned-model \
  --tasks humaneval,mbpp \
  --do_sample True \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10 \
  --allow_code_execution \
  --trust_remote_code
```
Understanding pass@k
```python
import numpy as np

def pass_at_k(n, c, k):
    """
    Unbiased pass@k estimator.
    n: total samples generated
    c: number of correct samples
    k: k in pass@k
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples, 140 correct
print(f"pass@1:   {pass_at_k(200, 140, 1):.4f}")    # 0.7000
print(f"pass@10:  {pass_at_k(200, 140, 10):.4f}")   # ~1.0000
print(f"pass@100: {pass_at_k(200, 140, 100):.4f}")  # 1.0000
```
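The product form above is the numerically stable way to compute the estimator for large n. It is algebraically identical to the closed form 1 − C(n−c, k)/C(n, k), i.e. one minus the probability that all k drawn samples are incorrect, which makes a convenient cross-check:

```python
from math import comb

def pass_at_k_exact(n, c, k):
    """Closed form of pass@k: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than k: some draw must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)

print(f"{pass_at_k_exact(200, 140, 1):.4f}")  # 0.7000
```

Both forms should agree to floating-point precision for any (n, c, k).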
Configuration Reference
| Parameter | Default | Description |
|---|---|---|
| `--tasks` | Required | Benchmark names (humaneval, mbpp, etc.) |
| `--n_samples` | 1 | Number of completions per task |
| `--temperature` | 0.2 | Sampling temperature |
| `--top_p` | 0.95 | Nucleus sampling threshold |
| `--max_length_generation` | 512 | Max tokens to generate |
| `--batch_size` | 1 | Inference batch size |
| `--allow_code_execution` | False | Enable code execution for testing |
| `--save_generations` | False | Save generated code to file |
| `--precision` | fp32 | Model precision (fp16, bf16, fp32) |
Best Practices
- Use temperature 0.2 for pass@1 — Low temperature for best single-shot accuracy
- Use temperature 0.8 for pass@k (k>1) — Higher temperature for diverse samples
- Generate 200 samples for pass@100 — Need n > k for reliable estimation
- Run in a sandboxed environment — `--allow_code_execution` runs untrusted generated code
- Use HumanEval+ over HumanEval — More thorough test cases catch edge-case failures
- Compare with consistent settings — Same temperature, top_p, and n_samples across models
- Report confidence intervals — pass@k is an estimate; report standard deviation
- Test multi-language — Python-only benchmarks miss language-specific strengths
- Use batch evaluation — Higher batch_size speeds up inference significantly
- Save all generations — Enables post-hoc analysis and error categorization
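For the confidence-interval practice above, one simple approach is to bootstrap over tasks: resample the per-task correctness counts with replacement and recompute pass@k on each resample. A minimal sketch, where `fake` is hypothetical data standing in for your real (n_samples, n_correct) counts per task and `pass_at_k` is the standard estimator:

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator for a single task."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

def bootstrap_pass_at_k(results, k, n_boot=2000, seed=0):
    """results: list of (n_samples, n_correct) pairs, one per task.
    Returns (mean, 2.5th percentile, 97.5th percentile) over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    scores = np.array([pass_at_k(n, c, k) for n, c in results])
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_boot)]
    return scores.mean(), np.percentile(means, 2.5), np.percentile(means, 97.5)

# Hypothetical per-task counts: 164 tasks, 20 samples each
rng = np.random.default_rng(1)
fake = [(20, int(rng.integers(0, 21))) for _ in range(164)]
mean, lo, hi = bootstrap_pass_at_k(fake, k=1)
print(f"pass@1 = {mean:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

Reporting the interval alongside the point estimate makes small between-model differences on 164-task benchmarks much easier to judge.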
Troubleshooting
Code execution timeout
```bash
# Increase timeout for complex problems
--execution_timeout 60  # seconds; default is usually 10
```
CUDA OOM during evaluation
```bash
# Reduce batch size and use lower precision
--batch_size 1 --precision bf16

# Or reduce max_length_generation
--max_length_generation 256
```
Results not reproducible
```bash
# Set seed for reproducibility
--seed 42

# Note: do_sample=True with temperature > 0 adds randomness.
# For exact reproducibility, use do_sample=False (greedy decoding).
```