
BigCode Evaluation Harness

Overview

A comprehensive skill for evaluating code generation models using the BigCode Evaluation Harness — the standard framework for benchmarking LLMs on coding tasks including HumanEval, MBPP, DS-1000, and MultiPL-E. Covers model evaluation, benchmark configuration, custom task creation, and result analysis for comparing code generation capabilities across models.

When to Use

  • Evaluating code generation model quality
  • Benchmarking models on HumanEval, MBPP, or other coding benchmarks
  • Comparing fine-tuned models against baselines
  • Running multi-language code evaluation (Python, JavaScript, Java, etc.)
  • Creating custom code evaluation benchmarks
  • Measuring pass@k metrics for code completion

Quick Start

```bash
# Install
git clone https://github.com/bigcode-project/bigcode-evaluation-harness
cd bigcode-evaluation-harness
pip install -e .

# Run HumanEval evaluation
accelerate launch main.py \
  --model meta-llama/CodeLlama-7b-hf \
  --tasks humaneval \
  --do_sample True \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10 \
  --allow_code_execution
```

Core Benchmarks

| Benchmark | Tasks | Languages | Metric | Description |
| --- | --- | --- | --- | --- |
| HumanEval | 164 | Python | pass@k | Hand-written function completion |
| HumanEval+ | 164 | Python | pass@k | Extended test cases (80x more) |
| MBPP | 500 | Python | pass@k | Crowd-sourced basic programs |
| MBPP+ | 500 | Python | pass@k | Enhanced test coverage |
| MultiPL-E | 164 | 18 langs | pass@k | Multi-language HumanEval |
| DS-1000 | 1000 | Python | pass@1 | Data science libraries |
| APPS | 10000 | Python | pass@k | Competition-level problems |
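All of these benchmarks share the same shape: a prompt, a candidate completion, and unit tests that the completed code must pass. A minimal sketch of how a HumanEval-style task is checked (the field names and the `run_task` helper are illustrative, not the harness's internal API; the real harness sandboxes execution):

```python
# Minimal sketch of HumanEval-style checking: a task is a prompt plus a
# unit-test function run against the generated completion.
# (Field names here are illustrative; the real harness sandboxes execution.)

task = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "test": (
        "def check(candidate):\n"
        "    assert candidate(2, 3) == 5\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    "entry_point": "add",
}

completion = "    return a + b\n"

def run_task(task, completion):
    """Execute prompt + completion, then run the task's tests."""
    namespace = {}
    exec(task["prompt"] + completion, namespace)  # define the function
    exec(task["test"], namespace)                 # define check()
    try:
        namespace["check"](namespace[task["entry_point"]])
        return True
    except AssertionError:
        return False

print(run_task(task, completion))  # True
```

A completion is counted as correct only if every assertion passes, which is why extended test suites like HumanEval+ can flip previously "passing" solutions to failing.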

Evaluation Commands

HumanEval with Multiple Samples

```bash
# Generate 200 samples for pass@1, pass@10, pass@100
accelerate launch main.py \
  --model codellama/CodeLlama-7b-hf \
  --tasks humaneval \
  --do_sample True \
  --temperature 0.8 \
  --top_p 0.95 \
  --n_samples 200 \
  --batch_size 20 \
  --max_length_generation 512 \
  --allow_code_execution \
  --save_generations \
  --save_generations_path ./results/codellama-7b-humaneval.json \
  --metric_output_path ./results/codellama-7b-metrics.json
```

Multi-Language Evaluation

```bash
# Evaluate on Python, JavaScript, Java, C++, Rust
for lang in python javascript java cpp rust; do
  accelerate launch main.py \
    --model codellama/CodeLlama-7b-hf \
    --tasks multiple-$lang \
    --do_sample True \
    --temperature 0.2 \
    --n_samples 20 \
    --allow_code_execution \
    --metric_output_path ./results/codellama-7b-$lang.json
done
```
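The loop leaves one metrics file per language. A small sketch for collecting those files into a single summary; the JSON layout assumed here (`{task_name: {metric: value}}`) and the scores are illustrative, so check your harness version's actual output format:

```python
import json
import pathlib
import tempfile

# Stand-ins for the per-language files the loop above would have written.
# Layout {task: {metric: value}} is an assumption; scores are made up.
results_dir = pathlib.Path(tempfile.mkdtemp())
for lang, score in [("python", 0.31), ("javascript", 0.28), ("rust", 0.19)]:
    payload = {f"multiple-{lang}": {"pass@1": score}}
    (results_dir / f"codellama-7b-{lang}.json").write_text(json.dumps(payload))

# Collect pass@1 per language into one dict.
summary = {}
for path in sorted(results_dir.glob("codellama-7b-*.json")):
    data = json.loads(path.read_text())
    for task, metrics in data.items():
        if task.startswith("multiple-"):
            summary[task.removeprefix("multiple-")] = metrics["pass@1"]

# Print languages sorted by score, best first.
for lang, score in sorted(summary.items(), key=lambda kv: -kv[1]):
    print(f"{lang:<12} pass@1 = {score:.2f}")
```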

Custom Model Evaluation

```bash
# Evaluate a local model or custom model
accelerate launch main.py \
  --model ./my-fine-tuned-model \
  --tasks humaneval mbpp \
  --do_sample True \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10 \
  --allow_code_execution \
  --trust_remote_code
```

Understanding pass@k

```python
import numpy as np

def pass_at_k(n, c, k):
    """Calculate the unbiased pass@k estimator.

    n: total samples generated
    c: number of correct samples
    k: k in pass@k
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples, 140 correct
print(f"pass@1: {pass_at_k(200, 140, 1):.4f}")     # 0.7000
print(f"pass@10: {pass_at_k(200, 140, 10):.4f}")   # ~1.0000 (only 60 failures)
print(f"pass@100: {pass_at_k(200, 140, 100):.4f}") # 1.0000 (n - c < k)
```
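The product form above is numerically stable; for intuition, it is equivalent to the closed form 1 − C(n−c, k) / C(n, k), the probability that a random draw of k samples contains at least one correct one. A quick cross-check of the two forms (the `pass_at_k_exact` name is just for this sketch):

```python
from math import comb

def pass_at_k_exact(n, c, k):
    """pass@k via the closed form 1 - C(n-c, k) / C(n, k).

    This is the probability that at least one of k randomly drawn
    samples (out of n, of which c are correct) passes.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Matches the numerically stable product form for the same inputs:
print(f"{pass_at_k_exact(200, 140, 1):.4f}")   # 0.7000
print(f"{pass_at_k_exact(200, 140, 10):.4f}")  # 1.0000 (only 60 failures)
```

Note that pass@1 reduces to c/n, which is why pass@1 with many samples is just the average per-sample success rate.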

Configuration Reference

| Parameter | Default | Description |
| --- | --- | --- |
| --tasks | Required | Benchmark names (humaneval, mbpp, etc.) |
| --n_samples | 1 | Number of completions per task |
| --temperature | 0.2 | Sampling temperature |
| --top_p | 0.95 | Nucleus sampling threshold |
| --max_length_generation | 512 | Max tokens to generate |
| --batch_size | 1 | Inference batch size |
| --allow_code_execution | False | Enable code execution for testing |
| --save_generations | False | Save generated code to file |
| --precision | fp32 | Model precision (fp16, bf16, fp32) |

Best Practices

  1. Use temperature 0.2 for pass@1 — Low temperature for best single-shot accuracy
  2. Use temperature 0.8 for pass@k (k>1) — Higher temperature for diverse samples
  3. Generate 200 samples for pass@100 — Need n > k for reliable estimation
  4. Run in a sandboxed environment — --allow_code_execution executes untrusted generated code
  5. Use HumanEval+ over HumanEval — More thorough test cases catch edge case failures
  6. Compare with consistent settings — Same temperature, top_p, and n_samples across models
  7. Report confidence intervals — pass@k is an estimate; report standard deviation
  8. Test multi-language — Python-only benchmarks miss language-specific strengths
  9. Use batch evaluation — Higher batch_size speeds up inference significantly
  10. Save all generations — Enables post-hoc analysis and error categorization
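Practice 7 deserves a concrete recipe: since pass@k is an estimate over a finite task set, a percentile bootstrap over per-task scores gives a simple confidence interval. A sketch with made-up per-task results (the `bootstrap_ci` helper and the 110/164 split are illustrative):

```python
import random

random.seed(0)

# Toy per-task pass@1 values: 110 of 164 HumanEval-sized tasks solved.
per_task = [1.0] * 110 + [0.0] * 54

def bootstrap_ci(values, n_resamples=2000, alpha=0.05):
    """Percentile bootstrap CI for the mean of `values`."""
    means = []
    for _ in range(n_resamples):
        sample = random.choices(values, k=len(values))  # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

lo, hi = bootstrap_ci(per_task)
print(f"pass@1 = {sum(per_task) / len(per_task):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

On 164 tasks the interval is a few points wide, which is why small pass@1 differences between two models often are not meaningful.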

Troubleshooting

Code execution timeout

```bash
# Increase timeout for complex problems (default is usually 10 seconds)
--execution_timeout 60
```

CUDA OOM during evaluation

```bash
# Reduce batch size and use lower precision
--batch_size 1 --precision bf16

# Or reduce the generation length
--max_length_generation 256
```

Results not reproducible

```bash
# Set seed for reproducibility
--seed 42

# Note: do_sample=True with temperature > 0 still adds randomness.
# For exact reproducibility, use do_sample=False (greedy decoding).
```