NVIDIA NeMo Evaluator
Overview
A comprehensive skill for evaluating LLMs using NVIDIA NeMo Evaluator — a scalable, GPU-accelerated evaluation framework that supports standard benchmarks, custom evaluations, and LLM-as-judge scoring. NeMo Evaluator integrates with NeMo Framework for end-to-end model development and supports both local and API-based model evaluation at scale.
When to Use
- Evaluating models trained with NeMo Framework
- Running GPU-accelerated evaluations at scale
- Using LLM-as-judge for open-ended generation quality
- Comparing models across multiple benchmarks simultaneously
- Integrating evaluation into NeMo training pipelines
- Custom evaluation with domain-specific metrics
Quick Start
```bash
# Install NeMo with evaluation support
pip install nemo_toolkit[all]

# Or use NVIDIA container
docker run --gpus all -it nvcr.io/nvidia/nemo:25.04

# Run evaluation
python -m nemo.collections.llm.evaluation.evaluate \
  --model_path ./my_model \
  --tasks mmlu gsm8k \
  --output_dir ./eval_results
```
Evaluation Configuration
```yaml
# eval_config.yaml
model:
  path: ./checkpoints/llama-8b
  type: nemo  # or hf, api

evaluation:
  tasks:
    - name: mmlu
      num_fewshot: 5
      metric: accuracy
    - name: gsm8k
      num_fewshot: 8
      metric: exact_match
    - name: humaneval
      metric: pass_at_1
      temperature: 0.2
  batch_size: 32
  precision: bf16
  tensor_parallel_size: 2

output:
  dir: ./results
  format: json
  include_predictions: true
```
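Before launching a long run, it can pay to sanity-check the config. The sketch below validates the parsed config as a plain Python dict; the `validate_eval_config` helper and the exact required keys are illustrative assumptions, not part of NeMo's API:

```python
# Hypothetical pre-flight check for an eval config (parsed YAML as a dict).
REQUIRED_TASK_KEYS = {"name", "metric"}

def validate_eval_config(cfg: dict) -> list[str]:
    """Return a list of problems found in an eval config dict (empty = OK)."""
    problems = []
    if "path" not in cfg.get("model", {}):
        problems.append("model.path is required")
    tasks = cfg.get("evaluation", {}).get("tasks", [])
    if not tasks:
        problems.append("evaluation.tasks must list at least one task")
    for i, task in enumerate(tasks):
        missing = REQUIRED_TASK_KEYS - task.keys()
        if missing:
            problems.append(f"task {i}: missing {sorted(missing)}")
    return problems

cfg = {
    "model": {"path": "./checkpoints/llama-8b", "type": "nemo"},
    "evaluation": {"tasks": [{"name": "mmlu", "num_fewshot": 5, "metric": "accuracy"}]},
}
print(validate_eval_config(cfg))  # [] — config looks well-formed
```

Failing fast on a missing `model.path` or an empty task list is much cheaper than discovering the problem after model loading.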
LLM-as-Judge Evaluation
```python
from nemo.collections.llm.evaluation import LLMJudge

judge_config = {
    "judge_model": "gpt-4",  # or local model
    "criteria": [
        {
            "name": "helpfulness",
            "description": "How helpful is the response to the user's query?",
            "scale": [1, 5],
        },
        {
            "name": "accuracy",
            "description": "How factually accurate is the response?",
            "scale": [1, 5],
        },
        {
            "name": "coherence",
            "description": "How well-organized and coherent is the response?",
            "scale": [1, 5],
        },
    ],
    "pairwise": False,  # Set True for A/B comparison
}

judge = LLMJudge(judge_config)
results = judge.evaluate(
    prompts=test_prompts,
    responses=model_responses,
)

for r in results:
    print(f"Helpfulness: {r['helpfulness']}, Accuracy: {r['accuracy']}")
```
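Per-sample judge scores are usually rolled up into model-level numbers. A minimal sketch of averaging each criterion, assuming result dicts shaped like the example above (the exact result format returned by `LLMJudge` may differ):

```python
from statistics import mean

def summarize_scores(results: list[dict], criteria: list[str]) -> dict[str, float]:
    """Average each judging criterion across all evaluated samples."""
    return {c: mean(r[c] for r in results) for c in criteria}

# Illustrative per-sample results on a 1-5 scale
results = [
    {"helpfulness": 4, "accuracy": 5, "coherence": 4},
    {"helpfulness": 3, "accuracy": 4, "coherence": 5},
]
print(summarize_scores(results, ["helpfulness", "accuracy", "coherence"]))
# {'helpfulness': 3.5, 'accuracy': 4.5, 'coherence': 4.5}
```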
Supported Benchmarks
| Benchmark | Category | GPU-Accelerated | Description |
|---|---|---|---|
| MMLU | Knowledge | Yes | 57-subject multiple choice |
| GSM8K | Math | Yes | Grade school math problems |
| HumanEval | Coding | Yes | Code generation |
| TruthfulQA | Safety | Yes | Truthfulness evaluation |
| MT-Bench | Chat | Via LLM-Judge | Multi-turn conversation |
| HELM | Comprehensive | Yes | Holistic evaluation |
| Custom | Any | Yes | User-defined tasks |
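HumanEval's `pass_at_1` metric generalizes to pass@k. A common way to compute it from n generated samples with c correct is the standard unbiased estimator (this is the widely used formula, sketched here for illustration; NeMo's internal implementation may differ):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn from n generations (c of them correct) passes."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=10, c=3, k=1), 4))  # 0.3 — with 3/10 correct, pass@1 is 0.3
```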
Configuration Reference
| Parameter | Default | Description |
|---|---|---|
| `model.path` | Required | Path to model checkpoint |
| `model.type` | `nemo` | Model format (`nemo`, `hf`, `api`) |
| `evaluation.batch_size` | 32 | Evaluation batch size |
| `evaluation.precision` | `bf16` | Compute precision |
| `evaluation.tensor_parallel_size` | 1 | GPU parallelism |
| `evaluation.num_fewshot` | Task default | Few-shot examples |
| `output.include_predictions` | `false` | Save per-sample outputs |
| `judge.model` | `gpt-4` | LLM judge model |
| `judge.criteria` | `[]` | Scoring dimensions |
Best Practices
- Use GPU-accelerated evaluation — NeMo parallelism enables faster evaluation on large models
- Combine benchmarks and LLM-judge — Standard benchmarks for comparability, LLM-judge for quality
- Evaluate during training — Set up periodic evaluation checkpoints in NeMo training config
- Use multi-criteria judging — Score helpfulness, accuracy, and safety separately
- Match evaluation to deployment — Use same precision, batch size, and context length as production
- Save predictions — Enable `include_predictions` for error analysis
- Use pairwise comparison — More reliable than absolute scoring for A/B testing
- Track eval metrics over training — Plot evaluation curves alongside training loss
- Calibrate your judge — Validate LLM-judge scores against human annotations
- Test at small scale first — Use `--limit 50` to verify setup before full evaluation
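For "Calibrate your judge", one simple check is rank correlation between judge scores and human annotations on a shared sample. A sketch computing Spearman's rho by hand (the scores are made-up illustration data; the simple formula below assumes no tied scores):

```python
def rank(xs: list[float]) -> list[int]:
    """Ranks of values (1 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r

def spearman(a: list[float], b: list[float]) -> float:
    """Spearman rank correlation via the classic no-ties formula."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(rank(a), rank(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))

judge_scores = [4.1, 2.3, 4.8, 3.0, 1.5]  # LLM-judge scores (illustrative)
human_scores = [4.0, 2.5, 5.0, 3.5, 1.0]  # human annotations (illustrative)
print(spearman(judge_scores, human_scores))  # 1.0 — identical ranking
```

A rho well below ~0.7 on a held-out annotated sample is a reasonable signal that the judge prompt or criteria need revision before trusting its scores at scale.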
Troubleshooting
Evaluation hangs on multi-GPU
```bash
# Check NCCL configuration
export NCCL_DEBUG=INFO
```

```yaml
# Reduce tensor parallelism
tensor_parallel_size: 1
```
NeMo checkpoint not loading
```bash
# Convert HF model to NeMo format
python -m nemo.collections.llm.tools.convert_hf \
  --input_path ./hf-model \
  --output_path ./nemo-model
```
LLM judge scores inconsistent
```python
# Add temperature=0 for deterministic judging
judge_config["temperature"] = 0

# Use majority vote across 3 judgments
judge_config["num_judgments"] = 3
judge_config["aggregation"] = "majority"
```
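A sketch of what majority-vote aggregation over repeated judgments does (assuming integer scores; the `num_judgments`/`aggregation` config keys follow the snippet above and may vary across NeMo versions):

```python
from collections import Counter

def majority_vote(judgments: list[int]) -> int:
    """Most common score across repeated judgments; ties break toward
    the lower (more conservative) score."""
    counts = Counter(judgments)
    best = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    return best[0]

print(majority_vote([4, 4, 5]))  # 4 — two of three judgments agree
```

Breaking ties toward the lower score is one reasonable choice; averaging the tied scores is another.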