NVIDIA NeMo Evaluator
Overview
A comprehensive skill for evaluating LLMs using NVIDIA NeMo Evaluator — a scalable, GPU-accelerated evaluation framework that supports standard benchmarks, custom evaluations, and LLM-as-judge scoring. NeMo Evaluator integrates with NeMo Framework for end-to-end model development and supports both local and API-based model evaluation at scale.
When to Use
- Evaluating models trained with NeMo Framework
- Running GPU-accelerated evaluations at scale
- Using LLM-as-judge for open-ended generation quality
- Comparing models across multiple benchmarks simultaneously
- Integrating evaluation into NeMo training pipelines
- Custom evaluation with domain-specific metrics
Quick Start
```bash
# Install NeMo with evaluation support
pip install nemo_toolkit[all]

# Or use NVIDIA container
docker run --gpus all -it nvcr.io/nvidia/nemo:25.04

# Run evaluation
python -m nemo.collections.llm.evaluation.evaluate \
  --model_path ./my_model \
  --tasks mmlu gsm8k \
  --output_dir ./eval_results
```
Evaluation Configuration
```yaml
# eval_config.yaml
model:
  path: ./checkpoints/llama-8b
  type: nemo  # or hf, api

evaluation:
  tasks:
    - name: mmlu
      num_fewshot: 5
      metric: accuracy
    - name: gsm8k
      num_fewshot: 8
      metric: exact_match
    - name: humaneval
      metric: pass_at_1
      temperature: 0.2
  batch_size: 32
  precision: bf16
  tensor_parallel_size: 2

output:
  dir: ./results
  format: json
  include_predictions: true
```
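Before launching a long run, it can pay to sanity-check the config. The sketch below validates the parsed config as a plain Python dict; the `validate_eval_config` helper and the exact required keys are illustrative assumptions, not part of NeMo's API:

```python
# Hypothetical pre-flight check for an eval config (parsed YAML as a dict).
REQUIRED_TASK_KEYS = {"name", "metric"}

def validate_eval_config(cfg: dict) -> list[str]:
    """Return a list of problems found in an eval config dict (empty = OK)."""
    problems = []
    if "path" not in cfg.get("model", {}):
        problems.append("model.path is required")
    tasks = cfg.get("evaluation", {}).get("tasks", [])
    if not tasks:
        problems.append("evaluation.tasks must list at least one task")
    for i, task in enumerate(tasks):
        missing = REQUIRED_TASK_KEYS - task.keys()
        if missing:
            problems.append(f"task {i}: missing {sorted(missing)}")
    return problems

cfg = {
    "model": {"path": "./checkpoints/llama-8b", "type": "nemo"},
    "evaluation": {"tasks": [{"name": "mmlu", "num_fewshot": 5, "metric": "accuracy"}]},
}
print(validate_eval_config(cfg))  # [] — config looks well-formed
```

Failing fast on a missing `model.path` or an empty task list is much cheaper than discovering the problem after model loading.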
LLM-as-Judge Evaluation
```python
from nemo.collections.llm.evaluation import LLMJudge

judge_config = {
    "judge_model": "gpt-4",  # or local model
    "criteria": [
        {
            "name": "helpfulness",
            "description": "How helpful is the response to the user's query?",
            "scale": [1, 5],
        },
        {
            "name": "accuracy",
            "description": "How factually accurate is the response?",
            "scale": [1, 5],
        },
        {
            "name": "coherence",
            "description": "How well-organized and coherent is the response?",
            "scale": [1, 5],
        },
    ],
    "pairwise": False,  # Set True for A/B comparison
}

judge = LLMJudge(judge_config)
results = judge.evaluate(
    prompts=test_prompts,
    responses=model_responses,
)

for r in results:
    print(f"Helpfulness: {r['helpfulness']}, Accuracy: {r['accuracy']}")
```
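Per-sample judge scores are usually rolled up into model-level numbers. A minimal sketch of averaging each criterion, assuming result dicts shaped like the example above (the exact result format returned by `LLMJudge` may differ):

```python
from statistics import mean

def summarize_scores(results: list[dict], criteria: list[str]) -> dict[str, float]:
    """Average each judging criterion across all evaluated samples."""
    return {c: mean(r[c] for r in results) for c in criteria}

# Illustrative per-sample results on a 1-5 scale
results = [
    {"helpfulness": 4, "accuracy": 5, "coherence": 4},
    {"helpfulness": 3, "accuracy": 4, "coherence": 5},
]
print(summarize_scores(results, ["helpfulness", "accuracy", "coherence"]))
# {'helpfulness': 3.5, 'accuracy': 4.5, 'coherence': 4.5}
```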
Supported Benchmarks
| Benchmark | Category | GPU-Accelerated | Description |
|---|---|---|---|
| MMLU | Knowledge | Yes | 57-subject multiple choice |
| GSM8K | Math | Yes | Grade school math problems |
| HumanEval | Coding | Yes | Code generation |
| TruthfulQA | Safety | Yes | Truthfulness evaluation |
| MT-Bench | Chat | Via LLM-Judge | Multi-turn conversation |
| HELM | Comprehensive | Yes | Holistic evaluation |
| Custom | Any | Yes | User-defined tasks |
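HumanEval's `pass_at_1` metric generalizes to pass@k. A common way to compute it from n generated samples with c correct is the standard unbiased estimator (this is the widely used formula, sketched here for illustration; NeMo's internal implementation may differ):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn from n generations (c of them correct) passes."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=10, c=3, k=1), 4))  # 0.3 — with 3/10 correct, pass@1 is 0.3
```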
Configuration Reference
| Parameter | Default | Description |
|---|---|---|
| `model.path` | Required | Path to model checkpoint |
| `model.type` | `nemo` | Model format (`nemo`, `hf`, `api`) |
| `evaluation.batch_size` | 32 | Evaluation batch size |
| `evaluation.precision` | `bf16` | Compute precision |
| `evaluation.tensor_parallel_size` | 1 | GPU parallelism |
| `evaluation.num_fewshot` | Task default | Few-shot examples |
| `output.include_predictions` | `false` | Save per-sample outputs |
| `judge.model` | `gpt-4` | LLM judge model |
| `judge.criteria` | `[]` | Scoring dimensions |
Best Practices
- Use GPU-accelerated evaluation — NeMo parallelism enables faster evaluation on large models
- Combine benchmarks and LLM-judge — Standard benchmarks for comparability, LLM-judge for quality
- Evaluate during training — Set up periodic evaluation checkpoints in NeMo training config
- Use multi-criteria judging — Score helpfulness, accuracy, and safety separately
- Match evaluation to deployment — Use same precision, batch size, and context length as production
- Save predictions — Enable `include_predictions` for error analysis
- Use pairwise comparison — More reliable than absolute scoring for A/B testing
- Track eval metrics over training — Plot evaluation curves alongside training loss
- Calibrate your judge — Validate LLM-judge scores against human annotations
- Test at small scale first — Use `--limit 50` to verify setup before full evaluation
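For "Calibrate your judge", one simple check is rank correlation between judge scores and human annotations on a shared sample. A sketch computing Spearman's rho by hand (the scores are made-up illustration data; the simple formula below assumes no tied scores):

```python
def rank(xs: list[float]) -> list[int]:
    """Ranks of values (1 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r

def spearman(a: list[float], b: list[float]) -> float:
    """Spearman rank correlation via the classic no-ties formula."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(rank(a), rank(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))

judge_scores = [4.1, 2.3, 4.8, 3.0, 1.5]  # LLM-judge scores (illustrative)
human_scores = [4.0, 2.5, 5.0, 3.5, 1.0]  # human annotations (illustrative)
print(spearman(judge_scores, human_scores))  # 1.0 — identical ranking
```

A rho well below ~0.7 on a held-out annotated sample is a reasonable signal that the judge prompt or criteria need revision before trusting its scores at scale.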
Troubleshooting
Evaluation hangs on multi-GPU
```bash
# Check NCCL configuration
export NCCL_DEBUG=INFO
```

```yaml
# Reduce tensor parallelism
tensor_parallel_size: 1
```
NeMo checkpoint not loading
```bash
# Convert HF model to NeMo format
python -m nemo.collections.llm.tools.convert_hf \
  --input_path ./hf-model \
  --output_path ./nemo-model
```
LLM judge scores inconsistent
```python
# Add temperature=0 for deterministic judging
judge_config["temperature"] = 0

# Use majority vote across 3 judgments
judge_config["num_judgments"] = 3
judge_config["aggregation"] = "majority"
```
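A sketch of what majority-vote aggregation over repeated judgments does (assuming integer scores; the `num_judgments`/`aggregation` config keys follow the snippet above and may vary across NeMo versions):

```python
from collections import Counter

def majority_vote(judgments: list[int]) -> int:
    """Most common score across repeated judgments; ties break toward
    the lower (more conservative) score."""
    counts = Counter(judgments)
    best = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    return best[0]

print(majority_vote([4, 4, 5]))  # 4 — two of three judgments agree
```

Breaking ties toward the lower score is one reasonable choice; averaging the tied scores is another.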