NeMo Evaluator System

A comprehensive skill for evaluating LLMs across benchmarks. Includes structured workflows, validation checks, and reusable patterns for AI research.

Skill · Cliptics · AI research · v1.0.0 · MIT

NVIDIA NeMo Evaluator

Overview

A comprehensive skill for evaluating LLMs using NVIDIA NeMo Evaluator — a scalable, GPU-accelerated evaluation framework that supports standard benchmarks, custom evaluations, and LLM-as-judge scoring. NeMo Evaluator integrates with NeMo Framework for end-to-end model development and supports both local and API-based model evaluation at scale.

When to Use

  • Evaluating models trained with NeMo Framework
  • Running GPU-accelerated evaluations at scale
  • Using LLM-as-judge for open-ended generation quality
  • Comparing models across multiple benchmarks simultaneously
  • Integrating evaluation into NeMo training pipelines
  • Custom evaluation with domain-specific metrics

Quick Start

```bash
# Install NeMo with evaluation support
pip install nemo_toolkit[all]

# Or use the NVIDIA container
docker run --gpus all -it nvcr.io/nvidia/nemo:25.04

# Run evaluation
python -m nemo.collections.llm.evaluation.evaluate \
    --model_path ./my_model \
    --tasks mmlu gsm8k \
    --output_dir ./eval_results
```

Evaluation Configuration

```yaml
# eval_config.yaml
model:
  path: ./checkpoints/llama-8b
  type: nemo  # or hf, api

evaluation:
  tasks:
    - name: mmlu
      num_fewshot: 5
      metric: accuracy
    - name: gsm8k
      num_fewshot: 8
      metric: exact_match
    - name: humaneval
      metric: pass_at_1
      temperature: 0.2
  batch_size: 32
  precision: bf16
  tensor_parallel_size: 2

output:
  dir: ./results
  format: json
  include_predictions: true
```
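Before launching a long run, it can help to sanity-check the config structure. A minimal sketch, assuming the key layout of `eval_config.yaml` above; the `validate_eval_config` helper is hypothetical, not part of the NeMo API:

```python
# Sanity-check an evaluation config dict before launching a run.
# Key names mirror eval_config.yaml; the validator itself is illustrative.

REQUIRED_TASK_KEYS = {"name", "metric"}

def validate_eval_config(config: dict) -> list[str]:
    """Return a list of problems found; an empty list means the config looks sane."""
    problems = []
    if "path" not in config.get("model", {}):
        problems.append("model.path is required")
    tasks = config.get("evaluation", {}).get("tasks", [])
    if not tasks:
        problems.append("evaluation.tasks must list at least one task")
    for i, task in enumerate(tasks):
        missing = REQUIRED_TASK_KEYS - task.keys()
        if missing:
            problems.append(f"task {i} missing keys: {sorted(missing)}")
    return problems

config = {
    "model": {"path": "./checkpoints/llama-8b", "type": "nemo"},
    "evaluation": {"tasks": [{"name": "mmlu", "num_fewshot": 5, "metric": "accuracy"}]},
}
print(validate_eval_config(config))  # []
```

Failing fast on a missing `model.path` or an empty task list is cheaper than discovering it mid-run on a multi-GPU job.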

LLM-as-Judge Evaluation

```python
from nemo.collections.llm.evaluation import LLMJudge

judge_config = {
    "judge_model": "gpt-4",  # or local model
    "criteria": [
        {
            "name": "helpfulness",
            "description": "How helpful is the response to the user's query?",
            "scale": [1, 5],
        },
        {
            "name": "accuracy",
            "description": "How factually accurate is the response?",
            "scale": [1, 5],
        },
        {
            "name": "coherence",
            "description": "How well-organized and coherent is the response?",
            "scale": [1, 5],
        },
    ],
    "pairwise": False,  # Set True for A/B comparison
}

judge = LLMJudge(judge_config)
results = judge.evaluate(
    prompts=test_prompts,
    responses=model_responses,
)

for r in results:
    print(f"Helpfulness: {r['helpfulness']}, Accuracy: {r['accuracy']}")
```
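Per-criterion scores are easier to compare across models once aggregated. A small illustrative helper (not a NeMo API) that averages result dictionaries shaped like those returned above:

```python
# Aggregate per-criterion judge scores into mean summary statistics.
# `results` matches the shape from the LLMJudge example; illustrative only.
from statistics import mean

def summarize_scores(results, criteria=("helpfulness", "accuracy", "coherence")):
    """Mean score per criterion, rounded for readability."""
    return {c: round(mean(r[c] for r in results), 2) for c in criteria}

results = [
    {"helpfulness": 4, "accuracy": 5, "coherence": 4},
    {"helpfulness": 3, "accuracy": 4, "coherence": 5},
]
print(summarize_scores(results))  # {'helpfulness': 3.5, 'accuracy': 4.5, 'coherence': 4.5}
```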

Supported Benchmarks

| Benchmark | Category | GPU-Accelerated | Description |
|---|---|---|---|
| MMLU | Knowledge | Yes | 57-subject multiple choice |
| GSM8K | Math | Yes | Grade school math problems |
| HumanEval | Coding | Yes | Code generation |
| TruthfulQA | Safety | Yes | Truthfulness evaluation |
| MT-Bench | Chat | Via LLM-judge | Multi-turn conversation |
| HELM | Comprehensive | Yes | Holistic evaluation |
| Custom | Any | Yes | User-defined tasks |
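For reference, HumanEval's pass@1 comes from the standard unbiased pass@k estimator: given n generated samples of which c pass the tests, pass@k = 1 - C(n-c, k)/C(n, k). A self-contained sketch of that formula (the helper name is illustrative, not NeMo's):

```python
# Unbiased pass@k estimator used for code-generation benchmarks:
# n samples drawn per problem, c of them pass the unit tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled completions is correct."""
    if n - c < k:  # too few failures to fill a k-sample with all-wrong answers
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(10, 3, 1), 4))  # 0.3
```

With k=1 this reduces to the plain success rate c/n, which is why `pass_at_1` in the config above is simply the fraction of problems solved on the first sample.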

Configuration Reference

| Parameter | Default | Description |
|---|---|---|
| `model.path` | Required | Path to model checkpoint |
| `model.type` | `nemo` | Model format (`nemo`, `hf`, `api`) |
| `evaluation.batch_size` | `32` | Evaluation batch size |
| `evaluation.precision` | `bf16` | Compute precision |
| `evaluation.tensor_parallel_size` | `1` | GPU parallelism |
| `evaluation.num_fewshot` | Task default | Few-shot examples |
| `output.include_predictions` | `false` | Save per-sample outputs |
| `judge.model` | `gpt-4` | LLM judge model |
| `judge.criteria` | `[]` | Scoring dimensions |

Best Practices

  1. Use GPU-accelerated evaluation — NeMo parallelism enables faster evaluation on large models
  2. Combine benchmarks and LLM-judge — Standard benchmarks for comparability, LLM-judge for quality
  3. Evaluate during training — Set up periodic evaluation checkpoints in NeMo training config
  4. Use multi-criteria judging — Score helpfulness, accuracy, and safety separately
  5. Match evaluation to deployment — Use same precision, batch size, and context length as production
  6. Save predictions — Enable include_predictions for error analysis
  7. Use pairwise comparison — More reliable than absolute scoring for A/B testing
  8. Track eval metrics over training — Plot evaluation curves alongside training loss
  9. Calibrate your judge — Validate LLM-judge scores against human annotations
  10. Test at small scale first — Use --limit 50 to verify setup before full evaluation
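For practice 9, one quick calibration check is to correlate judge scores with human ratings on a shared sample. A pure-Python sketch with illustrative data (the helper and the numbers are assumptions for demonstration):

```python
# Calibrate an LLM judge: Pearson correlation between judge and human scores
# on the same responses. Illustrative data; a real check needs a larger sample.
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

human = [5, 4, 2, 3, 1]
judge = [5, 5, 2, 3, 1]
print(round(pearson(human, judge), 3))  # 0.972
```

A low correlation suggests the judge prompt or criteria need revision before its scores are trusted for model comparisons.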

Troubleshooting

Evaluation hangs on multi-GPU

```bash
# Check NCCL configuration
export NCCL_DEBUG=INFO

# Or reduce tensor parallelism in eval_config.yaml:
# tensor_parallel_size: 1
```

NeMo checkpoint not loading

```bash
# Convert HF model to NeMo format
python -m nemo.collections.llm.tools.convert_hf \
    --input_path ./hf-model \
    --output_path ./nemo-model
```

LLM judge scores inconsistent

```python
# Use temperature=0 for deterministic judging
judge_config["temperature"] = 0

# Use majority vote across 3 judgments
judge_config["num_judgments"] = 3
judge_config["aggregation"] = "majority"
```
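The majority aggregation above can be sketched as follows (an illustrative helper, not the NeMo implementation); when every score appears equally often, falling back to the median is one reasonable tie-break:

```python
# Majority-vote aggregation across repeated judgments of the same response.
from collections import Counter
from statistics import median

def majority_vote(scores):
    """Return the most common score; on a tie, fall back to the median."""
    counts = Counter(scores).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return median(scores)
    return counts[0][0]

print(majority_vote([4, 4, 5]))  # 4
print(majority_vote([3, 4, 5]))  # 4 (three-way tie -> median)
```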