
GRPO Post-Training Engine

Implement Group Relative Policy Optimization (GRPO) using the Transformer Reinforcement Learning (TRL) library for fine-tuning language models with custom reward functions. Ideal for enforcing structured outputs, teaching verifiable reasoning, and domain-specific optimization.

When to Use

GRPO training is ideal when:

  • You need to enforce specific output formats (XML tags, JSON, structured reasoning)
  • Teaching verifiable tasks where correctness can be programmatically checked
  • Domain-specific optimization with custom reward signals
  • Math, code, and reasoning tasks where answers are objectively verifiable

Consider alternatives when:

  • Simple instruction following → use SFT (supervised fine-tuning)
  • Human preference alignment → use DPO or RLHF with human feedback
  • No clear reward signal exists → use RLHF with learned reward models

Quick Start

Installation

```bash
pip install trl transformers accelerate peft
```

Basic GRPO Training

```python
from trl import GRPOConfig, GRPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.pad_token = tokenizer.eos_token

# Define a reward function. TRL calls reward functions with keyword
# arguments (prompts, completions, plus any extra dataset columns),
# so accept **kwargs for forward compatibility.
def reward_fn(completions, **kwargs):
    rewards = []
    for completion in completions:
        # Reward structured XML output
        if "<answer>" in completion and "</answer>" in completion:
            rewards.append(1.0)
        else:
            rewards.append(-1.0)
    return rewards

config = GRPOConfig(
    output_dir="./grpo-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-6,
    num_generations=4,  # Responses per prompt for group comparison
    max_completion_length=512,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model=model,
    args=config,                  # recent TRL versions use `args`, not `config`
    processing_class=tokenizer,   # recent TRL versions use `processing_class`, not `tokenizer`
    train_dataset=train_dataset,  # dataset with a "prompt" column
    reward_funcs=[reward_fn],
)
trainer.train()
```
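`GRPOTrainer` expects `train_dataset` to contain a `prompt` column. The sketch below shows a minimal dataset in TRL's standard (non-conversational) format; the `expected_answer` column is a name chosen for this example, and any such extra columns are forwarded to reward functions as keyword arguments.

```python
# Rows for the training dataset. Each row needs a "prompt" column;
# extra columns (here "expected_answer", an illustrative name) are
# passed by GRPOTrainer to reward functions as keyword arguments.
rows = [
    {"prompt": "What is 2 + 3? Answer inside <answer> tags.", "expected_answer": "5"},
    {"prompt": "What is 7 * 6? Answer inside <answer> tags.", "expected_answer": "42"},
]

# Convert with the Hugging Face `datasets` library before training:
#   from datasets import Dataset
#   train_dataset = Dataset.from_list(rows)
```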

Core Concepts

How GRPO Works

Unlike PPO, which requires a separate critic model, GRPO:

  1. Generates multiple completions per prompt (a "group")
  2. Scores each completion with reward functions
  3. Computes advantages relative to the group mean (no critic needed)
  4. Updates policy using clipped surrogate objective

This eliminates the critic model, reducing memory by ~50% compared to PPO.
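The group-relative advantage in step 3 can be sketched in a few lines. This is an illustrative computation, not TRL's internal implementation:

```python
def group_advantages(rewards, eps=1e-8):
    """Advantage of each completion relative to its group:
    (reward - group mean) / group std. No learned critic is needed."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions for one prompt, scored by a reward function.
# Completions above the group mean get positive advantages.
advantages = group_advantages([1.0, -1.0, 1.0, 1.0])
```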

Reward Function Design

Reward functions are the heart of GRPO. They must return a numeric score for each completion:

```python
import re

# Format reward: checks structural compliance
def format_reward(completions, **kwargs):
    rewards = []
    for c in completions:
        has_think = bool(re.search(r"<think>.*?</think>", c, re.DOTALL))
        has_answer = bool(re.search(r"<answer>.*?</answer>", c, re.DOTALL))
        rewards.append(1.0 if (has_think and has_answer) else -0.5)
    return rewards

# Correctness reward: checks answer accuracy. Extra dataset columns
# (here "expected_answer") are passed to reward functions as keyword
# arguments by GRPOTrainer, one list entry per completion.
def correctness_reward(completions, expected_answer, **kwargs):
    rewards = []
    for c, expected in zip(completions, expected_answer):
        answer_match = re.search(r"<answer>(.*?)</answer>", c, re.DOTALL)
        if answer_match and answer_match.group(1).strip() == expected:
            rewards.append(2.0)
        else:
            rewards.append(0.0)
    return rewards

# Combine multiple reward functions
trainer = GRPOTrainer(
    model=model,
    args=config,
    processing_class=tokenizer,
    train_dataset=dataset,
    reward_funcs=[format_reward, correctness_reward],
)
```

Multi-Reward Composition

| Reward type | Purpose | Weight |
|---|---|---|
| Format | Structural compliance | 0.3 |
| Correctness | Answer accuracy | 0.5 |
| Brevity | Concise responses | 0.1 |
| Safety | Content filtering | 0.1 |
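One simple way to apply weights like these is to scale each reward function's output before passing it to the trainer. The wrapper below is a sketch; recent TRL versions may also expose a `reward_weights` field on `GRPOConfig`, which is preferable where available. The `brevity_reward` threshold is an arbitrary illustrative value.

```python
def weighted(reward_func, weight):
    """Wrap a reward function so its scores are scaled by `weight`."""
    def wrapper(completions, **kwargs):
        return [weight * r for r in reward_func(completions, **kwargs)]
    wrapper.__name__ = f"{reward_func.__name__}_w{weight}"
    return wrapper

def format_reward(completions, **kwargs):
    return [1.0 if "<answer>" in c else -0.5 for c in completions]

def brevity_reward(completions, **kwargs):
    return [1.0 if len(c) < 400 else 0.0 for c in completions]

# Weighted composition matching the table above
reward_funcs = [weighted(format_reward, 0.3), weighted(brevity_reward, 0.1)]
```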

Configuration

| Parameter | Typical value | Description |
|---|---|---|
| num_generations | 4 | Completions per prompt for group comparison |
| max_completion_length | 512 | Max tokens per generated completion |
| learning_rate | 1e-6 | Learning rate (lower than typical SFT rates) |
| beta | 0.1 | KL divergence penalty coefficient |
| epsilon | 0.2 | PPO-style clipping range |
| temperature | 0.7 | Sampling temperature for generations |
| num_train_epochs | 3 | Training epochs |
| gradient_accumulation_steps | 4 | Steps before each optimizer update |

Note: in TRL's GRPOConfig, the KL penalty coefficient and clipping range are named beta and epsilon; exact defaults vary between TRL releases, so treat the values above as starting points.

Best Practices

  1. Start with format rewards before adding correctness — get the model producing structured output first
  2. Use 4-8 generations per prompt for stable advantage estimation
  3. Set learning_rate to 1e-6 or lower — RL fine-tuning is sensitive to large updates
  4. Monitor KL divergence: if it spikes, reduce the learning rate or increase the KL penalty coefficient (beta in GRPOConfig)
  5. Combine multiple reward functions with appropriate weights for nuanced behavior
  6. Use LoRA/PEFT for memory-efficient training on consumer hardware

Common Issues

Reward hacking (model finds shortcuts): Add auxiliary reward functions that penalize degenerate outputs. Monitor generation quality throughout training, not just reward scores.
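One concrete auxiliary reward against a common degenerate strategy, looping the same phrase, is an n-gram repetition penalty. A minimal sketch (the 4-gram window and -1.0 floor are arbitrary choices):

```python
def repetition_penalty(completions, **kwargs):
    """Penalize completions dominated by repeated 4-grams, a common
    reward-hacking pattern. Returns 0.0 for fully diverse text, down
    to -1.0 for pure repetition."""
    rewards = []
    for c in completions:
        tokens = c.split()
        if len(tokens) < 4:
            rewards.append(0.0)
            continue
        ngrams = [tuple(tokens[i:i + 4]) for i in range(len(tokens) - 3)]
        unique_ratio = len(set(ngrams)) / len(ngrams)
        rewards.append(-(1.0 - unique_ratio))
    return rewards
```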

Training instability: Reduce the learning rate, increase the KL penalty coefficient (beta), or reduce the clipping range (epsilon). GRPO is more stable than PPO but still requires careful hyperparameter tuning.

High memory usage: Reduce num_generations from 8 to 4. Use gradient checkpointing and LoRA adapters. Consider DeepSpeed ZeRO Stage 2 for multi-GPU setups.
