GRPO Post-Training Engine
Implement Group Relative Policy Optimization (GRPO) using the Transformer Reinforcement Learning (TRL) library for fine-tuning language models with custom reward functions. Ideal for enforcing structured outputs, teaching verifiable reasoning, and domain-specific optimization.
When to Use
GRPO training is ideal when:
- You need to enforce specific output formats (XML tags, JSON, structured reasoning)
- Teaching verifiable tasks where correctness can be programmatically checked
- Domain-specific optimization with custom reward signals
- Math, code, and reasoning tasks where answers are objectively verifiable
Consider alternatives when:
- Simple instruction following → use SFT (supervised fine-tuning)
- Human preference alignment → use DPO or RLHF with human feedback
- No clear reward signal exists → use RLHF with learned reward models
Quick Start
Installation
```bash
pip install trl transformers accelerate peft
```
Basic GRPO Training
```python
from trl import GRPOConfig, GRPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.pad_token = tokenizer.eos_token

# Define a reward function. TRL passes prompts/completions as keyword
# arguments plus any extra dataset columns, so accept **kwargs.
def reward_fn(completions, prompts, **kwargs):
    rewards = []
    for completion in completions:
        # Reward structured XML output
        if "<answer>" in completion and "</answer>" in completion:
            rewards.append(1.0)
        else:
            rewards.append(-1.0)
    return rewards

config = GRPOConfig(
    output_dir="./grpo-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-6,
    num_generations=4,  # Responses per prompt for group comparison
    max_completion_length=512,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model=model,
    args=config,
    processing_class=tokenizer,
    train_dataset=train_dataset,
    reward_funcs=[reward_fn],
)
trainer.train()
```
Core Concepts
How GRPO Works
Unlike PPO which requires a separate critic model, GRPO:
- Generates multiple completions per prompt (a "group")
- Scores each completion with reward functions
- Computes advantages relative to the group mean (no critic needed)
- Updates policy using clipped surrogate objective
This eliminates the critic model, reducing memory by ~50% compared to PPO.
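The group-relative advantage step above can be sketched in plain Python (a simplified illustration of the idea, not TRL's internal implementation):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """Compute advantages relative to the group mean, as GRPO does in
    place of a learned critic. `rewards` holds the scores of the
    completions sampled from a single prompt."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    # Normalize by the group std, with a small epsilon for stability
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Completions scoring above the group mean get positive advantages,
# those below get negative ones; the group average is the baseline.
advantages = group_relative_advantages([1.0, -1.0, 1.0, -1.0])
```

Because the baseline comes from the group itself, no value network has to be trained or stored.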
Reward Function Design
Reward functions are the heart of GRPO. They must return a numeric score for each completion:
```python
import re

# Format reward — checks structural compliance
def format_reward(completions, prompts, **kwargs):
    rewards = []
    for c in completions:
        has_think = bool(re.search(r"<think>.*?</think>", c, re.DOTALL))
        has_answer = bool(re.search(r"<answer>.*?</answer>", c, re.DOTALL))
        rewards.append(1.0 if (has_think and has_answer) else -0.5)
    return rewards

# Correctness reward — checks answer accuracy. Extra dataset columns
# (here "expected_answer") are forwarded by TRL as keyword lists.
def correctness_reward(completions, prompts, expected_answer=None, **kwargs):
    rewards = []
    for c, expected in zip(completions, expected_answer):
        answer_match = re.search(r"<answer>(.*?)</answer>", c)
        if answer_match and answer_match.group(1).strip() == expected:
            rewards.append(2.0)
        else:
            rewards.append(0.0)
    return rewards

# Combine multiple reward functions
trainer = GRPOTrainer(
    model=model,
    args=config,
    processing_class=tokenizer,
    train_dataset=dataset,
    reward_funcs=[format_reward, correctness_reward],
)
```
Multi-Reward Composition
| Reward Type | Purpose | Weight |
|---|---|---|
| Format | Structural compliance | 0.3 |
| Correctness | Answer accuracy | 0.5 |
| Brevity | Concise responses | 0.1 |
| Safety | Content filtering | 0.1 |
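One way to apply weights like those in the table is to wrap the individual reward functions in a single weighted combiner. A minimal sketch (the component functions and weights here are illustrative):

```python
def make_weighted_reward(reward_fns, weights):
    """Fold several reward functions into one weighted score per completion."""
    def combined(completions, prompts, **kwargs):
        totals = [0.0] * len(completions)
        for fn, w in zip(reward_fns, weights):
            for i, r in enumerate(fn(completions, prompts, **kwargs)):
                totals[i] += w * r
        return totals
    return combined

# Illustrative component rewards
def format_score(completions, prompts, **kwargs):
    return [1.0 if "<answer>" in c else -1.0 for c in completions]

def brevity_score(completions, prompts, **kwargs):
    return [1.0 if len(c.split()) <= 50 else 0.0 for c in completions]

reward_fn = make_weighted_reward([format_score, brevity_score], [0.3, 0.1])
```

Alternatively, each component can be passed separately in `reward_funcs` and weighted on the trainer side; a single combiner keeps the weighting logic in one place.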
Configuration
| Parameter | Default | Description |
|---|---|---|
| num_generations | 4 | Completions per prompt for group comparison |
| max_completion_length | 512 | Max tokens per generated completion |
| learning_rate | 1e-6 | Learning rate (lower than SFT) |
| kl_coef | 0.1 | KL divergence penalty coefficient |
| clip_range | 0.2 | PPO-style clipping range |
| temperature | 0.7 | Sampling temperature for generations |
| num_train_epochs | 3 | Training epochs |
| gradient_accumulation_steps | 4 | Steps before optimizer update |
Best Practices
- Start with format rewards before adding correctness — get the model producing structured output first
- Use 4-8 generations per prompt for stable advantage estimation
- Set learning_rate to 1e-6 or lower — RL fine-tuning is sensitive to large updates
- Monitor KL divergence: if it spikes, reduce the learning rate or increase kl_coef
- Combine multiple reward functions with appropriate weights for nuanced behavior
- Use LoRA/PEFT for memory-efficient training on consumer hardware
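LoRA adapters can be configured up front and handed to the trainer. A minimal sketch with peft (the rank, alpha, and target modules below are illustrative and model-dependent):

```python
from peft import LoraConfig

# Illustrative LoRA settings; tune rank and alpha for your model and budget
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# TRL trainers accept a peft_config and apply the adapters themselves:
# trainer = GRPOTrainer(model=model, reward_funcs=[reward_fn],
#                       args=config, train_dataset=train_dataset,
#                       peft_config=lora_config)
```

With adapters attached, only the low-rank matrices are trained, which keeps optimizer state small enough for consumer GPUs.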
Common Issues
Reward hacking (model finds shortcuts): Add auxiliary reward functions that penalize degenerate outputs. Monitor generation quality throughout training, not just reward scores.
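A simple guard of this kind is a repetition penalty added to the reward stack (a sketch; the 0.5 distinct-word threshold is an illustrative choice):

```python
def repetition_penalty_reward(completions, prompts, **kwargs):
    """Penalize completions that repeat the same tokens heavily,
    a common reward-hacking failure mode."""
    rewards = []
    for c in completions:
        words = c.split()
        if not words:
            rewards.append(-1.0)
            continue
        # Ratio of distinct words to total words; a low ratio means
        # the completion is highly repetitive
        distinct_ratio = len(set(words)) / len(words)
        rewards.append(0.0 if distinct_ratio >= 0.5 else -1.0)
    return rewards
```

Used alongside the task rewards, this makes "repeat the magic tag forever" a losing strategy without constraining legitimate outputs.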
Training instability:
Reduce learning rate, increase kl_coef, or reduce clip_range. GRPO is more stable than PPO but still requires careful hyperparameter tuning.
High memory usage:
Reduce num_generations from 8 to 4. Use gradient checkpointing and LoRA adapters. Consider DeepSpeed ZeRO Stage 2 for multi-GPU setups.