GRPO Post-Training Engine
Implement Group Relative Policy Optimization (GRPO) using the Transformer Reinforcement Learning (TRL) library for fine-tuning language models with custom reward functions. Ideal for enforcing structured outputs, teaching verifiable reasoning, and domain-specific optimization.
When to Use
GRPO training is ideal when:
- You need to enforce specific output formats (XML tags, JSON, structured reasoning)
- Teaching verifiable tasks where correctness can be programmatically checked
- Domain-specific optimization with custom reward signals
- Math, code, and reasoning tasks where answers are objectively verifiable
Consider alternatives when:
- Simple instruction following → use SFT (supervised fine-tuning)
- Human preference alignment → use DPO or RLHF with human feedback
- No clear reward signal exists → use RLHF with learned reward models
Quick Start
Installation
```bash
pip install trl transformers accelerate peft
```
Basic GRPO Training
```python
from trl import GRPOConfig, GRPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.pad_token = tokenizer.eos_token

# Define a reward function. TRL passes prompts/completions as keyword
# arguments plus any extra dataset columns, so accept **kwargs.
def reward_fn(completions, prompts, **kwargs):
    rewards = []
    for completion in completions:
        # Reward structured XML output
        if "<answer>" in completion and "</answer>" in completion:
            rewards.append(1.0)
        else:
            rewards.append(-1.0)
    return rewards

config = GRPOConfig(
    output_dir="./grpo-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-6,
    num_generations=4,  # Responses per prompt for group comparison
    max_completion_length=512,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model=model,
    args=config,
    processing_class=tokenizer,
    train_dataset=train_dataset,
    reward_funcs=[reward_fn],
)
trainer.train()
```
Core Concepts
How GRPO Works
Unlike PPO which requires a separate critic model, GRPO:
- Generates multiple completions per prompt (a "group")
- Scores each completion with reward functions
- Computes advantages relative to the group mean (no critic needed)
- Updates policy using clipped surrogate objective
This eliminates the critic model, reducing memory by ~50% compared to PPO.
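The group-relative advantage step above can be sketched in plain Python (a simplified illustration of the idea, not TRL's internal implementation):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """Compute advantages relative to the group mean, as GRPO does in
    place of a learned critic. `rewards` holds the scores of the
    completions sampled from a single prompt."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    # Normalize by the group std, with a small epsilon for stability
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Completions scoring above the group mean get positive advantages,
# those below get negative ones; the group average is the baseline.
advantages = group_relative_advantages([1.0, -1.0, 1.0, -1.0])
```

Because the baseline comes from the group itself, no value network has to be trained or stored.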
Reward Function Design
Reward functions are the heart of GRPO. They must return a numeric score for each completion:
```python
import re

# Format reward — checks structural compliance
def format_reward(completions, prompts, **kwargs):
    rewards = []
    for c in completions:
        has_think = bool(re.search(r"<think>.*?</think>", c, re.DOTALL))
        has_answer = bool(re.search(r"<answer>.*?</answer>", c, re.DOTALL))
        rewards.append(1.0 if (has_think and has_answer) else -0.5)
    return rewards

# Correctness reward — checks answer accuracy. Extra dataset columns
# (here "expected_answer") are forwarded by TRL as keyword lists.
def correctness_reward(completions, prompts, expected_answer=None, **kwargs):
    rewards = []
    for c, expected in zip(completions, expected_answer):
        answer_match = re.search(r"<answer>(.*?)</answer>", c)
        if answer_match and answer_match.group(1).strip() == expected:
            rewards.append(2.0)
        else:
            rewards.append(0.0)
    return rewards

# Combine multiple reward functions
trainer = GRPOTrainer(
    model=model,
    args=config,
    processing_class=tokenizer,
    train_dataset=dataset,
    reward_funcs=[format_reward, correctness_reward],
)
```
Multi-Reward Composition
| Reward Type | Purpose | Weight |
|---|---|---|
| Format | Structural compliance | 0.3 |
| Correctness | Answer accuracy | 0.5 |
| Brevity | Concise responses | 0.1 |
| Safety | Content filtering | 0.1 |
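One way to apply weights like those in the table is to wrap the individual reward functions in a single weighted combiner. A minimal sketch (the component functions and weights here are illustrative):

```python
def make_weighted_reward(reward_fns, weights):
    """Fold several reward functions into one weighted score per completion."""
    def combined(completions, prompts, **kwargs):
        totals = [0.0] * len(completions)
        for fn, w in zip(reward_fns, weights):
            for i, r in enumerate(fn(completions, prompts, **kwargs)):
                totals[i] += w * r
        return totals
    return combined

# Illustrative component rewards
def format_score(completions, prompts, **kwargs):
    return [1.0 if "<answer>" in c else -1.0 for c in completions]

def brevity_score(completions, prompts, **kwargs):
    return [1.0 if len(c.split()) <= 50 else 0.0 for c in completions]

reward_fn = make_weighted_reward([format_score, brevity_score], [0.3, 0.1])
```

Alternatively, each component can be passed separately in `reward_funcs` and weighted on the trainer side; a single combiner keeps the weighting logic in one place.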
Configuration
| Parameter | Default | Description |
|---|---|---|
| num_generations | 4 | Completions per prompt for group comparison |
| max_completion_length | 512 | Max tokens per generated completion |
| learning_rate | 1e-6 | Learning rate (lower than SFT) |
| kl_coef | 0.1 | KL divergence penalty coefficient |
| clip_range | 0.2 | PPO-style clipping range |
| temperature | 0.7 | Sampling temperature for generations |
| num_train_epochs | 3 | Training epochs |
| gradient_accumulation_steps | 4 | Steps before optimizer update |
Best Practices
- Start with format rewards before adding correctness — get the model producing structured output first
- Use 4-8 generations per prompt for stable advantage estimation
- Set learning_rate to 1e-6 or lower — RL fine-tuning is sensitive to large updates
- Monitor KL divergence: if it spikes, reduce the learning rate or increase kl_coef
- Combine multiple reward functions with appropriate weights for nuanced behavior
- Use LoRA/PEFT for memory-efficient training on consumer hardware
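LoRA adapters can be configured up front and handed to the trainer. A minimal sketch with peft (the rank, alpha, and target modules below are illustrative and model-dependent):

```python
from peft import LoraConfig

# Illustrative LoRA settings; tune rank and alpha for your model and budget
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# TRL trainers accept a peft_config and apply the adapters themselves:
# trainer = GRPOTrainer(model=model, reward_funcs=[reward_fn],
#                       args=config, train_dataset=train_dataset,
#                       peft_config=lora_config)
```

With adapters attached, only the low-rank matrices are trained, which keeps optimizer state small enough for consumer GPUs.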
Common Issues
Reward hacking (model finds shortcuts): Add auxiliary reward functions that penalize degenerate outputs. Monitor generation quality throughout training, not just reward scores.
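A simple guard of this kind is a repetition penalty added to the reward stack (a sketch; the 0.5 distinct-word threshold is an illustrative choice):

```python
def repetition_penalty_reward(completions, prompts, **kwargs):
    """Penalize completions that repeat the same tokens heavily,
    a common reward-hacking failure mode."""
    rewards = []
    for c in completions:
        words = c.split()
        if not words:
            rewards.append(-1.0)
            continue
        # Ratio of distinct words to total words; a low ratio means
        # the completion is highly repetitive
        distinct_ratio = len(set(words)) / len(words)
        rewards.append(0.0 if distinct_ratio >= 0.5 else -1.0)
    return rewards
```

Used alongside the task rewards, this makes "repeat the magic tag forever" a losing strategy without constraining legitimate outputs.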
Training instability:
Reduce learning rate, increase kl_coef, or reduce clip_range. GRPO is more stable than PPO but still requires careful hyperparameter tuning.
High memory usage:
Reduce num_generations from 8 to 4. Use gradient checkpointing and LoRA adapters. Consider DeepSpeed ZeRO Stage 2 for multi-GPU setups.