# Post-Training Expert
Comprehensive guide to post-training techniques for large language models including RLHF, DPO, GRPO, SFT, and reward modeling, with framework recommendations for different scales.
## When to Use
Post-training is needed when:
- Aligning a base model with human preferences (helpfulness, harmlessness)
- Teaching specific output formats or behaviors
- Improving reasoning, coding, or domain-specific capabilities
- Reducing harmful or unsafe outputs
Choose the right technique:
- SFT — supervised fine-tuning on instruction-response pairs (simplest)
- DPO — direct preference optimization from preference pairs (moderate complexity)
- GRPO — group relative policy optimization with custom rewards (RL-based)
- PPO — proximal policy optimization with reward models (most complex)
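Concretely, these techniques expect differently shaped training records. A minimal sketch of the formats, using TRL's default field names (the example text itself is invented):

```python
# SFT: plain instruction-response pairs ("prompt"/"completion" in TRL's format)
sft_example = {
    "prompt": "Summarize: The cat sat on the mat.",
    "completion": "A cat rested on a mat.",
}

# DPO: the same prompt with a preferred and a dispreferred response
dpo_example = {
    "prompt": "Explain recursion in one sentence.",
    "chosen": "Recursion is when a function solves a problem by calling itself on smaller inputs.",
    "rejected": "Recursion is a programming thing.",
}

# GRPO: prompts only -- rewards come from functions, not labels.
# "answer" is a hypothetical extra column a reward function could check.
grpo_example = {"prompt": "What is 17 * 3?", "answer": "51"}
```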
## Quick Start

### SFT (Start Here)
```python
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

config = SFTConfig(
    output_dir="./sft-output",
    max_seq_length=2048,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    num_train_epochs=3,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=config,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
    train_dataset=train_dataset,  # instruction-response pairs
)
trainer.train()
```
### DPO (Preference Learning)
```python
from trl import DPOTrainer, DPOConfig

config = DPOConfig(
    output_dir="./dpo-output",
    beta=0.1,
    per_device_train_batch_size=4,
    learning_rate=5e-7,
    num_train_epochs=1,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,  # frozen copy of the SFT model
    args=config,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
    train_dataset=preference_dataset,  # chosen + rejected pairs
)
trainer.train()
```
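GRPO replaces the learned reward model with programmatic reward functions. A sketch of that pattern (pure Python; TRL's `GRPOTrainer` accepts a list of such functions via `reward_funcs` and scores each sampled completion — the `answer` column and both function names here are illustrative assumptions):

```python
import re

def correctness_reward(completions, answer, **kwargs):
    """1.0 if the last number in the completion matches the reference answer, else 0.0."""
    rewards = []
    for completion, ref in zip(completions, answer):
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        rewards.append(1.0 if numbers and numbers[-1] == str(ref) else 0.0)
    return rewards

def brevity_reward(completions, **kwargs):
    """Small penalty per character to discourage rambling (hypothetical shaping term)."""
    return [-0.001 * len(c) for c in completions]
```

The trainer combines the individual reward terms; because correctness here is verifiable by string matching, this is exactly the regime where GRPO shines.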
## Core Concepts

### Post-Training Pipeline
Base Model → SFT → Preference Alignment (DPO/PPO) → Safety Tuning → Deployment
1. SFT: Teach instruction following with curated examples
2. Preference: Align with human preferences using ranked outputs
3. Safety: Apply constitutional AI or safety-specific training
4. Evaluation: Benchmark against MT-Bench, AlpacaEval, etc.
### Technique Comparison
| Technique | Data Required | Complexity | Memory | Quality |
|---|---|---|---|---|
| SFT | Instruction-response pairs | Low | Low | Good baseline |
| DPO | Preference pairs (chosen/rejected) | Medium | Medium | Strong |
| GRPO | Prompts + reward functions | Medium-High | Medium | Excellent for verifiable tasks |
| PPO | Prompts + reward model | High | High (4 models) | Best for general alignment |
### Framework Selection
| Scale | Framework | Best For |
|---|---|---|
| Single GPU | TRL | SFT, DPO, small GRPO |
| Multi-GPU | TRL + DeepSpeed | DPO, GRPO up to 70B |
| Multi-node | OpenRLHF | PPO, GRPO at scale |
| Enterprise (1TB+) | Miles | MoE models, FP8 training |
| Research | torchforge | Custom RL algorithms |
## Configuration

### SFT Parameters
| Parameter | Recommended | Description |
|---|---|---|
| `learning_rate` | 2e-5 | Standard for SFT |
| `num_train_epochs` | 2-3 | More epochs risk overfitting |
| `max_seq_length` | 2048-4096 | Match your target use case |
| `bf16` | True | bfloat16 mixed-precision training |
### DPO Parameters
| Parameter | Recommended | Description |
|---|---|---|
| `beta` | 0.1 | KL penalty strength |
| `learning_rate` | 5e-7 | Lower than SFT |
| `num_train_epochs` | 1 | DPO converges quickly |
| `max_length` | 2048 | Max sequence length |
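The `beta` above is the β in the DPO objective, loss = −log σ(β[(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))]). A worked single-pair computation with invented sequence log-probabilities:

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin of log-ratios)."""
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Invented log-probs: the policy has drifted slightly toward the chosen response.
loss = dpo_loss(-12.0, -15.0, -13.0, -14.0, beta=0.1)
```

A larger `beta` amplifies the same log-ratio margin, penalizing divergence from the reference model more strongly — which is why it is the first knob to turn when alignment degrades the model.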
### PPO/GRPO Parameters
| Parameter | Recommended | Description |
|---|---|---|
| `learning_rate` | 1e-6 | Very low for stability |
| `kl_coef` | 0.01-0.1 | KL divergence penalty |
| `num_generations` | 4-8 | Group size for GRPO |
| `clip_range` | 0.2 | Policy update clipping |
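`num_generations` is the group size GRPO uses in place of a value model: each prompt is sampled several times and every completion's reward is normalized against its own group. A minimal framework-free sketch of that normalization:

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: (reward - group mean) / (group std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled completions for one prompt, two of which passed a verifier (invented).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions above the group mean get positive advantages and are reinforced; the policy update itself is then clipped with `clip_range`, as in PPO.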
## Best Practices
- Always start with SFT — establish a strong instruction-following baseline before alignment
- Use DPO for most preference tasks — simpler than PPO, often comparable results
- Reserve PPO for complex reward signals that can't be captured by preference pairs
- Use GRPO for verifiable tasks — math, code, structured output where correctness is checkable
- Curate high-quality SFT data — 10K excellent examples beat 100K noisy ones
- Evaluate continuously — track MT-Bench, safety benchmarks, and domain-specific metrics throughout training
## Common Issues
Model degrades after alignment: The base SFT model was likely undertrained. Ensure SFT quality is strong before applying DPO/PPO. Increase `beta` in DPO to constrain divergence from the reference model.
DPO not improving over SFT: Check preference data quality — chosen responses must be meaningfully better than rejected ones. Ensure the preference pairs represent real quality differences, not arbitrary labeling.
PPO training collapse: Monitor KL divergence and reward scores. If KL exceeds 15, reduce learning rate. If reward saturates early, the reward model may be too simple — train a stronger one.
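The KL figures worth monitoring can be estimated directly from per-token log-probs. A sketch using the low-bias "k3" estimator (an assumption here — frameworks differ in which estimator they log): with r = exp(ref_lp − policy_lp), each token contributes (r − 1) − log r, which is always non-negative.

```python
import math

def kl_estimate(policy_logprobs, ref_logprobs):
    """Mean per-token estimate of KL(policy || ref) via (r - 1) - log(r)."""
    total = 0.0
    for policy_lp, ref_lp in zip(policy_logprobs, ref_logprobs):
        log_r = ref_lp - policy_lp
        total += math.exp(log_r) - 1.0 - log_r
    return total / len(policy_logprobs)

# Identical models give exactly zero; any divergence is reported as positive.
```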