# Post-Training Expert
Comprehensive guide to post-training techniques for large language models including RLHF, DPO, GRPO, SFT, and reward modeling, with framework recommendations for different scales.
## When to Use
Post-training is needed when:
- Aligning a base model with human preferences (helpfulness, harmlessness)
- Teaching specific output formats or behaviors
- Improving reasoning, coding, or domain-specific capabilities
- Reducing harmful or unsafe outputs
Choose the right technique:
- SFT — supervised fine-tuning on instruction-response pairs (simplest)
- DPO — direct preference optimization from preference pairs (moderate complexity)
- GRPO — group relative policy optimization with custom rewards (RL-based)
- PPO — proximal policy optimization with reward models (most complex)
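Concretely, these techniques expect differently shaped training records. A minimal sketch of the formats, using TRL's default field names (the example text itself is invented):

```python
# SFT: plain instruction-response pairs ("prompt"/"completion" in TRL's format)
sft_example = {
    "prompt": "Summarize: The cat sat on the mat.",
    "completion": "A cat rested on a mat.",
}

# DPO: the same prompt with a preferred and a dispreferred response
dpo_example = {
    "prompt": "Explain recursion in one sentence.",
    "chosen": "Recursion is when a function solves a problem by calling itself on smaller inputs.",
    "rejected": "Recursion is a programming thing.",
}

# GRPO: prompts only -- rewards come from functions, not labels.
# "answer" is a hypothetical extra column a reward function could check.
grpo_example = {"prompt": "What is 17 * 3?", "answer": "51"}
```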
## Quick Start

### SFT (Start Here)
```python
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

config = SFTConfig(
    output_dir="./sft-output",
    max_seq_length=2048,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    num_train_epochs=3,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=config,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
    train_dataset=train_dataset,  # instruction-response pairs
)
trainer.train()
```
### DPO (Preference Learning)
```python
from trl import DPOTrainer, DPOConfig

config = DPOConfig(
    output_dir="./dpo-output",
    beta=0.1,
    per_device_train_batch_size=4,
    learning_rate=5e-7,
    num_train_epochs=1,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,  # frozen copy of the SFT model
    args=config,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
    train_dataset=preference_dataset,  # chosen + rejected pairs
)
trainer.train()
```
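GRPO replaces the learned reward model with programmatic reward functions. A sketch of that pattern (pure Python; TRL's `GRPOTrainer` accepts a list of such functions via `reward_funcs` and scores each sampled completion — the `answer` column and both function names here are illustrative assumptions):

```python
import re

def correctness_reward(completions, answer, **kwargs):
    """1.0 if the last number in the completion matches the reference answer, else 0.0."""
    rewards = []
    for completion, ref in zip(completions, answer):
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        rewards.append(1.0 if numbers and numbers[-1] == str(ref) else 0.0)
    return rewards

def brevity_reward(completions, **kwargs):
    """Small penalty per character to discourage rambling (hypothetical shaping term)."""
    return [-0.001 * len(c) for c in completions]
```

The trainer combines the individual reward terms; because correctness here is verifiable by string matching, this is exactly the regime where GRPO shines.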
## Core Concepts

### Post-Training Pipeline
Base Model → SFT → Preference Alignment (DPO/PPO) → Safety Tuning → Deployment
1. SFT: Teach instruction following with curated examples
2. Preference: Align with human preferences using ranked outputs
3. Safety: Apply constitutional AI or safety-specific training
4. Evaluation: Benchmark against MT-Bench, AlpacaEval, etc.
### Technique Comparison
| Technique | Data Required | Complexity | Memory | Quality |
|---|---|---|---|---|
| SFT | Instruction-response pairs | Low | Low | Good baseline |
| DPO | Preference pairs (chosen/rejected) | Medium | Medium | Strong |
| GRPO | Prompts + reward functions | Medium-High | Medium | Excellent for verifiable tasks |
| PPO | Prompts + reward model | High | High (4 models) | Best for general alignment |
### Framework Selection
| Scale | Framework | Best For |
|---|---|---|
| Single GPU | TRL | SFT, DPO, small GRPO |
| Multi-GPU | TRL + DeepSpeed | DPO, GRPO up to 70B |
| Multi-node | OpenRLHF | PPO, GRPO at scale |
| Enterprise (1TB+) | Miles | MoE models, FP8 training |
| Research | torchforge | Custom RL algorithms |
## Configuration

### SFT Parameters
| Parameter | Recommended | Description |
|---|---|---|
| `learning_rate` | 2e-5 | Standard for SFT |
| `num_train_epochs` | 2-3 | More epochs risk overfitting |
| `max_seq_length` | 2048-4096 | Match your target use case |
| `bf16` | True | bfloat16 mixed-precision training |
### DPO Parameters
| Parameter | Recommended | Description |
|---|---|---|
| `beta` | 0.1 | KL penalty strength |
| `learning_rate` | 5e-7 | Lower than SFT |
| `num_train_epochs` | 1 | DPO converges quickly |
| `max_length` | 2048 | Max sequence length |
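The `beta` above is the β in the DPO objective, loss = −log σ(β[(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))]). A worked single-pair computation with invented sequence log-probabilities:

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin of log-ratios)."""
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Invented log-probs: the policy has drifted slightly toward the chosen response.
loss = dpo_loss(-12.0, -15.0, -13.0, -14.0, beta=0.1)
```

A larger `beta` amplifies the same log-ratio margin, penalizing divergence from the reference model more strongly — which is why it is the first knob to turn when alignment degrades the model.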
### PPO/GRPO Parameters
| Parameter | Recommended | Description |
|---|---|---|
| `learning_rate` | 1e-6 | Very low for stability |
| `kl_coef` | 0.01-0.1 | KL divergence penalty |
| `num_generations` | 4-8 | Group size for GRPO |
| `clip_range` | 0.2 | Policy update clipping |
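`num_generations` is the group size GRPO uses in place of a value model: each prompt is sampled several times and every completion's reward is normalized against its own group. A minimal framework-free sketch of that normalization:

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: (reward - group mean) / (group std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled completions for one prompt, two of which passed a verifier (invented).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions above the group mean get positive advantages and are reinforced; the policy update itself is then clipped with `clip_range`, as in PPO.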
## Best Practices
- Always start with SFT — establish a strong instruction-following baseline before alignment
- Use DPO for most preference tasks — simpler than PPO, often comparable results
- Reserve PPO for complex reward signals that can't be captured by preference pairs
- Use GRPO for verifiable tasks — math, code, structured output where correctness is checkable
- Curate high-quality SFT data — 10K excellent examples beat 100K noisy ones
- Evaluate continuously — track MT-Bench, safety benchmarks, and domain-specific metrics throughout training
## Common Issues
Model degrades after alignment: The base SFT model was likely undertrained. Ensure SFT quality is strong before applying DPO/PPO. Increase `beta` in DPO to constrain divergence from the reference model.
DPO not improving over SFT: Check preference data quality — chosen responses must be meaningfully better than rejected ones. Ensure the preference pairs represent real quality differences, not arbitrary labeling.
PPO training collapse: Monitor KL divergence and reward scores. If KL exceeds 15, reduce learning rate. If reward saturates early, the reward model may be too simple — train a stronger one.
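The KL figures worth monitoring can be estimated directly from per-token log-probs. A sketch using the low-bias "k3" estimator (an assumption here — frameworks differ in which estimator they log): with r = exp(ref_lp − policy_lp), each token contributes (r − 1) − log r, which is always non-negative.

```python
import math

def kl_estimate(policy_logprobs, ref_logprobs):
    """Mean per-token estimate of KL(policy || ref) via (r - 1) - log(r)."""
    total = 0.0
    for policy_lp, ref_lp in zip(policy_logprobs, ref_logprobs):
        log_r = ref_lp - policy_lp
        total += math.exp(log_r) - 1.0 - log_r
    return total / len(policy_logprobs)

# Identical models give exactly zero; any divergence is reported as positive.
```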