
Post-Training Expert

Comprehensive guide to post-training techniques for large language models including RLHF, DPO, GRPO, SFT, and reward modeling, with framework recommendations for different scales.

When to Use

Post-training is needed when:

  • Aligning a base model with human preferences (helpfulness, harmlessness)
  • Teaching specific output formats or behaviors
  • Improving reasoning, coding, or domain-specific capabilities
  • Reducing harmful or unsafe outputs

Choose the right technique:

  • SFT — supervised fine-tuning on instruction-response pairs (simplest)
  • DPO — direct preference optimization from preference pairs (moderate complexity)
  • GRPO — group relative policy optimization with custom rewards (RL-based)
  • PPO — proximal policy optimization with reward models (most complex)

Quick Start

SFT (Start Here)

```python
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

config = SFTConfig(
    output_dir="./sft-output",
    max_seq_length=2048,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    num_train_epochs=3,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=config,
    processing_class=tokenizer,
    train_dataset=train_dataset,  # instruction-response pairs
)
trainer.train()
```
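`train_dataset` above must carry the instruction-response pairs in a format SFTTrainer understands; TRL accepts a conversational `messages` column, among others. A minimal sketch of building such rows from raw pairs (the sample pairs here are purely illustrative):

```python
# Illustrative raw instruction-response pairs.
raw_pairs = [
    ("Summarize photosynthesis in one sentence.",
     "Photosynthesis converts light, water, and CO2 into glucose and oxygen."),
    ("Translate 'good morning' to French.",
     "Bonjour."),
]

def to_messages(instruction, response):
    """Wrap one instruction-response pair as a chat transcript row."""
    return {
        "messages": [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": response},
        ]
    }

train_rows = [to_messages(i, r) for i, r in raw_pairs]
```

Wrap `train_rows` with `datasets.Dataset.from_list(train_rows)` before passing it as `train_dataset`.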

DPO (Preference Learning)

```python
from trl import DPOTrainer, DPOConfig

config = DPOConfig(
    output_dir="./dpo-output",
    beta=0.1,
    per_device_train_batch_size=4,
    learning_rate=5e-7,
    num_train_epochs=1,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,  # frozen copy of the SFT model
    args=config,
    processing_class=tokenizer,
    train_dataset=preference_dataset,  # chosen + rejected pairs
)
trainer.train()
```
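`preference_dataset` above is read from `prompt`, `chosen`, and `rejected` columns. A minimal sketch of one such record plus a sanity check (the content and the validator are illustrative):

```python
# One preference record in the prompt/chosen/rejected layout DPOTrainer reads.
preference_row = {
    "prompt": "Explain recursion to a beginner.",
    "chosen": (
        "Recursion is when a function calls itself on a smaller piece of the "
        "problem until it reaches a case simple enough to answer directly."
    ),
    "rejected": "Recursion is recursion.",
}

def is_valid_preference(row):
    """Sanity check: all fields present and chosen differs from rejected."""
    required = {"prompt", "chosen", "rejected"}
    return required <= row.keys() and row["chosen"] != row["rejected"]
```

Running every row through a check like this before training catches the degenerate pairs that make DPO loss uninformative.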

Core Concepts

Post-Training Pipeline

Base Model → SFT → Preference Alignment (DPO/PPO) → Safety Tuning → Deployment

1. SFT: Teach instruction following with curated examples
2. Preference: Align with human preferences using ranked outputs
3. Safety: Apply constitutional AI or safety-specific training
4. Evaluation: Benchmark against MT-Bench, AlpacaEval, etc.

Technique Comparison

| Technique | Data Required | Complexity | Memory | Quality |
|---|---|---|---|---|
| SFT | Instruction-response pairs | Low | Low | Good baseline |
| DPO | Preference pairs (chosen/rejected) | Medium | Medium | Strong |
| GRPO | Prompts + reward functions | Medium-High | Medium | Excellent for verifiable tasks |
| PPO | Prompts + reward model | High | High (4 models) | Best for general alignment |

Framework Selection

| Scale | Framework | Best For |
|---|---|---|
| Single GPU | TRL | SFT, DPO, small GRPO |
| Multi-GPU | TRL + DeepSpeed | DPO, GRPO up to 70B |
| Multi-node | OpenRLHF | PPO, GRPO at scale |
| Enterprise (1TB+) | Miles | MoE models, FP8 training |
| Research | torchforge | Custom RL algorithms |

Configuration

SFT Parameters

| Parameter | Recommended | Description |
|---|---|---|
| learning_rate | 2e-5 | Standard for SFT |
| num_train_epochs | 2-3 | More epochs risk overfitting |
| max_seq_length | 2048-4096 | Match your target use case |
| bf16 | True | bfloat16 mixed-precision training |

DPO Parameters

| Parameter | Recommended | Description |
|---|---|---|
| beta | 0.1 | KL penalty strength |
| learning_rate | 5e-7 | Lower than SFT |
| num_train_epochs | 1 | DPO converges quickly |
| max_length | 2048 | Max sequence length |

PPO/GRPO Parameters

| Parameter | Recommended | Description |
|---|---|---|
| learning_rate | 1e-6 | Very low for stability |
| kl_coef | 0.01-0.1 | KL divergence penalty |
| num_generations | 4-8 | Group size for GRPO |
| clip_range | 0.2 | Policy update clipping |
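GRPO needs only prompts and a programmatic reward. A minimal sketch of a verifiable reward in the `(prompts, completions, ...)` shape that TRL reward functions use, assuming plain-text completions and a dataset with an `answer` column; the regex-based answer extraction and the commented wiring are illustrative, not a recommended implementation:

```python
import re

def exact_answer_reward(prompts, completions, answer, **kwargs):
    """Score 1.0 when the completion's final number matches the reference
    answer, else 0.0 -- a checkable signal for math-style prompts."""
    rewards = []
    for completion, ref in zip(completions, answer):
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        rewards.append(1.0 if numbers and numbers[-1] == str(ref) else 0.0)
    return rewards

# Wiring it into training would look roughly like:
# trainer = GRPOTrainer(
#     model=model,
#     reward_funcs=exact_answer_reward,
#     args=GRPOConfig(num_generations=8, learning_rate=1e-6),
#     train_dataset=prompt_dataset,  # columns: "prompt", "answer"
# )
```

Because the reward is computed, not learned, there is no reward model to train or to hack, which is why GRPO suits verifiable tasks.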

Best Practices

  1. Always start with SFT — establish a strong instruction-following baseline before alignment
  2. Use DPO for most preference tasks — simpler than PPO, often comparable results
  3. Reserve PPO for complex reward signals that can't be captured by preference pairs
  4. Use GRPO for verifiable tasks — math, code, structured output where correctness is checkable
  5. Curate high-quality SFT data — 10K excellent examples beat 100K noisy ones
  6. Evaluate continuously — track MT-Bench, safety benchmarks, and domain-specific metrics throughout training
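Practice 5 can be sketched as a simple curation pass that deduplicates and drops degenerate examples before SFT; the length threshold here is an illustrative heuristic, not a recommended value:

```python
def curate_sft_data(pairs, min_response_chars=20):
    """Keep unique instruction-response pairs with non-trivial responses.

    `pairs` is a list of (instruction, response) tuples; the character
    threshold is an illustrative stand-in for real quality filters.
    """
    seen = set()
    kept = []
    for instruction, response in pairs:
        key = (instruction.strip().lower(), response.strip().lower())
        if key in seen:
            continue  # drop exact duplicates
        if len(response.strip()) < min_response_chars:
            continue  # drop degenerate or near-empty responses
        seen.add(key)
        kept.append((instruction, response))
    return kept
```

Real pipelines layer stronger filters on top (LLM-as-judge scoring, decontamination against eval sets), but even this pass removes the noise that most damages small SFT runs.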

Common Issues

Model degrades after alignment: The base SFT model was likely undertrained. Ensure SFT quality is strong before applying DPO/PPO. Increase beta in DPO to constrain divergence from the reference model.

DPO not improving over SFT: Check preference data quality — chosen responses must be meaningfully better than rejected ones. Ensure the preference pairs represent real quality differences, not arbitrary labeling.

PPO training collapse: Monitor KL divergence and reward scores. If KL exceeds 15, reduce learning rate. If reward saturates early, the reward model may be too simple — train a stronger one.
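The KL guard described above can be sketched as a check run at each logging step; the threshold of 15 comes from the text, while the halving response is an illustrative choice:

```python
def kl_guard(kl_value, learning_rate, kl_limit=15.0):
    """Halve the learning rate when KL divergence from the reference
    policy exceeds the limit; otherwise leave it unchanged.

    Returns (new_learning_rate, was_reduced). The halving factor is an
    illustrative choice -- the point is to shrink step size once the
    policy drifts too far from the reference.
    """
    if kl_value > kl_limit:
        return learning_rate * 0.5, True
    return learning_rate, False
```

In practice this logic lives in a trainer callback alongside reward logging, so both failure modes (KL blow-up and reward saturation) are caught from the same metrics stream.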
