Rapid Model Training TRL Toolkit
All-in-one skill for training language models with SFT, DPO, and GRPO. Built for Claude Code with best practices and real-world patterns.
Model Training with TRL Toolkit
Advanced model training guide using TRL (Transformer Reinforcement Learning) for fine-tuning language models with SFT, RLHF, DPO, and reward modeling techniques on HuggingFace infrastructure.
When to Use This Skill
Choose Model Training with TRL when:
- Fine-tuning LLMs with supervised learning (SFT)
- Implementing RLHF (Reinforcement Learning from Human Feedback)
- Training with Direct Preference Optimization (DPO)
- Building reward models for preference-based training
- Aligning models with human preferences efficiently
Consider alternatives when:
- Need basic fine-tuning without alignment — use standard Transformers Trainer
- Need API-based model customization — use provider fine-tuning APIs
- Need pre-training from scratch — use Megatron-LM or NeMo
Quick Start
```bash
# Install TRL
pip install trl peft accelerate bitsandbytes

# Activate TRL toolkit
claude skill activate rapid-model-training-trl-toolkit

# Fine-tune a model
claude "Fine-tune Llama-3 with DPO using our preference dataset"
```
Example: SFT + DPO Training Pipeline
```python
from trl import SFTTrainer, DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig
from datasets import load_dataset

# Stage 1: Supervised Fine-Tuning (SFT)
model_name = "meta-llama/Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

sft_trainer = SFTTrainer(
    model=model_name,
    train_dataset=load_dataset("my-org/instruction-data", split="train"),
    peft_config=lora_config,
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="./sft-output",
        num_train_epochs=1,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        bf16=True,
    ),
)
sft_trainer.train()
sft_model = sft_trainer.model

# Stage 2: Direct Preference Optimization (DPO)
dpo_dataset = load_dataset("my-org/preference-data", split="train")

dpo_config = DPOConfig(
    output_dir="./dpo-output",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    beta=0.1,  # KL penalty coefficient
    bf16=True,
    remove_unused_columns=False,
)

dpo_trainer = DPOTrainer(
    model=sft_model,
    ref_model=None,  # with PEFT, the frozen base weights serve as the implicit reference
    args=dpo_config,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)
dpo_trainer.train()
```
Core Concepts
Training Methods
| Method | Description | Data Required |
|---|---|---|
| SFT | Supervised fine-tuning on instruction data | (instruction, response) pairs |
| RLHF | RL with human feedback via reward model | Preferences + reward model |
| DPO | Direct preference optimization (no reward model) | (prompt, chosen, rejected) triples |
| KTO | Kahneman-Tversky Optimization | (prompt, response, good/bad) labels |
| ORPO | Odds Ratio Preference Optimization | (prompt, chosen, rejected) triples |
| PPO | Proximal Policy Optimization | Reward model scores |
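To make the "(prompt, chosen, rejected) triples" concrete, here is a minimal sketch of the record shape DPO-style trainers consume. The example text is invented for illustration; real datasets would carry many such records:

```python
# Hypothetical preference records in the (prompt, chosen, rejected) shape
# used by DPO and ORPO. Each record pairs one preferred and one dispreferred
# completion for the same prompt.
preference_pairs = [
    {
        "prompt": "Explain what LoRA does in one sentence.",
        "chosen": "LoRA freezes the base model and trains small low-rank adapter matrices instead.",
        "rejected": "LoRA is a type of long-range radio protocol.",
    },
    {
        "prompt": "What does beta control in DPO?",
        "chosen": "Beta scales the implicit KL penalty that keeps the policy near the reference model.",
        "rejected": "Beta sets the batch size.",
    },
]
```

A dataset like this can be loaded with `datasets.Dataset.from_list(preference_pairs)` and passed directly as `train_dataset`.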
Training Pipeline
| Stage | Purpose | Output |
|---|---|---|
| Pre-training | Learn language structure | Base model |
| SFT | Learn instruction following | SFT model |
| Reward Modeling | Learn human preferences | Reward model |
| RLHF/DPO | Align with preferences | Aligned model |
| Evaluation | Measure quality | Benchmark scores |
| Deployment | Serve in production | Inference endpoint |
Configuration
| Parameter | Description | Default |
|---|---|---|
| training_method | Method: sft, dpo, rlhf, kto, orpo | sft |
| lora_r | LoRA rank | 16 |
| lora_alpha | LoRA alpha scaling | 32 |
| beta | DPO KL penalty coefficient | 0.1 |
| max_seq_length | Maximum sequence length | 2048 |
| gradient_checkpointing | Save memory with recomputation | true |
| quantization | Base model quantization: none, 4bit, 8bit | 4bit |
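As a quick sanity check on the LoRA defaults above: in standard LoRA the adapter update is scaled by `lora_alpha / lora_r` before being added to the frozen weights, so these defaults apply the low-rank update at double strength:

```python
# LoRA applies its low-rank update scaled by alpha / r.
# With the table's defaults (alpha=32, r=16) the effective scaling is 2.0,
# meaning the adapter contribution is amplified relative to its raw output.
lora_r = 16
lora_alpha = 32
lora_scaling = lora_alpha / lora_r  # 2.0
```

Raising `lora_alpha` (or lowering `lora_r`) makes the adapter's influence stronger without changing the number of trainable parameters.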
Best Practices
- Start with SFT before alignment training — A model needs to follow instructions before it can be aligned with preferences. Always start with SFT on high-quality instruction data, then apply DPO or RLHF as a refinement step.
- Use DPO over RLHF for simplicity and stability — DPO achieves comparable results to RLHF without training a separate reward model or dealing with PPO instability. It directly optimizes preferences and requires less compute and hyperparameter tuning.
- Use QLoRA (4-bit quantized LoRA) for training large models on limited hardware — QLoRA enables fine-tuning a 70B parameter model on a single A100 GPU. The quality loss from quantization is minimal when combined with LoRA adaptation.
- Curate high-quality preference data over quantity — 5,000 high-quality preference pairs often outperform 50,000 noisy pairs. Invest in clear preference criteria, consistent annotators, and quality filtering rather than maximizing dataset size.
- Evaluate with diverse benchmarks, not just loss curves — Training loss doesn't predict real-world quality. Evaluate with MT-Bench, AlpacaEval, human evaluations, and domain-specific benchmarks to get a complete picture of model capability.
Common Issues
DPO training loss doesn't decrease or oscillates. Common causes: learning rate too high (try 1e-7 to 5e-6), beta value inappropriate for your data (try 0.05 to 0.5), or preference data has inconsistent labeling. Start with a very low learning rate and increase gradually.
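The sensitivity to beta described above comes straight from the shape of DPO's per-example loss. A minimal reference implementation (scalar log-probabilities, no batching) makes the dynamics easy to probe:

```python
import math

def dpo_loss(beta, logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected):
    """Per-example DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    The margin compares how much more the policy prefers the chosen
    response over the rejected one, relative to the reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy matches the reference, the margin is 0 and the loss
# sits at log(2) ~= 0.693 regardless of beta -- a flat loss near this
# value early in training is therefore expected, not a bug.
```

Because beta multiplies the margin inside the sigmoid, a larger beta makes gradients steeper for small preference gaps, which is why an ill-chosen beta can cause oscillation on noisy preference labels.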
Fine-tuned model generates repetitive or degenerate text. The model may have overfit to training data patterns. Reduce training epochs, increase dropout, use early stopping based on eval loss, and add diverse evaluation prompts. Check that training data doesn't contain excessive repetition.
Out-of-memory errors during training despite using LoRA. Enable gradient checkpointing (gradient_checkpointing=True), reduce batch size and increase gradient accumulation steps, and use 4-bit quantization for the base model. If still OOM, reduce max_seq_length or lora_r.
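A rough back-of-envelope helps when deciding which of the knobs above to reach for first. The sketch below estimates base-model weight memory only (activations, gradients, optimizer state, and LoRA adapters all add on top of this), taking a parameter count in billions and a bit width:

```python
def weight_memory_gb(n_params_billions, bits_per_param):
    """Approximate memory for base-model weights alone, in GB.

    params * (bits / 8) bytes; excludes activations, gradients,
    optimizer state, and the KV cache.
    """
    return n_params_billions * bits_per_param / 8

# An 8B model in bf16 needs ~16 GB just for weights,
# while 4-bit quantization brings that down to ~4 GB.
bf16_gb = weight_memory_gb(8, 16)   # 16.0
four_bit_gb = weight_memory_gb(8, 4)  # 4.0
```

This is why 4-bit quantization of the frozen base model is usually the highest-leverage fix: it cuts the largest fixed memory cost by 4x, whereas shrinking batch size or `max_seq_length` only reduces activation memory.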