
Rapid Model Training TRL Toolkit

All-in-one skill for training language models with SFT, DPO, and GRPO. Built for Claude Code with best practices and real-world patterns.

Skill · Community · ai · v1.0.0 · MIT

Model Training with TRL Toolkit

Advanced model training guide using TRL (Transformer Reinforcement Learning) for fine-tuning language models with SFT, RLHF, DPO, and reward modeling techniques on HuggingFace infrastructure.

When to Use This Skill

Choose Model Training with TRL when:

  • Fine-tuning LLMs with supervised learning (SFT)
  • Implementing RLHF (Reinforcement Learning from Human Feedback)
  • Training with Direct Preference Optimization (DPO)
  • Building reward models for preference-based training
  • Aligning models with human preferences efficiently

Consider alternatives when:

  • Need basic fine-tuning without alignment — use standard Transformers Trainer
  • Need API-based model customization — use provider fine-tuning APIs
  • Need pre-training from scratch — use Megatron-LM or NeMo

Quick Start

```shell
# Install TRL
pip install trl peft accelerate bitsandbytes

# Activate TRL toolkit
claude skill activate rapid-model-training-trl-toolkit

# Fine-tune a model
claude "Fine-tune Llama-3 with DPO using our preference dataset"
```

Example: SFT + DPO Training Pipeline

```python
from trl import SFTTrainer, DPOTrainer, DPOConfig
from transformers import AutoTokenizer, TrainingArguments
from peft import LoraConfig
from datasets import load_dataset

# Stage 1: Supervised Fine-Tuning (SFT)
model_name = "meta-llama/Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

sft_trainer = SFTTrainer(
    model=model_name,
    train_dataset=load_dataset("my-org/instruction-data", split="train"),
    peft_config=lora_config,
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="./sft-output",
        num_train_epochs=1,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        bf16=True,
    ),
)
sft_trainer.train()
sft_model = sft_trainer.model

# Stage 2: Direct Preference Optimization (DPO)
dpo_dataset = load_dataset("my-org/preference-data", split="train")

dpo_config = DPOConfig(
    output_dir="./dpo-output",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    beta=0.1,  # KL penalty coefficient
    bf16=True,
    remove_unused_columns=False,
)

dpo_trainer = DPOTrainer(
    model=sft_model,
    ref_model=None,  # Uses implicit reference model with PEFT
    args=dpo_config,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)
dpo_trainer.train()
```

Core Concepts

Training Methods

| Method | Description | Data Required |
|--------|-------------|---------------|
| SFT | Supervised fine-tuning on instruction data | (instruction, response) pairs |
| RLHF | RL with human feedback via reward model | Preferences + reward model |
| DPO | Direct preference optimization (no reward model) | (prompt, chosen, rejected) triples |
| KTO | Kahneman-Tversky Optimization | (prompt, response, good/bad) labels |
| ORPO | Odds Ratio Preference Optimization | (prompt, chosen, rejected) triples |
| PPO | Proximal Policy Optimization | Reward model scores |
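The "Data Required" column maps onto concrete dataset rows. A minimal sketch of one record per method, using TRL's conventional column names (the prompt/response strings here are made up for illustration):

```python
# SFT: plain instruction-following text, often pre-formatted into one field.
sft_record = {
    "text": "### Instruction:\nSummarize photosynthesis.\n"
            "### Response:\nPlants convert light into chemical energy.",
}

# DPO / ORPO: a prompt plus a preferred and a rejected completion.
dpo_record = {
    "prompt": "Explain recursion in one sentence.",
    "chosen": "Recursion is when a function calls itself on smaller subproblems.",
    "rejected": "Recursion is a programming language.",
}

# KTO: a single completion with a binary desirability label
# (no paired comparison required).
kto_record = {
    "prompt": "Explain recursion in one sentence.",
    "completion": "Recursion is when a function calls itself on smaller inputs.",
    "label": True,  # True = desirable, False = undesirable
}
```

KTO's single-label format is often the easiest to collect, since annotators rate one response at a time instead of comparing pairs.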

Training Pipeline

| Stage | Purpose | Output |
|-------|---------|--------|
| Pre-training | Learn language structure | Base model |
| SFT | Learn instruction following | SFT model |
| Reward Modeling | Learn human preferences | Reward model |
| RLHF/DPO | Align with preferences | Aligned model |
| Evaluation | Measure quality | Benchmark scores |
| Deployment | Serve in production | Inference endpoint |

Configuration

| Parameter | Description | Default |
|-----------|-------------|---------|
| training_method | Method: sft, dpo, rlhf, kto, orpo | sft |
| lora_r | LoRA rank | 16 |
| lora_alpha | LoRA alpha scaling | 32 |
| beta | DPO KL penalty coefficient | 0.1 |
| max_seq_length | Maximum sequence length | 2048 |
| gradient_checkpointing | Save memory with recomputation | true |
| quantization | Base model quantization: none, 4bit, 8bit | 4bit |
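A minimal sketch of resolving and validating these parameters before a run. The defaults and allowed values mirror the table; the `resolve_config` helper is a hypothetical convenience, not part of TRL's API:

```python
ALLOWED = {
    "training_method": {"sft", "dpo", "rlhf", "kto", "orpo"},
    "quantization": {"none", "4bit", "8bit"},
}
DEFAULTS = {
    "training_method": "sft",
    "lora_r": 16,
    "lora_alpha": 32,
    "beta": 0.1,
    "max_seq_length": 2048,
    "gradient_checkpointing": True,
    "quantization": "4bit",
}

def resolve_config(overrides=None):
    # Merge user overrides onto defaults, then reject unknown enum values early.
    cfg = {**DEFAULTS, **(overrides or {})}
    for key, allowed in ALLOWED.items():
        if cfg[key] not in allowed:
            raise ValueError(f"{key} must be one of {sorted(allowed)}, got {cfg[key]!r}")
    return cfg
```

Failing fast on an invalid `training_method` or `quantization` value is cheaper than discovering the typo hours into a training run.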

Best Practices

  1. Start with SFT before alignment training — A model needs to follow instructions before it can be aligned with preferences. Always start with SFT on high-quality instruction data, then apply DPO or RLHF as a refinement step.

  2. Use DPO over RLHF for simplicity and stability — DPO achieves comparable results to RLHF without training a separate reward model or dealing with PPO instability. It directly optimizes preferences and requires less compute and hyperparameter tuning.

  3. Use QLoRA (4-bit quantized LoRA) for training large models on limited hardware — QLoRA enables fine-tuning a 70B parameter model on a single A100 GPU. The quality loss from quantization is minimal when combined with LoRA adaptation.

  4. Curate high-quality preference data over quantity — 5,000 high-quality preference pairs often outperform 50,000 noisy pairs. Invest in clear preference criteria, consistent annotators, and quality filtering rather than maximizing dataset size.

  5. Evaluate with diverse benchmarks, not just loss curves — Training loss doesn't predict real-world quality. Evaluate with MT-Bench, AlpacaEval, human evaluations, and domain-specific benchmarks to get a complete picture of model capability.
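The QLoRA setup from practice 3 can be sketched as follows. This is a minimal sketch assuming transformers, peft, and bitsandbytes are installed and a CUDA GPU is available; the rank and target modules are illustrative, not tuned values:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization for the frozen base weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
# Casts layer norms to fp32 and prepares the quantized model for training
model = prepare_model_for_kbit_training(model)

# Trainable LoRA adapters sit on top of the frozen 4-bit base
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```

Only the LoRA adapters receive gradients; the 4-bit base weights stay frozen, which is what makes single-GPU training of large models feasible.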

Common Issues

DPO training loss doesn't decrease or oscillates. Common causes: learning rate too high (try 1e-7 to 5e-6), beta value inappropriate for your data (try 0.05 to 0.5), or preference data has inconsistent labeling. Start with a very low learning rate and increase gradually.
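The learning-rate and beta ranges above lend themselves to a small pilot sweep before committing to a full run. A sketch of generating the grid (the specific values are the ranges suggested above; each dict is the kind of kwargs you would forward to `DPOConfig` for a short run):

```python
from itertools import product

# Ranges from the troubleshooting advice above
learning_rates = [1e-7, 5e-7, 1e-6, 5e-6]
betas = [0.05, 0.1, 0.3, 0.5]

# One short pilot configuration per (lr, beta) combination
sweep = [
    {"learning_rate": lr, "beta": beta, "num_train_epochs": 1}
    for lr, beta in product(learning_rates, betas)
]
print(len(sweep))  # 16 pilot configurations
```

Running each configuration on a small data subset and keeping the one with the smoothest loss curve is usually cheaper than debugging a diverged full-scale run.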

Fine-tuned model generates repetitive or degenerate text. The model may have overfit to training data patterns. Reduce training epochs, increase dropout, use early stopping based on eval loss, and add diverse evaluation prompts. Check that training data doesn't contain excessive repetition.

Out-of-memory errors during training despite using LoRA. Enable gradient checkpointing (gradient_checkpointing=True), reduce batch size and increase gradient accumulation steps, and use 4-bit quantization for the base model. If still OOM, reduce max_seq_length or lora_r.
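It helps to see why LoRA alone doesn't prevent OOM: the trainable adapter parameters are tiny, so memory is dominated by the frozen base weights, optimizer state, and activations (which gradient checkpointing trades for recomputation). A back-of-envelope count, assuming 32 layers, 4 adapted projections per layer, and square 4096×4096 matrices for simplicity:

```python
def lora_param_count(r, d_in, d_out):
    # Each adapted weight W (d_out x d_in) gains A (r x d_in) and B (d_out x r).
    return r * (d_in + d_out)

layers, projections, d, rank = 32, 4, 4096, 16
trainable = layers * projections * lora_param_count(rank, d, d)
print(f"{trainable / 1e6:.1f}M trainable LoRA parameters")  # 16.8M
```

Roughly 17M trainable parameters against an 8B-parameter base: cutting `lora_r` barely moves total memory, which is why quantizing the base model and checkpointing activations are the effective levers.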
