
Rapid Model Training TRL Toolkit

All-in-one skill for training language models with SFT, DPO, and GRPO. Built for Claude Code with best practices and real-world patterns.

Skill · Community · ai · v1.0.0 · MIT

Model Training with TRL Toolkit

Advanced model training guide using TRL (Transformer Reinforcement Learning) for fine-tuning language models with SFT, RLHF, DPO, and reward modeling techniques on HuggingFace infrastructure.

When to Use This Skill

Choose Model Training with TRL when:

  • Fine-tuning LLMs with supervised learning (SFT)
  • Implementing RLHF (Reinforcement Learning from Human Feedback)
  • Training with Direct Preference Optimization (DPO)
  • Building reward models for preference-based training
  • Aligning models with human preferences efficiently

Consider alternatives when:

  • Need basic fine-tuning without alignment — use standard Transformers Trainer
  • Need API-based model customization — use provider fine-tuning APIs
  • Need pre-training from scratch — use Megatron-LM or NeMo

Quick Start

```shell
# Install TRL
pip install trl peft accelerate bitsandbytes

# Activate TRL toolkit
claude skill activate rapid-model-training-trl-toolkit

# Fine-tune a model
claude "Fine-tune Llama-3 with DPO using our preference dataset"
```

Example: SFT + DPO Training Pipeline

```python
from trl import SFTTrainer, DPOTrainer, DPOConfig
from transformers import AutoTokenizer, TrainingArguments
from peft import LoraConfig
from datasets import load_dataset

# Stage 1: Supervised Fine-Tuning (SFT)
model_name = "meta-llama/Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

sft_trainer = SFTTrainer(
    model=model_name,
    train_dataset=load_dataset("my-org/instruction-data", split="train"),
    peft_config=lora_config,
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="./sft-output",
        num_train_epochs=1,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        bf16=True,
    ),
)
sft_trainer.train()
sft_model = sft_trainer.model

# Stage 2: Direct Preference Optimization (DPO)
dpo_dataset = load_dataset("my-org/preference-data", split="train")

dpo_config = DPOConfig(
    output_dir="./dpo-output",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    beta=0.1,  # KL penalty coefficient
    bf16=True,
    remove_unused_columns=False,
)

dpo_trainer = DPOTrainer(
    model=sft_model,
    ref_model=None,  # Uses implicit reference model with PEFT
    args=dpo_config,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)
dpo_trainer.train()
```

Core Concepts

Training Methods

| Method | Description | Data Required |
|--------|-------------|---------------|
| SFT | Supervised fine-tuning on instruction data | (instruction, response) pairs |
| RLHF | RL with human feedback via reward model | Preferences + reward model |
| DPO | Direct preference optimization (no reward model) | (prompt, chosen, rejected) triples |
| KTO | Kahneman-Tversky Optimization | (prompt, response, good/bad) labels |
| ORPO | Odds Ratio Preference Optimization | (prompt, chosen, rejected) triples |
| PPO | Proximal Policy Optimization | Reward model scores |
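The "Data Required" column maps onto concrete dataset rows. A minimal sketch of one record per method, using TRL's conventional column names (the prompt/response strings here are made up for illustration):

```python
# SFT: plain instruction-following text, often pre-formatted into one field.
sft_record = {
    "text": "### Instruction:\nSummarize photosynthesis.\n"
            "### Response:\nPlants convert light into chemical energy.",
}

# DPO / ORPO: a prompt plus a preferred and a rejected completion.
dpo_record = {
    "prompt": "Explain recursion in one sentence.",
    "chosen": "Recursion is when a function calls itself on smaller subproblems.",
    "rejected": "Recursion is a programming language.",
}

# KTO: a single completion with a binary desirability label
# (no paired comparison required).
kto_record = {
    "prompt": "Explain recursion in one sentence.",
    "completion": "Recursion is when a function calls itself on smaller inputs.",
    "label": True,  # True = desirable, False = undesirable
}
```

KTO's single-label format is often the easiest to collect, since annotators rate one response at a time instead of comparing pairs.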

Training Pipeline

| Stage | Purpose | Output |
|-------|---------|--------|
| Pre-training | Learn language structure | Base model |
| SFT | Learn instruction following | SFT model |
| Reward Modeling | Learn human preferences | Reward model |
| RLHF/DPO | Align with preferences | Aligned model |
| Evaluation | Measure quality | Benchmark scores |
| Deployment | Serve in production | Inference endpoint |

Configuration

| Parameter | Description | Default |
|-----------|-------------|---------|
| training_method | Method: sft, dpo, rlhf, kto, orpo | sft |
| lora_r | LoRA rank | 16 |
| lora_alpha | LoRA alpha scaling | 32 |
| beta | DPO KL penalty coefficient | 0.1 |
| max_seq_length | Maximum sequence length | 2048 |
| gradient_checkpointing | Save memory with recomputation | true |
| quantization | Base model quantization: none, 4bit, 8bit | 4bit |
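A minimal sketch of resolving and validating these parameters before a run. The defaults and allowed values mirror the table; the `resolve_config` helper is a hypothetical convenience, not part of TRL's API:

```python
ALLOWED = {
    "training_method": {"sft", "dpo", "rlhf", "kto", "orpo"},
    "quantization": {"none", "4bit", "8bit"},
}
DEFAULTS = {
    "training_method": "sft",
    "lora_r": 16,
    "lora_alpha": 32,
    "beta": 0.1,
    "max_seq_length": 2048,
    "gradient_checkpointing": True,
    "quantization": "4bit",
}

def resolve_config(overrides=None):
    # Merge user overrides onto defaults, then reject unknown enum values early.
    cfg = {**DEFAULTS, **(overrides or {})}
    for key, allowed in ALLOWED.items():
        if cfg[key] not in allowed:
            raise ValueError(f"{key} must be one of {sorted(allowed)}, got {cfg[key]!r}")
    return cfg
```

Failing fast on an invalid `training_method` or `quantization` value is cheaper than discovering the typo hours into a training run.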

Best Practices

  1. Start with SFT before alignment training — A model needs to follow instructions before it can be aligned with preferences. Always start with SFT on high-quality instruction data, then apply DPO or RLHF as a refinement step.

  2. Use DPO over RLHF for simplicity and stability — DPO achieves comparable results to RLHF without training a separate reward model or dealing with PPO instability. It directly optimizes preferences and requires less compute and hyperparameter tuning.

  3. Use QLoRA (4-bit quantized LoRA) for training large models on limited hardware — QLoRA enables fine-tuning a 70B parameter model on a single A100 GPU. The quality loss from quantization is minimal when combined with LoRA adaptation.

  4. Curate high-quality preference data over quantity — 5,000 high-quality preference pairs often outperform 50,000 noisy pairs. Invest in clear preference criteria, consistent annotators, and quality filtering rather than maximizing dataset size.

  5. Evaluate with diverse benchmarks, not just loss curves — Training loss doesn't predict real-world quality. Evaluate with MT-Bench, AlpacaEval, human evaluations, and domain-specific benchmarks to get a complete picture of model capability.
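The QLoRA setup from practice 3 can be sketched as follows. This is a minimal sketch assuming transformers, peft, and bitsandbytes are installed and a CUDA GPU is available; the rank and target modules are illustrative, not tuned values:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization for the frozen base weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
# Casts layer norms to fp32 and prepares the quantized model for training
model = prepare_model_for_kbit_training(model)

# Trainable LoRA adapters sit on top of the frozen 4-bit base
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```

Only the LoRA adapters receive gradients; the 4-bit base weights stay frozen, which is what makes single-GPU training of large models feasible.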

Common Issues

DPO training loss doesn't decrease or oscillates. Common causes: learning rate too high (try 1e-7 to 5e-6), beta value inappropriate for your data (try 0.05 to 0.5), or preference data has inconsistent labeling. Start with a very low learning rate and increase gradually.
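The learning-rate and beta ranges above lend themselves to a small pilot sweep before committing to a full run. A sketch of generating the grid (the specific values are the ranges suggested above; each dict is the kind of kwargs you would forward to `DPOConfig` for a short run):

```python
from itertools import product

# Ranges from the troubleshooting advice above
learning_rates = [1e-7, 5e-7, 1e-6, 5e-6]
betas = [0.05, 0.1, 0.3, 0.5]

# One short pilot configuration per (lr, beta) combination
sweep = [
    {"learning_rate": lr, "beta": beta, "num_train_epochs": 1}
    for lr, beta in product(learning_rates, betas)
]
print(len(sweep))  # 16 pilot configurations
```

Running each configuration on a small data subset and keeping the one with the smoothest loss curve is usually cheaper than debugging a diverged full-scale run.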

Fine-tuned model generates repetitive or degenerate text. The model may have overfit to training data patterns. Reduce training epochs, increase dropout, use early stopping based on eval loss, and add diverse evaluation prompts. Check that training data doesn't contain excessive repetition.

Out-of-memory errors during training despite using LoRA. Enable gradient checkpointing (gradient_checkpointing=True), reduce batch size and increase gradient accumulation steps, and use 4-bit quantization for the base model. If still OOM, reduce max_seq_length or lora_r.
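It helps to see why LoRA alone doesn't prevent OOM: the trainable adapter parameters are tiny, so memory is dominated by the frozen base weights, optimizer state, and activations (which gradient checkpointing trades for recomputation). A back-of-envelope count, assuming 32 layers, 4 adapted projections per layer, and square 4096×4096 matrices for simplicity:

```python
def lora_param_count(r, d_in, d_out):
    # Each adapted weight W (d_out x d_in) gains A (r x d_in) and B (d_out x r).
    return r * (d_in + d_out)

layers, projections, d, rank = 32, 4, 4096, 16
trainable = layers * projections * lora_param_count(rank, d, d)
print(f"{trainable / 1e6:.1f}M trainable LoRA parameters")  # 16.8M
```

Roughly 17M trainable parameters against an 8B-parameter base: cutting `lora_r` barely moves total memory, which is why quantizing the base model and checkpointing activations are the effective levers.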
