Fine-Tuning PEFT System
Production-ready skill that handles parameter-efficient fine-tuning. Includes structured workflows, validation checks, and reusable patterns for AI research.
Fine-Tuning with PEFT (Parameter-Efficient Fine-Tuning)
Overview
A comprehensive skill for fine-tuning LLMs using HuggingFace PEFT — the library that implements LoRA, QLoRA, Prefix Tuning, Prompt Tuning, IA3, and other parameter-efficient methods. PEFT enables fine-tuning billion-parameter models on consumer GPUs by updating only a small fraction of parameters, reducing memory by 10-100x while maintaining 95-99% of full fine-tuning quality.
When to Use
- Fine-tuning large models on limited GPU memory
- Need to maintain multiple task-specific adapters efficiently
- Want to fine-tune with <1% of total parameters
- Training on consumer GPUs (RTX 3090, 4090, etc.)
- Need hot-swappable adapters for multi-task serving
- Combining with quantization for extreme efficiency (QLoRA)
Quick Start
```bash
pip install peft transformers accelerate bitsandbytes
```

```python
import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Add LoRA adapter
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,               # Rank — size of update matrices
    lora_alpha=32,      # Scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 8,030,261,248 || trainable%: 0.1698
```
PEFT Methods
LoRA (Low-Rank Adaptation)
```python
from peft import LoraConfig

config = LoraConfig(
    r=16,                          # Rank of update matrices
    lora_alpha=32,                 # Scaling factor (alpha/r)
    target_modules="all-linear",   # Apply to all linear layers
    lora_dropout=0.05,
    bias="none",                   # Don't train biases
    task_type="CAUSAL_LM",
)

# How LoRA works:
# Original: Y = W·X          (W is frozen, d×d)
# LoRA:     Y = W·X + BA·X   (A is r×d, B is d×r; only A and B are trained)
# Memory:   2×d×r << d×d     (r=16 << d=4096)
```
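The arithmetic in the comment above can be checked numerically. A minimal NumPy sketch with illustrative dimensions (this is not the library's implementation, just the math):

```python
import numpy as np

d, r = 4096, 16                    # hidden size and LoRA rank (illustrative)
x = np.random.randn(d)

W = np.random.randn(d, d) * 0.02   # frozen base weight, d×d
A = np.random.randn(r, d) * 0.02   # trainable, r×d
B = np.zeros((d, r))               # trainable, d×r; zero-init so the delta starts at 0

alpha = 32
y = W @ x + (alpha / r) * (B @ (A @ x))   # LoRA forward pass

full_params = d * d
lora_params = A.size + B.size             # 2 × d × r
print(f"trainable fraction: {lora_params / full_params:.4%}")
```

Because B starts at zero, the adapted model is exactly the base model at step 0, which is why LoRA training is stable from the first step.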
QLoRA (Quantized LoRA)
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for training
model = prepare_model_for_kbit_training(model)

# Add LoRA on top of the 4-bit model
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules="all-linear",
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
```
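Why 4-bit loading matters can be seen with back-of-envelope arithmetic. The sketch below counts weight memory only; activations, optimizer state, and quantization overhead (scales, double-quant constants) are ignored, so treat the numbers as rough lower bounds:

```python
def model_weight_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight memory in GiB at a given precision."""
    return n_params * bits_per_param / 8 / 1024**3

n = 8.03e9  # Llama-3-8B parameter count
print(f"bf16: {model_weight_gb(n, 16):.1f} GiB")   # ≈15 GiB
print(f"nf4:  {model_weight_gb(n, 4):.1f} GiB")    # ≈3.7 GiB
```

At 4 bits the weights of an 8B model fit comfortably in a 24GB consumer GPU, leaving headroom for the LoRA parameters, gradients, and activations.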
Training Loop
```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,            # Higher LR for LoRA than full fine-tuning
    bf16=True,
    logging_steps=10,
    save_steps=500,
    evaluation_strategy="steps",
    eval_steps=500,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="paged_adamw_8bit",      # Memory-efficient optimizer
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)
trainer.train()

# Save adapter (small, ~50MB)
model.save_pretrained("./lora-adapter")
```
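With these arguments the effective batch size is 4 × 4 = 16, and the scheduler does linear warmup over the first 10% of steps followed by cosine decay. A minimal pure-Python sketch of that schedule (an approximation of what the trainer computes, not the library code):

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

total = 1000
print(lr_at_step(0, total))      # 0.0 at the start of warmup
print(lr_at_step(100, total))    # peak (2e-4) at the end of warmup
print(lr_at_step(1000, total))   # decays to ~0 at the end of training
```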
PEFT Method Comparison
| Method | Trainable Params | Memory | Quality | Use Case |
|---|---|---|---|---|
| LoRA | 0.1-1% | Low | Excellent | Default choice |
| QLoRA | 0.1-1% | Very low | Very good | Consumer GPUs |
| Prefix Tuning | <0.1% | Very low | Good | Short tasks |
| Prompt Tuning | <0.01% | Minimal | Fair | Few-shot adaptation |
| IA3 | <0.01% | Minimal | Good | Multi-task |
| Full Fine-Tune | 100% | Very high | Best | When you have resources |
Adapter Management
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model + adapter
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8B")
model = PeftModel.from_pretrained(base_model, "./lora-adapter")

# Merge adapter into base model (for deployment)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")

# Hot-swap adapters
model.load_adapter("./adapter-task1", adapter_name="task1")
model.load_adapter("./adapter-task2", adapter_name="task2")
model.set_adapter("task1")  # Switch to task1 adapter
model.set_adapter("task2")  # Switch to task2 adapter
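The hot-swap calls above can back a simple multi-task server. The router below is a hypothetical sketch (the `AdapterRouter` class and its method names are invented, not part of PEFT), shown against a stub model so it runs standalone; in practice `model` would be the `PeftModel` from the previous block:

```python
class AdapterRouter:
    """Hypothetical per-request adapter routing: lazily load, then switch."""
    def __init__(self, model):
        self.model = model
        self.loaded = set()

    def ensure(self, name, path):
        if name not in self.loaded:
            self.model.load_adapter(path, adapter_name=name)  # PEFT call
            self.loaded.add(name)

    def generate(self, name, path, prompt):
        self.ensure(name, path)
        self.model.set_adapter(name)    # switch the active adapter
        return self.model.generate(prompt)


class _StubModel:
    """Stand-in so the sketch runs without downloading a real model."""
    def __init__(self):
        self.active = None
        self.adapters = {}
    def load_adapter(self, path, adapter_name):
        self.adapters[adapter_name] = path
    def set_adapter(self, name):
        self.active = name
    def generate(self, prompt):
        return f"[{self.active}] {prompt}"


router = AdapterRouter(_StubModel())
print(router.generate("task1", "./adapter-task1", "hello"))  # [task1] hello
```

Switching adapters is cheap because only the small LoRA weights change; the frozen base weights stay resident on the GPU.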
Configuration Reference
| Parameter | Default | Description |
|---|---|---|
| `r` | 8 | LoRA rank (higher = more capacity, more memory) |
| `lora_alpha` | 8 | Scaling factor (effective scale = alpha/r) |
| `lora_dropout` | 0.0 | Dropout on LoRA layers |
| `target_modules` | None | Which layers to apply LoRA to |
| `bias` | "none" | `none`, `all`, or `lora_only` |
| `modules_to_save` | None | Full layers to train (e.g., classifier head) |
| `fan_in_fan_out` | False | For Conv1D layers |
Best Practices
- Use rank 16-32 — Lower ranks are faster but may lose quality; higher rarely helps
- Apply to all linear layers — `target_modules="all-linear"` beats selecting specific modules
- Set alpha = 2 × rank — Good default scaling ratio
- Use higher learning rates — LoRA needs 5-10x higher LR than full fine-tuning (2e-4 vs 2e-5)
- Merge for deployment — `.merge_and_unload()` eliminates adapter overhead in production
- Keep adapters small — Typical LoRA adapter is 50-200MB vs 16GB+ for a full model
- Use QLoRA for large models — Enables 70B fine-tuning on a single 48GB GPU
- Hot-swap adapters — Load multiple task-specific adapters and switch at runtime
- Save adapter separately — Only save the adapter, not the full model
- Validate with the base model — Ensure improvements are from LoRA, not dataset overlap
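The 50-200MB adapter-size figure can be sanity-checked with arithmetic. The sketch below treats every targeted projection as d×d, which undercounts the wider MLP projections and overcounts grouped-query key/value projections, so it is only a rough estimate:

```python
def adapter_size_mb(d_model, r, n_target_matrices, bytes_per_param=2):
    """Rough LoRA adapter size in MiB: each targeted matrix adds
    A (r×d) and B (d×r), stored here as bf16 (2 bytes/param)."""
    return n_target_matrices * 2 * d_model * r * bytes_per_param / 1024**2

# Llama-3-8B-ish: 7 targeted projections per layer × 32 layers (illustrative)
print(f"{adapter_size_mb(4096, 16, 7 * 32):.0f} MiB")  # 56 MiB
```

Even doubling the rank to 32 keeps the adapter around 100MiB, orders of magnitude smaller than the 16GB+ base checkpoint.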
Troubleshooting
Loss not decreasing
```python
# Increase rank
lora_config = LoraConfig(r=64, lora_alpha=128)

# Or apply to more layers
lora_config = LoraConfig(target_modules="all-linear")

# Or increase learning rate
training_args = TrainingArguments(learning_rate=5e-4)
```
Quality worse than base model
```python
# Reduce training — likely overfitting
training_args = TrainingArguments(num_train_epochs=1)

# Add dropout
lora_config = LoraConfig(lora_dropout=0.1)

# Check data quality — bad data = bad model
```
Merge fails
```python
# Ensure model and adapter are on the same device
model = model.to("cpu")
merged = model.merge_and_unload()

# For quantized models, dequantize first
```