
Fine-Tuning PEFT System

Production-ready skill covering parameter-efficient fine-tuning. Includes structured workflows, validation checks, and reusable patterns for AI research.

Skill · Cliptics · ai research · v1.0.0 · MIT

Fine-Tuning with PEFT (Parameter-Efficient Fine-Tuning)

Overview

A comprehensive skill for fine-tuning LLMs using HuggingFace PEFT — the library that implements LoRA, QLoRA, Prefix Tuning, Prompt Tuning, IA3, and other parameter-efficient methods. PEFT enables fine-tuning billion-parameter models on consumer GPUs by updating only a small fraction of parameters, reducing memory by 10-100x while maintaining 95-99% of full fine-tuning quality.

When to Use

  • Fine-tuning large models on limited GPU memory
  • Need to maintain multiple task-specific adapters efficiently
  • Want to fine-tune with <1% of total parameters
  • Training on consumer GPUs (RTX 3090, 4090, etc.)
  • Need hot-swappable adapters for multi-task serving
  • Combining with quantization for extreme efficiency (QLoRA)

Quick Start

```bash
pip install peft transformers accelerate bitsandbytes
```

```python
import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Add LoRA adapter
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,             # Rank: size of the update matrices
    lora_alpha=32,    # Scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 8,030,261,248 || trainable%: 0.1698
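The printed parameter count can be sanity-checked with plain arithmetic: each LoRA-wrapped linear layer of shape d_in×d_out adds r·(d_in + d_out) trainable parameters. The sketch below assumes Llama-3-8B shapes (hidden size 4096, KV projection width 1024, MLP width 14336, 32 layers). Note that the 13.6M figure corresponds to the four attention projections alone; the MLP modules in the config roughly triple the total.

```python
# LoRA wraps a linear layer (d_in -> d_out) with two thin matrices,
# A (r x d_in) and B (d_out x r), adding r * (d_in + d_out) trainable params.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

HIDDEN, KV_DIM, INTER, LAYERS, R = 4096, 1024, 14336, 32, 16

# q/o project 4096 -> 4096; k/v project 4096 -> 1024 (grouped-query attention)
attn_per_layer = (2 * lora_params(HIDDEN, HIDDEN, R)
                  + 2 * lora_params(HIDDEN, KV_DIM, R))
mlp_per_layer = (2 * lora_params(HIDDEN, INTER, R)   # gate_proj, up_proj
                 + lora_params(INTER, HIDDEN, R))    # down_proj

print(attn_per_layer * LAYERS)                    # 13631488 (attention only)
print((attn_per_layer + mlp_per_layer) * LAYERS)  # 41943040 (all seven modules)
```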

PEFT Methods

LoRA (Low-Rank Adaptation)

```python
from peft import LoraConfig

config = LoraConfig(
    r=16,                         # Rank of update matrices
    lora_alpha=32,                # Scaling factor (alpha/r)
    target_modules="all-linear",  # Apply to all linear layers
    lora_dropout=0.05,
    bias="none",                  # Don't train biases
    task_type="CAUSAL_LM",
)

# How LoRA works:
#   Original: Y = W·X          (W is frozen, d×d)
#   LoRA:     Y = W·X + B·A·X  (A is r×d, B is d×r; only A and B are trained)
#   Memory:   2×d×r << d×d     (r=16 << d=4096)
```
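The memory comment can be checked with concrete numbers. For the dimensions mentioned there (d = 4096, r = 16), a quick sketch:

```python
d, r = 4096, 16  # hidden size and rank from the comment above

full_update = d * d      # a dense d x d weight update
lora_update = 2 * d * r  # A (r x d) plus B (d x r)

print(full_update, lora_update, full_update // lora_update)
# 16777216 131072 128 -> the low-rank factorization is 128x smaller
```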

QLoRA (Quantized LoRA)

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for training
model = prepare_model_for_kbit_training(model)

# Add LoRA on top of the 4-bit model
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules="all-linear",
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
```
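A back-of-the-envelope memory estimate shows why 4-bit loading plus LoRA fits on consumer GPUs: NF4 stores roughly 0.5 bytes per weight versus 2 bytes for bf16. This sketch covers weights only, ignoring activations, optimizer state, quantization-constant overhead, and the KV cache:

```python
def weight_gib(n_params: float, bytes_per_param: float) -> float:
    """Weight memory in GiB at a given storage precision."""
    return n_params * bytes_per_param / 2**30

N = 8.03e9  # approximate Llama-3-8B parameter count

bf16 = weight_gib(N, 2.0)  # full-precision-ish baseline
nf4 = weight_gib(N, 0.5)   # 4-bit NF4 storage

print(f"bf16: {bf16:.1f} GiB, nf4: {nf4:.1f} GiB")  # bf16: 15.0 GiB, nf4: 3.7 GiB
```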

Training Loop

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,         # Higher LR for LoRA than full fine-tuning
    bf16=True,
    logging_steps=10,
    save_steps=500,
    evaluation_strategy="steps",
    eval_steps=500,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="paged_adamw_8bit",   # Memory-efficient optimizer
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)
trainer.train()

# Save the adapter (small, ~50MB)
model.save_pretrained("./lora-adapter")
```
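Two settings above are easy to misread: the effective batch size is per_device_train_batch_size × gradient_accumulation_steps (16 here, per GPU), and warmup_ratio=0.1 with a cosine schedule means the learning rate ramps linearly over the first 10% of steps, then decays to zero. A standalone sketch of that schedule (an approximation, not the Trainer's internal implementation):

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float = 2e-4,
          warmup_ratio: float = 0.1) -> float:
    """Linear warmup followed by cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

effective_batch = 4 * 4  # per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch)      # 16
print(lr_at(100, 1000))     # peak LR reached as warmup ends: 2e-4
print(lr_at(1000, 1000))    # fully decayed: 0.0
```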

PEFT Method Comparison

| Method | Trainable Params | Memory | Quality | Use Case |
|---|---|---|---|---|
| LoRA | 0.1-1% | Low | Excellent | Default choice |
| QLoRA | 0.1-1% | Very low | Very good | Consumer GPUs |
| Prefix Tuning | <0.1% | Very low | Good | Short tasks |
| Prompt Tuning | <0.01% | Minimal | Fair | Few-shot adaptation |
| IA3 | <0.01% | Minimal | Good | Multi-task |
| Full Fine-Tune | 100% | Very high | Best | When you have resources |

Adapter Management

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load base model + adapter
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = PeftModel.from_pretrained(base_model, "./lora-adapter")

# Merge the adapter into the base model (for deployment)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")

# Hot-swap adapters
model.load_adapter("./adapter-task1", adapter_name="task1")
model.load_adapter("./adapter-task2", adapter_name="task2")
model.set_adapter("task1")  # Switch to the task1 adapter
model.set_adapter("task2")  # Switch to the task2 adapter
```
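Numerically, merging folds each adapter into its base weight as W_merged = W + (alpha/r)·B·A, after which inference is a single matmul with zero adapter overhead. A toy pure-Python illustration of that identity (tiny hand-picked matrices, not PEFT's implementation):

```python
# Minimal matrix helpers so the example has no dependencies
def matmul(M, N):
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def matvec(M, v):
    return [sum(M[i][k] * v[k] for k in range(len(v))) for i in range(len(M))]

d, r, alpha = 2, 1, 2
scaling = alpha / r  # 2.0

W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight (d x d)
B = [[0.5], [1.0]]            # trained, d x r
A = [[2.0, -1.0]]             # trained, r x d
x = [1.0, 1.0]

# Adapter path: y = W.x + scaling * B.(A.x)
y_adapter = [wx + scaling * bax
             for wx, bax in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Merged path: W' = W + scaling * B.A, then a single matmul
BA = matmul(B, A)
W_merged = [[W[i][j] + scaling * BA[i][j] for j in range(d)] for i in range(d)]
y_merged = matvec(W_merged, x)

assert y_adapter == y_merged  # identical outputs, so merging is lossless here
```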

Configuration Reference

| Parameter | Default | Description |
|---|---|---|
| r | 8 | LoRA rank (higher = more capacity, more memory) |
| lora_alpha | 8 | Scaling factor (effective scale = alpha/r) |
| lora_dropout | 0.0 | Dropout on LoRA layers |
| target_modules | None | Which layers to apply LoRA to |
| bias | "none" | "none", "all", or "lora_only" |
| modules_to_save | None | Full layers to train (e.g., classifier head) |
| fan_in_fan_out | False | For Conv1D layers |

Best Practices

  1. Use rank 16-32 — Lower ranks are faster but may lose quality; higher rarely helps
  2. Apply to all linear layers — target_modules="all-linear" beats selecting specific modules
  3. Set alpha = 2 × rank — Good default scaling ratio
  4. Use higher learning rates — LoRA needs 5-10x higher LR than full fine-tuning (2e-4 vs 2e-5)
  5. Merge for deployment — .merge_and_unload() eliminates adapter overhead in production
  6. Keep adapters small — Typical LoRA adapter is 50-200MB vs 16GB+ for full model
  7. Use QLoRA for large models — Enables 70B fine-tuning on a single 48GB GPU
  8. Hot-swap adapters — Load multiple task-specific adapters and switch at runtime
  9. Save adapter separately — Only save the adapter, not the full model
  10. Validate with the base model — Ensure improvements are from LoRA, not dataset overlap
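The first four practices can be rolled into a tiny helper. lora_defaults below is a hypothetical convenience function (not part of PEFT) that returns kwargs consistent with the rules above; split the result between LoraConfig and TrainingArguments as appropriate.

```python
def lora_defaults(r: int = 16) -> dict:
    """Hypothetical helper: kwargs following the best practices above."""
    if not 16 <= r <= 32:
        raise ValueError("practice 1: keep rank in the 16-32 range")
    return {
        "r": r,
        "lora_alpha": 2 * r,             # practice 3: alpha = 2 x rank
        "target_modules": "all-linear",  # practice 2: all linear layers
        "lora_dropout": 0.05,
        "learning_rate": 2e-4,           # practice 4: ~10x the full fine-tuning LR
    }

cfg = lora_defaults(32)
print(cfg["lora_alpha"])  # 64
```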

Troubleshooting

Loss not decreasing

```python
# Increase the rank
lora_config = LoraConfig(r=64, lora_alpha=128)

# Or apply to more layers
lora_config = LoraConfig(target_modules="all-linear")

# Or increase the learning rate
training_args = TrainingArguments(learning_rate=5e-4)
```

Quality worse than base model

```python
# Reduce training — likely overfitting
training_args = TrainingArguments(num_train_epochs=1)

# Add dropout
lora_config = LoraConfig(lora_dropout=0.1)

# Check data quality — bad data = bad model
```

Merge fails

```python
# Ensure the model and adapter are on the same device
model = model.to("cpu")
merged = model.merge_and_unload()

# For quantized models, dequantize first
```