
Emerging Techniques Knowledge Kit

Powerful skill for compressing large language models. Includes structured workflows, validation checks, and reusable patterns for AI research.

Skill · Cliptics · ai research · v1.0.0 · MIT

Knowledge Distillation for LLMs

Overview

A comprehensive skill for compressing large language models through knowledge distillation — transferring capabilities from large "teacher" models to smaller "student" models while retaining 90%+ performance. Covers temperature scaling, logit distillation, response-based distillation, feature-based distillation, and synthetic data generation approaches for creating efficient, deployable models.

When to Use

  • Compressing large models (70B → 7B) for production deployment
  • Reducing inference costs while maintaining quality
  • Transferring proprietary model capabilities to open-source models
  • Creating domain-specific smaller models
  • Distilling task-specific knowledge from general-purpose models
  • Building on-device models from cloud-scale teachers

Quick Start

```shell
# Install dependencies
pip install transformers torch datasets

# Basic distillation setup
pip install distillation-toolkit  # or use HuggingFace
```
```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=3.0, alpha=0.5):
    """Combined distillation + task loss"""
    # Soft targets from teacher (knowledge transfer)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction='batchmean',
    ) * (temperature ** 2)
    # Hard targets (standard cross-entropy)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```
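To sanity-check shapes and signs, the combined loss can be exercised on random logits. The function is repeated here so the snippet runs standalone; the batch size and 10-class "vocabulary" are purely illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=3.0, alpha=0.5):
    # Soft targets from teacher, scaled by T^2 to keep gradient magnitudes stable
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction='batchmean',
    ) * (temperature ** 2)
    # Hard targets (standard cross-entropy against ground-truth labels)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

torch.manual_seed(0)
student_logits = torch.randn(4, 10)          # (batch, vocab)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))

loss = distillation_loss(student_logits, teacher_logits, labels)
print(loss.item())
```

Both terms are non-negative, so the combined loss is always positive on real data; for sequence models with `(batch, seq, vocab)` logits, flatten to `(batch*seq, vocab)` before the cross-entropy term.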

Distillation Methods

1. Response-Based (Logit Distillation)

```python
class LogitDistiller:
    def __init__(self, teacher, student, temperature=3.0, alpha=0.5):
        self.teacher = teacher.eval()
        self.student = student
        self.temperature = temperature
        self.alpha = alpha

    def train_step(self, batch):
        inputs = batch['input_ids']
        labels = batch['labels']
        # Get teacher predictions (no gradient needed)
        with torch.no_grad():
            teacher_logits = self.teacher(inputs).logits
        # Get student predictions
        student_logits = self.student(inputs).logits
        # Combined loss
        loss = distillation_loss(
            student_logits, teacher_logits, labels,
            self.temperature, self.alpha
        )
        return loss
```
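A minimal end-to-end sketch of the same training step, using a hypothetical `TinyLM` module as a stand-in for real HuggingFace teacher/student models (it only mimics the `.logits` output attribute). The hyperparameters and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from types import SimpleNamespace

class TinyLM(nn.Module):
    """Hypothetical stand-in for a causal LM that returns an object with .logits."""
    def __init__(self, vocab=50, dim=16):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)
    def forward(self, input_ids):
        return SimpleNamespace(logits=self.head(self.emb(input_ids)))

torch.manual_seed(0)
teacher, student = TinyLM(dim=32), TinyLM(dim=16)
teacher.eval()
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)
batch = {'input_ids': torch.randint(0, 50, (4, 8)),
         'labels': torch.randint(0, 50, (4, 8))}

T, alpha = 3.0, 0.5
for step in range(5):
    with torch.no_grad():                       # teacher is frozen supervision
        teacher_logits = teacher(batch['input_ids']).logits
    student_logits = student(batch['input_ids']).logits
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean',
    ) * T ** 2
    # Flatten (batch, seq, vocab) -> (batch*seq, vocab) for cross-entropy
    hard = F.cross_entropy(student_logits.view(-1, 50), batch['labels'].view(-1))
    loss = alpha * soft + (1 - alpha) * hard
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(loss.item())
```

Note that only the student's parameters are in the optimizer; the teacher is queried under `torch.no_grad()` exactly as in `train_step` above.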

2. Feature-Based Distillation

```python
class FeatureDistiller:
    """Match intermediate representations between teacher and student"""
    def __init__(self, teacher, student, layer_mapping):
        self.teacher = teacher.eval()
        self.student = student
        self.layer_mapping = layer_mapping  # {student_layer: teacher_layer}
        # Linear projectors bridge mismatched hidden sizes
        self.projectors = {}
        for s_layer, t_layer in layer_mapping.items():
            s_dim = student.config.hidden_size
            t_dim = teacher.config.hidden_size
            if s_dim != t_dim:
                self.projectors[s_layer] = torch.nn.Linear(s_dim, t_dim)

    def feature_loss(self, s_layer, student_hidden, teacher_hidden):
        # Project student features into the teacher's space when sizes differ
        if s_layer in self.projectors:
            student_hidden = self.projectors[s_layer](student_hidden)
        return F.mse_loss(student_hidden, teacher_hidden)
```
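The projector step above can be demonstrated in isolation with random hidden states; the hidden sizes (student 16, teacher 32) and batch/sequence shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Student and teacher hidden states at matched layers: (batch, seq, hidden)
student_hidden = torch.randn(4, 8, 16)
teacher_hidden = torch.randn(4, 8, 32)

# Linear projector maps student features into the teacher's hidden space
projector = torch.nn.Linear(16, 32)
projected = projector(student_hidden)

# Feature loss is a simple MSE between aligned representations
feat_loss = F.mse_loss(projected, teacher_hidden)
print(projected.shape, feat_loss.item())
```

The projector is trained jointly with the student and discarded after distillation; only the student's own weights ship to production.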

3. Synthetic Data Distillation

```python
from transformers import pipeline

def generate_training_data(teacher_model, prompts, num_samples=10000):
    """Use teacher to generate high-quality training data for student"""
    generator = pipeline('text-generation', model=teacher_model)
    training_pairs = []
    for prompt in prompts:
        responses = generator(
            prompt,
            max_new_tokens=512,
            num_return_sequences=3,
            temperature=0.7,
            do_sample=True,
        )
        for response in responses:
            training_pairs.append({
                'input': prompt,
                'output': response['generated_text'],
            })
    return training_pairs

# Then fine-tune student on this synthetic data
# This is how models like Alpaca and Vicuna were created
```
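Before fine-tuning, the synthetic pairs are typically deduplicated and rendered into instruction-style text. A minimal pure-Python sketch; the `### Instruction:` prompt template is one common convention, not a fixed standard.

```python
def prepare_pairs(training_pairs):
    """Deduplicate synthetic pairs and format them as instruction-style text."""
    seen = set()
    formatted = []
    for pair in training_pairs:
        # Exact-match dedup; fuzzy dedup (e.g. n-gram overlap) is also common
        key = (pair['input'], pair['output'])
        if key in seen:
            continue
        seen.add(key)
        formatted.append(
            f"### Instruction:\n{pair['input']}\n\n### Response:\n{pair['output']}"
        )
    return formatted

pairs = [
    {'input': 'Define KL divergence.', 'output': 'A measure of distribution difference.'},
    {'input': 'Define KL divergence.', 'output': 'A measure of distribution difference.'},
    {'input': 'What is temperature?', 'output': 'A softmax scaling factor.'},
]
texts = prepare_pairs(pairs)
print(len(texts))  # duplicate pair removed
```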

Distillation Strategy Comparison

| Method | Quality | Cost | Complexity | Best For |
|---|---|---|---|---|
| Logit distillation | High | High (needs teacher forward pass) | Medium | Same-architecture compression |
| Feature distillation | Very high | Very high | High | Cross-architecture transfer |
| Synthetic data | Good | Medium (one-time generation) | Low | API-only teachers (GPT-4) |
| Progressive distillation | Very high | Very high | High | Large compression ratios |
| Task-specific | Best for task | Low | Low | Single-task deployment |

Configuration Reference

| Parameter | Range | Description |
|---|---|---|
| temperature | 1.0 - 20.0 | Higher = softer probability distribution |
| alpha | 0.0 - 1.0 | Weight of soft loss vs hard loss |
| student_lr | 1e-5 - 1e-3 | Student learning rate (usually higher than fine-tuning) |
| teacher_model | - | Larger model providing supervision |
| student_model | - | Smaller model being trained |
| num_epochs | 3-10 | Training epochs |
| batch_size | 8-64 | Training batch size |
| max_seq_length | 512-4096 | Maximum sequence length |
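The table above maps naturally onto a single config dict. The values below are illustrative midpoints of the listed ranges, not tuned recommendations.

```python
# One possible starting configuration drawn from the parameter ranges above
distill_config = {
    'temperature': 3.0,       # within 1.0 - 20.0; softer targets than T=1
    'alpha': 0.5,             # equal weight to soft and hard targets
    'student_lr': 1e-4,       # within 1e-5 - 1e-3
    'num_epochs': 5,
    'batch_size': 32,
    'max_seq_length': 1024,
}

# Basic range validation before launching a run
assert 1.0 <= distill_config['temperature'] <= 20.0
assert 0.0 <= distill_config['alpha'] <= 1.0
print(distill_config)
```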

Best Practices

  1. Start with response-based distillation — Simplest and most effective for most cases
  2. Use temperature 3-5 — Lower temperatures lose teacher knowledge; higher ones add noise
  3. Generate diverse synthetic data — Vary prompts, temperature, and sampling strategies
  4. Validate on real tasks — Synthetic benchmarks may not reflect real-world performance
  5. Progressive distillation — Distill 70B → 30B → 7B rather than directly 70B → 7B
  6. Keep the student architecture similar — Same tokenizer and similar layer structure works best
  7. Use alpha 0.5 as starting point — Equal weight to soft and hard targets
  8. Evaluate catastrophic forgetting — Student may lose general abilities while gaining specific ones
  9. Combine with quantization — Distill first, then quantize for maximum compression
  10. Use teacher ensembles — Multiple teachers provide more robust supervision
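Practice 10 (teacher ensembles) can be implemented by averaging the teachers' softened probability distributions before computing the KL term. A sketch with random logits standing in for three teachers' outputs:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T = 3.0
student_logits = torch.randn(4, 10)
teacher_logits_list = [torch.randn(4, 10) for _ in range(3)]  # 3 teachers

# Average the teachers' softened probabilities into one target distribution
ensemble_probs = torch.stack(
    [F.softmax(t / T, dim=-1) for t in teacher_logits_list]
).mean(dim=0)

# Distill against the ensemble target exactly as with a single teacher
soft_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    ensemble_probs,
    reduction='batchmean',
) * T ** 2
print(soft_loss.item())
```

Averaging probabilities (rather than raw logits) keeps the target a valid distribution even when teachers have differently calibrated logit scales.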

Troubleshooting

Student performance plateaus early

```python
# Try curriculum learning — start with easy examples
# Sort training data by teacher confidence
teacher_confidence = torch.max(F.softmax(teacher_logits, dim=-1), dim=-1).values
sorted_indices = torch.argsort(teacher_confidence, descending=True)
# Train on high-confidence examples first
```

Temperature too high causes uniform distributions

```python
# Monitor KL divergence during training
kl_div = F.kl_div(
    F.log_softmax(student_logits / temp, dim=-1),
    F.softmax(teacher_logits / temp, dim=-1),
    reduction='batchmean',
)
# If KL is near 0, lower the temperature
if kl_div < 0.01:
    temperature *= 0.9
```

Out of memory with large teacher

```python
# Use gradient checkpointing for teacher forward pass
# Or use offline distillation: pre-compute teacher outputs
teacher_outputs = {}
with torch.no_grad():
    for i, batch in enumerate(dataloader):
        teacher_outputs[i] = teacher(batch).logits.cpu()
# Then train student using stored outputs
```