Emerging Techniques Knowledge Kit
Powerful skill for compressing large language models. Includes structured workflows, validation checks, and reusable patterns for AI research.
Knowledge Distillation for LLMs
Overview
A comprehensive skill for compressing large language models through knowledge distillation — transferring capabilities from large "teacher" models to smaller "student" models while retaining 90%+ performance. Covers temperature scaling, logit distillation, response-based distillation, feature-based distillation, and synthetic data generation approaches for creating efficient, deployable models.
When to Use
- Compressing large models (70B → 7B) for production deployment
- Reducing inference costs while maintaining quality
- Transferring proprietary model capabilities to open-source models
- Creating domain-specific smaller models
- Distilling task-specific knowledge from general-purpose models
- Building on-device models from cloud-scale teachers
Quick Start
```bash
# Install dependencies
pip install transformers torch datasets

# Basic distillation setup
pip install distillation-toolkit  # or use HuggingFace
```
```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=3.0, alpha=0.5):
    """Combined distillation + task loss"""
    # Soft targets from teacher (knowledge transfer)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction='batchmean',
    ) * (temperature ** 2)

    # Hard targets (standard cross-entropy)
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss
```
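A quick illustration of what the temperature does to the teacher's distribution (the logits here are toy values, not from a real model):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])  # toy teacher logits for 3 tokens
probs = {T: F.softmax(logits / T, dim=-1) for T in (1.0, 3.0, 10.0)}
for T, p in probs.items():
    print(f"T={T}: {p.tolist()}")
# Higher T flattens the distribution, exposing the teacher's relative
# preferences over non-top tokens ("dark knowledge").
```

This is why the soft loss is scaled by `temperature ** 2`: dividing the logits by T shrinks their gradients by roughly 1/T² during backprop, and the scaling restores their magnitude.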
Distillation Methods
1. Response-Based (Logit Distillation)
```python
class LogitDistiller:
    def __init__(self, teacher, student, temperature=3.0, alpha=0.5):
        self.teacher = teacher.eval()
        self.student = student
        self.temperature = temperature
        self.alpha = alpha

    def train_step(self, batch):
        inputs = batch['input_ids']
        labels = batch['labels']

        # Get teacher predictions (no gradient needed)
        with torch.no_grad():
            teacher_logits = self.teacher(inputs).logits

        # Get student predictions
        student_logits = self.student(inputs).logits

        # Combined loss
        loss = distillation_loss(
            student_logits, teacher_logits, labels,
            self.temperature, self.alpha
        )
        return loss
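The train step drops into an ordinary optimizer loop. A runnable toy version, where `TinyLM` is an illustrative stand-in (not a real LLM) whose forward pass mimics the `.logits` attribute of a HuggingFace model output:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from types import SimpleNamespace

torch.manual_seed(0)
VOCAB = 16

class TinyLM(nn.Module):
    """Toy stand-in for a causal LM: returns an object with a .logits field."""
    def __init__(self, hidden):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, hidden)
        self.head = nn.Linear(hidden, VOCAB)

    def forward(self, input_ids):
        return SimpleNamespace(logits=self.head(self.emb(input_ids)))

teacher = TinyLM(32).eval()   # "large" teacher
student = TinyLM(8)           # "small" student

def distillation_loss(s_logits, t_logits, labels, temperature=3.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction='batchmean',
    ) * temperature ** 2
    # Flatten (batch, seq, vocab) for token-level cross-entropy
    hard = F.cross_entropy(s_logits.view(-1, VOCAB), labels.view(-1))
    return alpha * soft + (1 - alpha) * hard

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-2)
input_ids = torch.randint(0, VOCAB, (4, 6))
labels = torch.randint(0, VOCAB, (4, 6))

losses = []
for step in range(30):
    with torch.no_grad():
        t_logits = teacher(input_ids).logits
    s_logits = student(input_ids).logits
    loss = distillation_loss(s_logits, t_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

The loss should fall steadily as the student overfits this fixed toy batch; with real models the same loop runs over a dataloader with a much lower learning rate.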
2. Feature-Based Distillation
```python
class FeatureDistiller:
    """Match intermediate representations between teacher and student"""
    def __init__(self, teacher, student, layer_mapping):
        self.teacher = teacher.eval()
        self.student = student
        self.layer_mapping = layer_mapping  # {student_layer: teacher_layer}

        # Linear projectors bridge mismatched hidden sizes
        self.projectors = {}
        for s_layer, t_layer in layer_mapping.items():
            s_dim = student.config.hidden_size
            t_dim = teacher.config.hidden_size
            if s_dim != t_dim:
                self.projectors[s_layer] = torch.nn.Linear(s_dim, t_dim)

    def feature_loss(self, s_layer, student_hidden, teacher_hidden):
        # Project student features into the teacher's space when sizes differ
        if s_layer in self.projectors:
            student_hidden = self.projectors[s_layer](student_hidden)
        return F.mse_loss(student_hidden, teacher_hidden)
```
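With HuggingFace models, the intermediate states come from `model(input_ids, output_hidden_states=True).hidden_states`. A minimal sketch of the matching step itself, using toy tensors in place of real hidden states (all dimensions are illustrative):

```python
import torch
import torch.nn.functional as F

# Toy hidden states: batch=2, seq=5; student width 64, teacher width 128.
student_hidden = torch.randn(2, 5, 64)
teacher_hidden = torch.randn(2, 5, 128)

# Project student features into the teacher's hidden size before comparing.
projector = torch.nn.Linear(64, 128)
feat_loss = F.mse_loss(projector(student_hidden), teacher_hidden)
```

In practice this MSE term is added to the response-based loss with its own weight, and the projector's parameters are trained jointly with the student.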
3. Synthetic Data Distillation
```python
from transformers import pipeline

def generate_training_data(teacher_model, prompts, num_samples=10000):
    """Use teacher to generate high-quality training data for student"""
    generator = pipeline('text-generation', model=teacher_model)

    training_pairs = []
    for prompt in prompts:
        responses = generator(
            prompt,
            max_new_tokens=512,
            num_return_sequences=3,
            temperature=0.7,
            do_sample=True,
        )
        for response in responses:
            training_pairs.append({
                'input': prompt,
                'output': response['generated_text'],
            })
    return training_pairs

# Then fine-tune student on this synthetic data
# This is how models like Alpaca and Vicuna were created
```
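The generated pairs then need to be rendered into training text for the student. One common shape is an instruction/response template; the template below is an assumption for illustration, not something this skill prescribes:

```python
def format_for_sft(training_pairs, eos_token="</s>"):
    """Turn teacher-generated (input, output) pairs into plain-text SFT examples."""
    return [
        f"### Instruction:\n{pair['input']}\n\n### Response:\n{pair['output']}{eos_token}"
        for pair in training_pairs
    ]

pairs = [{'input': 'Explain distillation.', 'output': 'It compresses a model.'}]
examples = format_for_sft(pairs)
```

Whatever template you choose, use the same one at inference time, and end each example with the tokenizer's actual EOS token so the student learns to stop.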
Distillation Strategy Comparison
| Method | Quality | Cost | Complexity | Best For |
|---|---|---|---|---|
| Logit distillation | High | High (need teacher forward pass) | Medium | Same-architecture compression |
| Feature distillation | Very high | Very high | High | Cross-architecture transfer |
| Synthetic data | Good | Medium (one-time generation) | Low | API-only teachers (GPT-4) |
| Progressive distillation | Very high | Very high | High | Large compression ratios |
| Task-specific | Best for task | Low | Low | Single-task deployment |
Configuration Reference
| Parameter | Range | Description |
|---|---|---|
| `temperature` | 1.0 - 20.0 | Higher = softer probability distribution |
| `alpha` | 0.0 - 1.0 | Weight of soft loss vs hard loss |
| `student_lr` | 1e-5 - 1e-3 | Student learning rate (usually higher than for plain fine-tuning) |
| `teacher_model` | - | Larger model providing supervision |
| `student_model` | - | Smaller model being trained |
| `num_epochs` | 3-10 | Training epochs |
| `batch_size` | 8-64 | Training batch size |
| `max_seq_length` | 512-4096 | Maximum sequence length |
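The scalar parameters above can be collected into a single config object; the dataclass itself is illustrative, with defaults chosen from the recommended starting points:

```python
from dataclasses import dataclass

@dataclass
class DistillationConfig:
    temperature: float = 4.0    # 1.0-20.0; higher = softer targets
    alpha: float = 0.5          # weight on soft vs hard loss
    student_lr: float = 1e-4    # usually higher than plain fine-tuning
    num_epochs: int = 5         # 3-10
    batch_size: int = 32        # 8-64
    max_seq_length: int = 2048  # 512-4096

cfg = DistillationConfig()
```

Keeping these in one place makes sweeps over `temperature` and `alpha` (the two most impactful knobs) easy to log and reproduce.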
Best Practices
- Start with response-based distillation — Simplest and most effective for most cases
- Use temperature 3-5 — Lower temperatures lose teacher knowledge; higher ones add noise
- Generate diverse synthetic data — Vary prompts, temperature, and sampling strategies
- Validate on real tasks — Synthetic benchmarks may not reflect real-world performance
- Progressive distillation — Distill 70B → 30B → 7B rather than directly 70B → 7B
- Keep the student architecture similar — Same tokenizer and similar layer structure works best
- Use alpha 0.5 as starting point — Equal weight to soft and hard targets
- Evaluate catastrophic forgetting — Student may lose general abilities while gaining specific ones
- Combine with quantization — Distill first, then quantize for maximum compression
- Use teacher ensembles — Multiple teachers provide more robust supervision
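The teacher-ensemble tip can be implemented by averaging the teachers' softened probability distributions rather than their raw logits, since differently trained teachers may have differently scaled logits. A sketch on toy tensors (the helper name is ours):

```python
import torch
import torch.nn.functional as F

def ensemble_soft_targets(teacher_logits_list, temperature=3.0):
    """Average each teacher's softened distribution into one KL target."""
    probs = [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    return torch.stack(probs).mean(dim=0)

# Two toy teachers: batch of 4 examples over a 10-token vocabulary
t1 = torch.randn(4, 10)
t2 = torch.randn(4, 10)
soft_targets = ensemble_soft_targets([t1, t2])
```

The result is still a valid probability distribution per example, so it can replace the single-teacher `F.softmax(teacher_logits / temperature, ...)` term in the KL loss directly.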
Troubleshooting
Student performance plateaus early
```python
# Try curriculum learning: start with easy examples
# Sort training data by teacher confidence
teacher_confidence = torch.max(F.softmax(teacher_logits, dim=-1), dim=-1).values
sorted_indices = torch.argsort(teacher_confidence, descending=True)
# Train on high-confidence examples first
```
Temperature too high causes uniform distributions
```python
# Monitor KL divergence during training
kl_div = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=-1),
    F.softmax(teacher_logits / temperature, dim=-1),
    reduction='batchmean',
)

# If KL is near 0, the distributions are nearly uniform; lower the temperature
if kl_div < 0.01:
    temperature *= 0.9
```
Out of memory with large teacher
```python
# Use gradient checkpointing for the teacher forward pass,
# or use offline distillation: pre-compute teacher outputs once
teacher_outputs = {}
with torch.no_grad():
    for i, batch in enumerate(dataloader):
        teacher_outputs[i] = teacher(batch['input_ids']).logits.cpu()
# Then train the student using the stored outputs
```