Emerging Techniques Knowledge Kit
Powerful skill for compressing large language models. Includes structured workflows, validation checks, and reusable patterns for AI research.
Knowledge Distillation for LLMs
Overview
A comprehensive skill for compressing large language models through knowledge distillation — transferring capabilities from large "teacher" models to smaller "student" models while retaining 90%+ performance. Covers temperature scaling, logit distillation, response-based distillation, feature-based distillation, and synthetic data generation approaches for creating efficient, deployable models.
When to Use
- Compressing large models (70B → 7B) for production deployment
- Reducing inference costs while maintaining quality
- Transferring proprietary model capabilities to open-source models
- Creating domain-specific smaller models
- Distilling task-specific knowledge from general-purpose models
- Building on-device models from cloud-scale teachers
Quick Start
```bash
# Install dependencies
pip install transformers torch datasets

# Basic distillation setup
pip install distillation-toolkit  # or use HuggingFace
```
```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=3.0, alpha=0.5):
    """Combined distillation + task loss"""
    # Soft targets from teacher (knowledge transfer)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction='batchmean',
    ) * (temperature ** 2)

    # Hard targets (standard cross-entropy)
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss
```
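A quick illustration of what the temperature does to the teacher's distribution (the logits here are toy values, not from a real model):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])  # toy teacher logits for 3 tokens
probs = {T: F.softmax(logits / T, dim=-1) for T in (1.0, 3.0, 10.0)}
for T, p in probs.items():
    print(f"T={T}: {p.tolist()}")
# Higher T flattens the distribution, exposing the teacher's relative
# preferences over non-top tokens ("dark knowledge").
```

This is why the soft loss is scaled by `temperature ** 2`: dividing the logits by T shrinks their gradients by roughly 1/T² during backprop, and the scaling restores their magnitude.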
Distillation Methods
1. Response-Based (Logit Distillation)
```python
class LogitDistiller:
    def __init__(self, teacher, student, temperature=3.0, alpha=0.5):
        self.teacher = teacher.eval()
        self.student = student
        self.temperature = temperature
        self.alpha = alpha

    def train_step(self, batch):
        inputs = batch['input_ids']
        labels = batch['labels']

        # Get teacher predictions (no gradient needed)
        with torch.no_grad():
            teacher_logits = self.teacher(inputs).logits

        # Get student predictions
        student_logits = self.student(inputs).logits

        # Combined loss
        loss = distillation_loss(
            student_logits, teacher_logits, labels,
            self.temperature, self.alpha
        )
        return loss
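The train step drops into an ordinary optimizer loop. A runnable toy version, where `TinyLM` is an illustrative stand-in (not a real LLM) whose forward pass mimics the `.logits` attribute of a HuggingFace model output:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from types import SimpleNamespace

torch.manual_seed(0)
VOCAB = 16

class TinyLM(nn.Module):
    """Toy stand-in for a causal LM: returns an object with a .logits field."""
    def __init__(self, hidden):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, hidden)
        self.head = nn.Linear(hidden, VOCAB)

    def forward(self, input_ids):
        return SimpleNamespace(logits=self.head(self.emb(input_ids)))

teacher = TinyLM(32).eval()   # "large" teacher
student = TinyLM(8)           # "small" student

def distillation_loss(s_logits, t_logits, labels, temperature=3.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction='batchmean',
    ) * temperature ** 2
    # Flatten (batch, seq, vocab) for token-level cross-entropy
    hard = F.cross_entropy(s_logits.view(-1, VOCAB), labels.view(-1))
    return alpha * soft + (1 - alpha) * hard

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-2)
input_ids = torch.randint(0, VOCAB, (4, 6))
labels = torch.randint(0, VOCAB, (4, 6))

losses = []
for step in range(30):
    with torch.no_grad():
        t_logits = teacher(input_ids).logits
    s_logits = student(input_ids).logits
    loss = distillation_loss(s_logits, t_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

The loss should fall steadily as the student overfits this fixed toy batch; with real models the same loop runs over a dataloader with a much lower learning rate.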
2. Feature-Based Distillation
```python
class FeatureDistiller:
    """Match intermediate representations between teacher and student"""
    def __init__(self, teacher, student, layer_mapping):
        self.teacher = teacher.eval()
        self.student = student
        self.layer_mapping = layer_mapping  # {student_layer: teacher_layer}

        # Linear projectors bridge mismatched hidden sizes
        self.projectors = {}
        for s_layer, t_layer in layer_mapping.items():
            s_dim = student.config.hidden_size
            t_dim = teacher.config.hidden_size
            if s_dim != t_dim:
                self.projectors[s_layer] = torch.nn.Linear(s_dim, t_dim)

    def feature_loss(self, s_layer, student_hidden, teacher_hidden):
        # Project student features into the teacher's space when sizes differ
        if s_layer in self.projectors:
            student_hidden = self.projectors[s_layer](student_hidden)
        return F.mse_loss(student_hidden, teacher_hidden)
```
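With HuggingFace models, the intermediate states come from `model(input_ids, output_hidden_states=True).hidden_states`. A minimal sketch of the matching step itself, using toy tensors in place of real hidden states (all dimensions are illustrative):

```python
import torch
import torch.nn.functional as F

# Toy hidden states: batch=2, seq=5; student width 64, teacher width 128.
student_hidden = torch.randn(2, 5, 64)
teacher_hidden = torch.randn(2, 5, 128)

# Project student features into the teacher's hidden size before comparing.
projector = torch.nn.Linear(64, 128)
feat_loss = F.mse_loss(projector(student_hidden), teacher_hidden)
```

In practice this MSE term is added to the response-based loss with its own weight, and the projector's parameters are trained jointly with the student.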
3. Synthetic Data Distillation
```python
from transformers import pipeline

def generate_training_data(teacher_model, prompts, num_samples=10000):
    """Use teacher to generate high-quality training data for student"""
    generator = pipeline('text-generation', model=teacher_model)

    training_pairs = []
    for prompt in prompts:
        responses = generator(
            prompt,
            max_new_tokens=512,
            num_return_sequences=3,
            temperature=0.7,
            do_sample=True,
        )
        for response in responses:
            training_pairs.append({
                'input': prompt,
                'output': response['generated_text'],
            })
    return training_pairs

# Then fine-tune student on this synthetic data
# This is how models like Alpaca and Vicuna were created
```
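The generated pairs then need to be rendered into training text for the student. One common shape is an instruction/response template; the template below is an assumption for illustration, not something this skill prescribes:

```python
def format_for_sft(training_pairs, eos_token="</s>"):
    """Turn teacher-generated (input, output) pairs into plain-text SFT examples."""
    return [
        f"### Instruction:\n{pair['input']}\n\n### Response:\n{pair['output']}{eos_token}"
        for pair in training_pairs
    ]

pairs = [{'input': 'Explain distillation.', 'output': 'It compresses a model.'}]
examples = format_for_sft(pairs)
```

Whatever template you choose, use the same one at inference time, and end each example with the tokenizer's actual EOS token so the student learns to stop.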
Distillation Strategy Comparison
| Method | Quality | Cost | Complexity | Best For |
|---|---|---|---|---|
| Logit distillation | High | High (need teacher forward pass) | Medium | Same-architecture compression |
| Feature distillation | Very high | Very high | High | Cross-architecture transfer |
| Synthetic data | Good | Medium (one-time generation) | Low | API-only teachers (GPT-4) |
| Progressive distillation | Very high | Very high | High | Large compression ratios |
| Task-specific | Best for task | Low | Low | Single-task deployment |
Configuration Reference
| Parameter | Range | Description |
|---|---|---|
| `temperature` | 1.0 - 20.0 | Higher = softer probability distribution |
| `alpha` | 0.0 - 1.0 | Weight of soft loss vs hard loss |
| `student_lr` | 1e-5 - 1e-3 | Student learning rate (usually higher than for plain fine-tuning) |
| `teacher_model` | - | Larger model providing supervision |
| `student_model` | - | Smaller model being trained |
| `num_epochs` | 3-10 | Training epochs |
| `batch_size` | 8-64 | Training batch size |
| `max_seq_length` | 512-4096 | Maximum sequence length |
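The scalar parameters above can be collected into a single config object; the dataclass itself is illustrative, with defaults chosen from the recommended starting points:

```python
from dataclasses import dataclass

@dataclass
class DistillationConfig:
    temperature: float = 4.0    # 1.0-20.0; higher = softer targets
    alpha: float = 0.5          # weight on soft vs hard loss
    student_lr: float = 1e-4    # usually higher than plain fine-tuning
    num_epochs: int = 5         # 3-10
    batch_size: int = 32        # 8-64
    max_seq_length: int = 2048  # 512-4096

cfg = DistillationConfig()
```

Keeping these in one place makes sweeps over `temperature` and `alpha` (the two most impactful knobs) easy to log and reproduce.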
Best Practices
- Start with response-based distillation — Simplest and most effective for most cases
- Use temperature 3-5 — Lower temperatures lose teacher knowledge; higher ones add noise
- Generate diverse synthetic data — Vary prompts, temperature, and sampling strategies
- Validate on real tasks — Synthetic benchmarks may not reflect real-world performance
- Progressive distillation — Distill 70B → 30B → 7B rather than directly 70B → 7B
- Keep the student architecture similar — Same tokenizer and similar layer structure works best
- Use alpha 0.5 as starting point — Equal weight to soft and hard targets
- Evaluate catastrophic forgetting — Student may lose general abilities while gaining specific ones
- Combine with quantization — Distill first, then quantize for maximum compression
- Use teacher ensembles — Multiple teachers provide more robust supervision
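The teacher-ensemble tip can be implemented by averaging the teachers' softened probability distributions rather than their raw logits, since differently trained teachers may have differently scaled logits. A sketch on toy tensors (the helper name is ours):

```python
import torch
import torch.nn.functional as F

def ensemble_soft_targets(teacher_logits_list, temperature=3.0):
    """Average each teacher's softened distribution into one KL target."""
    probs = [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    return torch.stack(probs).mean(dim=0)

# Two toy teachers: batch of 4 examples over a 10-token vocabulary
t1 = torch.randn(4, 10)
t2 = torch.randn(4, 10)
soft_targets = ensemble_soft_targets([t1, t2])
```

The result is still a valid probability distribution per example, so it can replace the single-teacher `F.softmax(teacher_logits / temperature, ...)` term in the KL loss directly.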
Troubleshooting
Student performance plateaus early
```python
# Try curriculum learning: start with easy examples
# Sort training data by teacher confidence
teacher_confidence = torch.max(F.softmax(teacher_logits, dim=-1), dim=-1).values
sorted_indices = torch.argsort(teacher_confidence, descending=True)
# Train on high-confidence examples first
```
Temperature too high causes uniform distributions
```python
# Monitor KL divergence during training
kl_div = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=-1),
    F.softmax(teacher_logits / temperature, dim=-1),
    reduction='batchmean',
)

# If KL is near 0, the distributions are nearly uniform; lower the temperature
if kl_div < 0.01:
    temperature *= 0.9
```
Out of memory with large teacher
```python
# Use gradient checkpointing for the teacher forward pass,
# or use offline distillation: pre-compute teacher outputs once
teacher_outputs = {}
with torch.no_grad():
    for i, batch in enumerate(dataloader):
        teacher_outputs[i] = teacher(batch['input_ids']).logits.cpu()
# Then train the student using the stored outputs
```