Smart Emerging ML Techniques
Overview
A comprehensive skill for intelligently selecting and applying emerging machine learning techniques based on specific use cases and constraints. Covers an adaptive framework for choosing between knowledge distillation, quantization, pruning, model merging, mixture of experts, and other optimization techniques, with decision trees, benchmarks, and implementation guidance for each approach.
When to Use
- Deciding which optimization technique to apply
- Need to reduce model size for production deployment
- Optimizing inference latency or throughput
- Balancing model quality against compute/memory constraints
- Evaluating tradeoffs between techniques
- Building optimization pipelines that combine multiple techniques
Quick Start
```bash
# Install the libraries used throughout this skill
pip install transformers torch optimum auto-gptq bitsandbytes
```

```python
# Quick model analysis: parameter count and size at common precisions
from transformers import AutoModel

model = AutoModel.from_pretrained('your-model')
params = sum(p.numel() for p in model.parameters())
print(f'Parameters: {params/1e9:.1f}B')
print(f'FP16 size: {params * 2 / 1e9:.1f}GB')
print(f'INT8 size: {params / 1e9:.1f}GB')
print(f'INT4 size: {params * 0.5 / 1e9:.1f}GB')
```
Decision Framework
When to Use Each Technique
```
Need to reduce model size?
└── Yes → How much compression needed?
    ├── 2x   → Quantization (INT8/FP8)
    ├── 4x   → Quantization (INT4/GPTQ/AWQ)
    ├── 10x  → Knowledge Distillation
    └── 50x+ → Distillation + Quantization + Pruning

Need faster inference?
└── Yes → What's the bottleneck?
    ├── Compute → Speculative Decoding (2-3x speedup)
    ├── Memory  → Quantization + KV Cache optimization
    └── Latency → Smaller model or distillation

Need better quality?
└── Yes → What resources are available?
    ├── Multiple models → Model Merging (SLERP/TIES)
    ├── Large teacher   → Knowledge Distillation
    └── Unlabeled data  → Self-training / RLHF
```
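For pipelines that need to choose a technique programmatically, the branching above can be encoded as a small lookup function. This is an illustrative sketch only; the function name, goal labels, and thresholds are assumptions, not a published API:

```python
def recommend(goal, detail):
    """Map an optimization goal to a technique, mirroring the decision tree."""
    if goal == 'size':  # detail = required compression factor
        if detail <= 2:
            return 'INT8/FP8 quantization'
        if detail <= 4:
            return 'INT4 quantization (GPTQ/AWQ)'
        if detail <= 10:
            return 'knowledge distillation'
        return 'distillation + quantization + pruning'
    if goal == 'speed':  # detail = the bottleneck
        return {'compute': 'speculative decoding',
                'memory': 'quantization + KV cache optimization',
                'latency': 'smaller model or distillation'}[detail]
    if goal == 'quality':  # detail = the available resource
        return {'multiple models': 'model merging (SLERP/TIES)',
                'large teacher': 'knowledge distillation',
                'unlabeled data': 'self-training / RLHF'}[detail]
    raise ValueError(f'unknown goal: {goal}')
```

A hard cutoff per compression factor is a simplification; in practice the branches overlap and the compatibility matrix below constrains which combinations stack well.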
Technique Compatibility Matrix
| Technique | + Quantization | + Pruning | + Distillation | + Merging |
|---|---|---|---|---|
| Quantization | - | Good | Excellent | Good |
| Pruning | Good | - | Good | Poor |
| Distillation | Excellent | Good | - | N/A |
| Merging | Good | Poor | N/A | - |
| MoE | Good | Fair | Good | Poor |
| Speculative | Excellent | N/A | Good | N/A |
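Because the matrix is symmetric, it can be stored once per unordered pair and queried from either direction. A minimal sketch (the `COMPATIBILITY` table and helper are hypothetical, transcribed from the matrix above):

```python
# Compatibility ratings from the matrix above, keyed by unordered pair.
COMPATIBILITY = {
    frozenset({'quantization', 'pruning'}): 'Good',
    frozenset({'quantization', 'distillation'}): 'Excellent',
    frozenset({'quantization', 'merging'}): 'Good',
    frozenset({'quantization', 'moe'}): 'Good',
    frozenset({'quantization', 'speculative'}): 'Excellent',
    frozenset({'pruning', 'distillation'}): 'Good',
    frozenset({'pruning', 'merging'}): 'Poor',
    frozenset({'pruning', 'moe'}): 'Fair',
    frozenset({'distillation', 'moe'}): 'Good',
    frozenset({'distillation', 'speculative'}): 'Good',
    frozenset({'merging', 'moe'}): 'Poor',
}

def compatibility(a, b):
    """Return the rating for combining two techniques ('N/A' if untested)."""
    return COMPATIBILITY.get(frozenset({a, b}), 'N/A')
```

A pipeline builder can call this before stacking steps and refuse 'Poor' or 'N/A' combinations.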
Implementation Patterns
Optimization Pipeline
```python
class ModelOptimizer:
    def __init__(self, model_name):
        self.model_name = model_name
        self.pipeline = []

    def add_step(self, technique, config):
        self.pipeline.append((technique, config))
        return self  # allow chaining

    def optimize(self):
        model = load_model(self.model_name)
        for technique, config in self.pipeline:
            if technique == 'distill':
                model = self.distill(model, config)
            elif technique == 'quantize':
                model = self.quantize(model, config)
            elif technique == 'prune':
                model = self.prune(model, config)
            # Validate after each step
            score = self.evaluate(model)
            print(f"After {technique}: quality={score:.3f}")
            if score < config.get('min_quality', 0.8):
                raise ValueError(f"Quality dropped below threshold after {technique}")
        return model

# Usage
optimizer = ModelOptimizer("meta-llama/Llama-3-70B")
optimizer.add_step('distill', {'teacher': '70B', 'student': '7B', 'min_quality': 0.9})
optimizer.add_step('quantize', {'bits': 4, 'method': 'awq', 'min_quality': 0.85})
optimized = optimizer.optimize()
```
Benchmark Framework
```python
import time

import torch

def benchmark_model(model, tokenizer, prompts,
                    metrics=('latency', 'throughput', 'quality')):
    results = {}

    # Latency (time to first token)
    if 'latency' in metrics:
        start = time.perf_counter()
        model.generate(tokenizer.encode(prompts[0], return_tensors='pt'),
                       max_new_tokens=1)
        results['ttft_ms'] = (time.perf_counter() - start) * 1000

    # Throughput (tokens per second)
    if 'throughput' in metrics:
        start = time.perf_counter()
        total_tokens = 0
        for prompt in prompts[:10]:
            out = model.generate(tokenizer.encode(prompt, return_tensors='pt'),
                                 max_new_tokens=100)
            total_tokens += out.shape[-1]
        elapsed = time.perf_counter() - start
        results['tokens_per_sec'] = total_tokens / elapsed

    # Memory
    results['gpu_memory_gb'] = torch.cuda.max_memory_allocated() / 1e9
    results['model_params_b'] = sum(p.numel() for p in model.parameters()) / 1e9
    return results
```
Performance Reference
| Technique | Speed Gain | Size Reduction | Quality Impact | Effort |
|---|---|---|---|---|
| INT8 Quantization | 1.5-2x | 2x | <1% loss | Low |
| INT4 Quantization | 2-3x | 4x | 1-3% loss | Low |
| Knowledge Distillation | Variable | 10-50x | 2-10% loss | High |
| Structured Pruning | 1.5-3x | 2-5x | 2-5% loss | Medium |
| Speculative Decoding | 2-3x | None | 0% loss | Medium |
| Model Merging | None | None | Varies | Low |
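The size-reduction column translates directly into memory estimates. A minimal sketch, assuming an FP16 baseline at 2 bytes per parameter and the reduction factors from the table (the function and dict names are illustrative):

```python
# Size-reduction factors relative to FP16, from the table above
REDUCTION = {'none': 1, 'int8': 2, 'int4': 4}

def estimated_size_gb(params_billions, technique='none'):
    """Estimate model size in GB after quantization, from an FP16 baseline."""
    fp16_gb = params_billions * 2  # 2 bytes per parameter
    return fp16_gb / REDUCTION[technique]
```

For example, a 70B model drops from roughly 140 GB at FP16 to about 35 GB at INT4, which is what makes single-node deployment feasible.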
Best Practices
- Measure before optimizing ā Profile your model to identify actual bottlenecks
- Set quality thresholds upfront ā Define minimum acceptable quality before starting
- Optimize incrementally ā Apply one technique at a time, measure, then stack
- Use task-specific evaluation ā Generic benchmarks may not reflect your use case
- Consider the full pipeline ā Include tokenization, preprocessing, and postprocessing in benchmarks
- Test on representative data ā Edge cases often reveal quality degradation
- Automate benchmarking ā Run evals automatically after each optimization step
- Document tradeoffs ā Record quality/speed/memory for each configuration
- Keep the original model ā Always maintain an unoptimized baseline for comparison
- Monitor in production ā Quality may degrade differently on real traffic vs benchmarks
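The "document tradeoffs" and "automate benchmarking" practices above pair naturally: log one record per configuration so quality, speed, and memory are always compared side by side. A minimal sketch (the record fields and helper name are assumptions):

```python
from dataclasses import dataclass, asdict

@dataclass
class TradeoffRecord:
    """One row per optimized configuration: quality, speed, and memory together."""
    config: str
    quality: float         # task-specific eval score, not a generic benchmark
    tokens_per_sec: float
    gpu_memory_gb: float

def log_tradeoff(records, config, quality, tps, mem):
    """Append a record and return it as a plain dict for reporting."""
    records.append(TradeoffRecord(config, quality, tps, mem))
    return asdict(records[-1])
```

Keeping the unoptimized baseline as the first record makes every later row directly comparable to it.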
Troubleshooting
Quality dropped more than expected after quantization
```python
# Try calibration with representative data
from auto_gptq import AutoGPTQForCausalLM

quantized = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config=quantize_config,
)
# Calibrate with diverse, representative data
quantized.quantize(calibration_dataset)
```
Combined techniques cause cascading quality loss
```python
# Evaluate quality after EACH step, not just at the end
for i, (technique, config) in enumerate(pipeline):
    model = apply_technique(model, technique, config)
    score = evaluate(model, eval_dataset)
    print(f"Step {i} ({technique}): score={score:.4f}")
    if score < threshold:
        print(f"STOP: Quality below threshold at step {i}")
        model = rollback(i - 1)  # Revert to previous step
        break
```