Smart Emerging ML Techniques
Overview
A comprehensive skill for intelligently selecting and applying emerging machine learning techniques based on specific use cases and constraints. Covers an adaptive framework for choosing between knowledge distillation, quantization, pruning, model merging, mixture of experts, and other optimization techniques, with decision trees, benchmarks, and implementation guidance for each approach.
When to Use
- Deciding which optimization technique to apply
- Need to reduce model size for production deployment
- Optimizing inference latency or throughput
- Balancing model quality against compute/memory constraints
- Evaluating tradeoffs between techniques
- Building optimization pipelines that combine multiple techniques
Quick Start
```bash
# Install the libraries used throughout this skill
pip install transformers torch optimum auto-gptq bitsandbytes
```

```python
# Quick model analysis: parameter count and size at common precisions
from transformers import AutoModel

model = AutoModel.from_pretrained('your-model')
params = sum(p.numel() for p in model.parameters())
print(f'Parameters: {params/1e9:.1f}B')
print(f'FP16 size: {params * 2 / 1e9:.1f}GB')
print(f'INT8 size: {params / 1e9:.1f}GB')
print(f'INT4 size: {params * 0.5 / 1e9:.1f}GB')
```
Decision Framework
When to Use Each Technique
```
Need to reduce model size?
└── Yes → How much compression needed?
    ├── 2x   → Quantization (INT8/FP8)
    ├── 4x   → Quantization (INT4/GPTQ/AWQ)
    ├── 10x  → Knowledge Distillation
    └── 50x+ → Distillation + Quantization + Pruning

Need faster inference?
└── Yes → What's the bottleneck?
    ├── Compute → Speculative Decoding (2-3x speedup)
    ├── Memory  → Quantization + KV Cache optimization
    └── Latency → Smaller model or distillation

Need better quality?
└── Yes → What resources are available?
    ├── Multiple models → Model Merging (SLERP/TIES)
    ├── Large teacher   → Knowledge Distillation
    └── Unlabeled data  → Self-training / RLHF
```
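For pipelines that need to choose a technique programmatically, the branching above can be encoded as a small lookup function. This is an illustrative sketch only; the function name, goal labels, and thresholds are assumptions, not a published API:

```python
def recommend(goal, detail):
    """Map an optimization goal to a technique, mirroring the decision tree."""
    if goal == 'size':  # detail = required compression factor
        if detail <= 2:
            return 'INT8/FP8 quantization'
        if detail <= 4:
            return 'INT4 quantization (GPTQ/AWQ)'
        if detail <= 10:
            return 'knowledge distillation'
        return 'distillation + quantization + pruning'
    if goal == 'speed':  # detail = the bottleneck
        return {'compute': 'speculative decoding',
                'memory': 'quantization + KV cache optimization',
                'latency': 'smaller model or distillation'}[detail]
    if goal == 'quality':  # detail = the available resource
        return {'multiple models': 'model merging (SLERP/TIES)',
                'large teacher': 'knowledge distillation',
                'unlabeled data': 'self-training / RLHF'}[detail]
    raise ValueError(f'unknown goal: {goal}')
```

A hard cutoff per compression factor is a simplification; in practice the branches overlap and the compatibility matrix below constrains which combinations stack well.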
Technique Compatibility Matrix
| Technique | + Quantization | + Pruning | + Distillation | + Merging |
|---|---|---|---|---|
| Quantization | - | Good | Excellent | Good |
| Pruning | Good | - | Good | Poor |
| Distillation | Excellent | Good | - | N/A |
| Merging | Good | Poor | N/A | - |
| MoE | Good | Fair | Good | Poor |
| Speculative | Excellent | N/A | Good | N/A |
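Because the matrix is symmetric, it can be stored once per unordered pair and queried from either direction. A minimal sketch (the `COMPATIBILITY` table and helper are hypothetical, transcribed from the matrix above):

```python
# Compatibility ratings from the matrix above, keyed by unordered pair.
COMPATIBILITY = {
    frozenset({'quantization', 'pruning'}): 'Good',
    frozenset({'quantization', 'distillation'}): 'Excellent',
    frozenset({'quantization', 'merging'}): 'Good',
    frozenset({'quantization', 'moe'}): 'Good',
    frozenset({'quantization', 'speculative'}): 'Excellent',
    frozenset({'pruning', 'distillation'}): 'Good',
    frozenset({'pruning', 'merging'}): 'Poor',
    frozenset({'pruning', 'moe'}): 'Fair',
    frozenset({'distillation', 'moe'}): 'Good',
    frozenset({'distillation', 'speculative'}): 'Good',
    frozenset({'merging', 'moe'}): 'Poor',
}

def compatibility(a, b):
    """Return the rating for combining two techniques ('N/A' if untested)."""
    return COMPATIBILITY.get(frozenset({a, b}), 'N/A')
```

A pipeline builder can call this before stacking steps and refuse 'Poor' or 'N/A' combinations.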
Implementation Patterns
Optimization Pipeline
```python
class ModelOptimizer:
    def __init__(self, model_name):
        self.model_name = model_name
        self.pipeline = []

    def add_step(self, technique, config):
        self.pipeline.append((technique, config))
        return self  # allow chaining

    def optimize(self):
        model = load_model(self.model_name)
        for technique, config in self.pipeline:
            if technique == 'distill':
                model = self.distill(model, config)
            elif technique == 'quantize':
                model = self.quantize(model, config)
            elif technique == 'prune':
                model = self.prune(model, config)
            # Validate after each step
            score = self.evaluate(model)
            print(f"After {technique}: quality={score:.3f}")
            if score < config.get('min_quality', 0.8):
                raise ValueError(f"Quality dropped below threshold after {technique}")
        return model

# Usage
optimizer = ModelOptimizer("meta-llama/Llama-3-70B")
optimizer.add_step('distill', {'teacher': '70B', 'student': '7B', 'min_quality': 0.9})
optimizer.add_step('quantize', {'bits': 4, 'method': 'awq', 'min_quality': 0.85})
optimized = optimizer.optimize()
```
Benchmark Framework
```python
import time

import torch

def benchmark_model(model, tokenizer, prompts,
                    metrics=('latency', 'throughput', 'quality')):
    results = {}

    # Latency (time to first token)
    if 'latency' in metrics:
        start = time.perf_counter()
        model.generate(tokenizer.encode(prompts[0], return_tensors='pt'),
                       max_new_tokens=1)
        results['ttft_ms'] = (time.perf_counter() - start) * 1000

    # Throughput (tokens per second)
    if 'throughput' in metrics:
        start = time.perf_counter()
        total_tokens = 0
        for prompt in prompts[:10]:
            out = model.generate(tokenizer.encode(prompt, return_tensors='pt'),
                                 max_new_tokens=100)
            total_tokens += out.shape[-1]
        elapsed = time.perf_counter() - start
        results['tokens_per_sec'] = total_tokens / elapsed

    # Memory
    results['gpu_memory_gb'] = torch.cuda.max_memory_allocated() / 1e9
    results['model_params_b'] = sum(p.numel() for p in model.parameters()) / 1e9
    return results
```
Performance Reference
| Technique | Speed Gain | Size Reduction | Quality Impact | Effort |
|---|---|---|---|---|
| INT8 Quantization | 1.5-2x | 2x | <1% loss | Low |
| INT4 Quantization | 2-3x | 4x | 1-3% loss | Low |
| Knowledge Distillation | Variable | 10-50x | 2-10% loss | High |
| Structured Pruning | 1.5-3x | 2-5x | 2-5% loss | Medium |
| Speculative Decoding | 2-3x | None | 0% loss | Medium |
| Model Merging | None | None | Varies | Low |
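The size-reduction column translates directly into memory estimates. A minimal sketch, assuming an FP16 baseline at 2 bytes per parameter and the reduction factors from the table (the function and dict names are illustrative):

```python
# Size-reduction factors relative to FP16, from the table above
REDUCTION = {'none': 1, 'int8': 2, 'int4': 4}

def estimated_size_gb(params_billions, technique='none'):
    """Estimate model size in GB after quantization, from an FP16 baseline."""
    fp16_gb = params_billions * 2  # 2 bytes per parameter
    return fp16_gb / REDUCTION[technique]
```

For example, a 70B model drops from roughly 140 GB at FP16 to about 35 GB at INT4, which is what makes single-node deployment feasible.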
Best Practices
- Measure before optimizing ā Profile your model to identify actual bottlenecks
- Set quality thresholds upfront ā Define minimum acceptable quality before starting
- Optimize incrementally ā Apply one technique at a time, measure, then stack
- Use task-specific evaluation ā Generic benchmarks may not reflect your use case
- Consider the full pipeline ā Include tokenization, preprocessing, and postprocessing in benchmarks
- Test on representative data ā Edge cases often reveal quality degradation
- Automate benchmarking ā Run evals automatically after each optimization step
- Document tradeoffs ā Record quality/speed/memory for each configuration
- Keep the original model ā Always maintain an unoptimized baseline for comparison
- Monitor in production ā Quality may degrade differently on real traffic vs benchmarks
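The "document tradeoffs" and "automate benchmarking" practices above pair naturally: log one record per configuration so quality, speed, and memory are always compared side by side. A minimal sketch (the record fields and helper name are assumptions):

```python
from dataclasses import dataclass, asdict

@dataclass
class TradeoffRecord:
    """One row per optimized configuration: quality, speed, and memory together."""
    config: str
    quality: float         # task-specific eval score, not a generic benchmark
    tokens_per_sec: float
    gpu_memory_gb: float

def log_tradeoff(records, config, quality, tps, mem):
    """Append a record and return it as a plain dict for reporting."""
    records.append(TradeoffRecord(config, quality, tps, mem))
    return asdict(records[-1])
```

Keeping the unoptimized baseline as the first record makes every later row directly comparable to it.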
Troubleshooting
Quality dropped more than expected after quantization
```python
# Try calibration with representative data
from auto_gptq import AutoGPTQForCausalLM

quantized = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config=quantize_config,
)
# Calibrate with diverse, representative data
quantized.quantize(calibration_dataset)
```
Combined techniques cause cascading quality loss
```python
# Evaluate quality after EACH step, not just at the end
for i, (technique, config) in enumerate(pipeline):
    model = apply_technique(model, technique, config)
    score = evaluate(model, eval_dataset)
    print(f"Step {i} ({technique}): score={score:.4f}")
    if score < threshold:
        print(f"STOP: Quality below threshold at step {i}")
        model = rollback(i - 1)  # Revert to previous step
        break
```