Inference Serving Elite
Production-ready skill for high-throughput LLM serving. Includes structured workflows, validation checks, and reusable patterns for AI research.
Elite LLM Inference Optimization
Overview
A comprehensive skill for maximizing LLM inference performance — covering advanced optimization techniques including KV cache optimization, speculative decoding integration, continuous batching tuning, multi-LoRA serving, disaggregated prefill/decode, and cost-per-token analysis. Designed for production environments serving millions of requests with strict latency and cost requirements.
When to Use
- Optimizing production LLM serving for cost and latency
- Targeting sub-100ms time-to-first-token (TTFT)
- Serving multiple LoRA adapters efficiently
- Implementing disaggregated prefill and decode
- Cost optimization for high-volume inference
- A/B testing model variants in production
Quick Start
```bash
# vLLM with optimizations
vllm serve meta-llama/Llama-3-8B-Instruct \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-seqs 512 \
  --gpu-memory-utilization 0.92 \
  --quantization awq

# Multi-LoRA serving
vllm serve meta-llama/Llama-3-8B-Instruct \
  --enable-lora \
  --lora-modules task1=./lora-task1 task2=./lora-task2 \
  --max-loras 4
```
KV Cache Optimization
```bash
# PagedAttention — dynamic KV cache allocation.
# Instead of pre-allocating max_seq_len per request,
# allocate pages on demand (like virtual memory).
# vLLM handles this automatically. Key parameters to tune:

# Block size — smaller = less waste, more overhead (default: 16 tokens per block)
--block-size 16

# Cache utilization — how much GPU memory for KV cache (use 92% of VRAM)
--gpu-memory-utilization 0.92

# Prefix caching — reuse KV cache for shared prefixes
--enable-prefix-caching
```
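Prefix caching can be pictured as a block store keyed by the hash of the entire prefix up to each block, so requests that share a system prompt map to the same cached blocks. A toy illustration of the idea — not vLLM's actual implementation; the block size and hashing scheme here are simplified stand-ins:

```python
# Toy prefix cache: each fixed-size token block is keyed by the hash of the
# full prefix through that block, so two requests sharing a system prompt
# reuse the same cached entries (stand-ins for real KV blocks).
BLOCK_SIZE = 4

def cached_blocks(tokens, cache):
    """Count blocks reused from the cache; insert any blocks not yet cached."""
    reused = 0
    for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
        key = hash(tuple(tokens[:i + BLOCK_SIZE]))  # prefix-dependent key
        if key in cache:
            reused += 1
        else:
            cache[key] = True  # stand-in for storing the real KV block
    return reused

cache = {}
system_prompt = list(range(8))              # shared 8-token "system prompt"
req1 = system_prompt + [100, 101, 102, 103]
req2 = system_prompt + [200, 201, 202, 203]

first = cached_blocks(req1, cache)   # cold cache: 0 blocks reused
second = cached_blocks(req2, cache)  # shared-prefix blocks hit: 2 reused
```

The second request reuses both system-prompt blocks, which is why the benefit scales with how much of each request is a shared prefix.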
KV Cache Memory Calculator
```python
def calculate_kv_cache_memory(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    max_seq_len: int,
    max_batch_size: int,
    dtype_bytes: int = 2,  # bf16
) -> float:
    """Calculate KV cache memory in GB."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # 2 for K+V
    total = per_token * max_seq_len * max_batch_size
    return total / (1024 ** 3)

# Llama-3 8B: 32 layers, 8 KV heads, 128 head dim
kv_gb = calculate_kv_cache_memory(32, 8, 128, 8192, 256)
print(f"KV cache: {kv_gb:.1f} GB")  # ~256 GB worst case (1 GiB per full 8K sequence);
                                    # PagedAttention allocates far less in practice
```
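The same per-token formula shows why grouped-query attention (GQA) matters for serving: KV memory scales with the number of KV heads, and Llama-3 8B uses 8 KV heads against 32 query heads. A quick self-contained check (the full-MHA variant is hypothetical, for comparison only):

```python
# KV cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes.
# GQA shrinks the cache by num_query_heads / num_kv_heads.
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

mha = kv_bytes_per_token(32, 32, 128)  # hypothetical full multi-head cache
gqa = kv_bytes_per_token(32, 8, 128)   # Llama-3 8B's actual GQA cache
print(gqa, mha // gqa)  # 131072 bytes/token, 4x smaller than full MHA
```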
Multi-LoRA Serving
```python
# Serve multiple task-specific LoRA adapters
# from a single base model instance

# vLLM approach
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    enable_lora=True,
    max_loras=4,        # Max concurrent LoRA adapters
    max_lora_rank=32,
)

# Request with a specific LoRA
output = llm.generate(
    "Translate to French: Hello world",
    SamplingParams(max_tokens=50),
    lora_request=LoRARequest("translator", 1, "./lora-translate"),
)

# Different request with a different LoRA
output = llm.generate(
    "Summarize this document...",
    SamplingParams(max_tokens=200),
    lora_request=LoRARequest("summarizer", 2, "./lora-summarize"),
)
```
Cost Optimization
```python
class CostTracker:
    def __init__(self):
        self.gpu_cost_per_hour = 2.50  # e.g., A100 spot pricing
        self.requests_served = 0
        self.total_tokens = 0
        self.gpu_hours = 0.0

    def log_request(self, input_tokens, output_tokens, latency_ms):
        self.requests_served += 1
        self.total_tokens += input_tokens + output_tokens
        self.gpu_hours += latency_ms / (1000 * 3600)

    @property
    def cost_per_1k_tokens(self):
        if self.total_tokens == 0:
            return 0
        return (self.gpu_hours * self.gpu_cost_per_hour) / (self.total_tokens / 1000)

    @property
    def cost_per_request(self):
        if self.requests_served == 0:
            return 0
        return (self.gpu_hours * self.gpu_cost_per_hour) / self.requests_served
```
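For capacity planning, the same arithmetic inverts into a one-line break-even formula: cost per million tokens is the hourly GPU price divided by tokens generated per hour. The GPU price and throughput below are illustrative assumptions, not benchmarks:

```python
# Cost per 1M tokens = hourly GPU price / tokens generated per hour.
def cost_per_million_tokens(gpu_cost_per_hour, tokens_per_second):
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

# Illustrative: $2.50/hr GPU sustaining 2500 tok/s across the whole batch
c = cost_per_million_tokens(2.50, 2500)
print(f"${c:.3f} per 1M tokens")  # $0.278 per 1M tokens
```

Note that throughput here is aggregate across the batch, which is why continuous batching improvements translate directly into lower cost per token.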
Performance Tuning Reference
| Optimization | Impact | Effort | Trade-off |
|---|---|---|---|
| Continuous Batching | 5-10x throughput | Built-in | None |
| Prefix Caching | 2-5x for shared prefixes | Config flag | Memory |
| AWQ Quantization | 2x throughput | One-time quant | <1% quality |
| Tensor Parallelism | Linear scale per GPU | Config flag | Inter-GPU comm |
| Chunked Prefill | Better tail latency | Config flag | Slight throughput |
| Multi-LoRA | Serve N tasks, 1 GPU | Config flag | Memory per adapter |
| Speculative Decoding | 2-3x faster generation | Draft model needed | Memory for draft |
Best Practices
- Measure everything — Track TTFT, TPS, P50/P95/P99 latency, GPU utilization
- Use prefix caching for chatbots — System prompts are repeated across all requests
- Right-size your context window — Don't set max_model_len higher than needed
- Quantize aggressively — AWQ 4-bit gives 2x throughput with minimal quality impact
- Use continuous batching — Never use static batching in production
- Implement request timeouts — Kill requests that exceed SLA
- Monitor queue depth — Growing queue = need more replicas
- Use streaming — Lower perceived latency for users
- Pre-warm the model — Send dummy requests after startup
- Track cost per request — Optimize for cost-per-useful-token, not raw throughput
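"Measure everything" above implies tracking percentile latency, not just averages. A minimal sliding-window P50/P95/P99 tracker — the window size and nearest-rank index method are arbitrary illustrative choices:

```python
from collections import deque

class LatencyTracker:
    """Sliding-window latency percentiles for TTFT or end-to-end latency."""

    def __init__(self, window=1000):
        self.samples = deque(maxlen=window)  # oldest samples drop off the window

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def percentile(self, p):
        if not self.samples:
            return 0.0
        data = sorted(self.samples)
        idx = min(len(data) - 1, int(p / 100 * len(data)))  # nearest-rank index
        return data[idx]

tracker = LatencyTracker()
for ms in range(1, 101):  # synthetic samples: 1..100 ms
    tracker.record(ms)
p50, p99 = tracker.percentile(50), tracker.percentile(99)
```

Sorting per query is fine at dashboard refresh rates; for per-request hot paths, a streaming sketch such as t-digest would be the usual swap-in.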
Troubleshooting
P99 latency spikes
```bash
# Enable chunked prefill to prevent long-prompt preemption
--enable-chunked-prefill

# Reduce max batch size
--max-num-seqs 128
```

```python
# Check for GC pauses in long-running server processes
import gc
gc.disable()
```
GPU utilization drops during inference
```bash
# Increase max concurrent sequences
--max-num-seqs 512

# Check the data pipeline — are requests arriving fast enough?

# Enable prefix caching to reduce redundant compute
--enable-prefix-caching
```