Inference Serving Elite

A production-ready skill for high-throughput LLM serving. Includes structured workflows, validation checks, and reusable patterns for AI research.

Skill · Cliptics · ai research · v1.0.0 · MIT

Elite LLM Inference Optimization

Overview

A comprehensive skill for maximizing LLM inference performance — covering advanced optimization techniques including KV cache optimization, speculative decoding integration, continuous batching tuning, multi-LoRA serving, disaggregated prefill/decode, and cost-per-token analysis. Designed for production environments serving millions of requests with strict latency and cost requirements.

When to Use

  • Optimizing production LLM serving for cost and latency
  • Achieving sub-100ms time-to-first-token (TTFT)
  • Serving multiple LoRA adapters efficiently
  • Implementing disaggregated prefill and decode
  • Cost optimization for high-volume inference
  • A/B testing model variants in production

Quick Start

# vLLM with optimizations
vllm serve meta-llama/Llama-3-8B-Instruct \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-seqs 512 \
  --gpu-memory-utilization 0.92 \
  --quantization awq

# Multi-LoRA serving
vllm serve meta-llama/Llama-3-8B-Instruct \
  --enable-lora \
  --lora-modules task1=./lora-task1 task2=./lora-task2 \
  --max-loras 4

KV Cache Optimization

# PagedAttention — dynamic KV cache allocation.
# Instead of pre-allocating max_seq_len per request,
# allocate pages on demand (like virtual memory).
# vLLM handles this automatically.

# Key parameters to tune:

# Block size — smaller = less waste, more overhead (default: 16 tokens per block)
--block-size 16

# Cache utilization — how much GPU memory for weights + KV cache
--gpu-memory-utilization 0.92  # Use 92% of VRAM

# Prefix caching — reuse KV cache for shared prefixes
--enable-prefix-caching
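To see how these knobs interact, here is a rough back-of-envelope sketch of how many PagedAttention blocks fit on one card. The 80 GiB GPU, the ~16 GiB of bf16 weights, and the Llama-3-8B KV shape are illustrative assumptions, not measured values:

```python
# How many KV cache blocks fit after weights, under --gpu-memory-utilization?
GIB = 1024 ** 3

gpu_mem = 80 * GIB            # assumed: 80 GiB card
weights = 16 * GIB            # assumed: bf16 weights of an 8B model
utilization = 0.92            # --gpu-memory-utilization

kv_budget = gpu_mem * utilization - weights

# Llama-3-8B KV bytes per token: 2 (K+V) * 32 layers * 8 KV heads * 128 dim * 2 bytes
per_token = 2 * 32 * 8 * 128 * 2          # 128 KiB per token
block_size = 16                            # --block-size
per_block = per_token * block_size         # 2 MiB per block

num_blocks = int(kv_budget // per_block)
print(num_blocks, num_blocks * block_size)  # blocks available, tokens of cache
```

Under these assumptions the budget works out to roughly 29k blocks, i.e. about 470K tokens of cache shared across all concurrent sequences, which is why on-demand paging beats pre-allocating full contexts.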

KV Cache Memory Calculator

def calculate_kv_cache_memory(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    max_seq_len: int,
    max_batch_size: int,
    dtype_bytes: int = 2,  # bf16
) -> float:
    """Calculate KV cache memory in GB."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # 2 for K+V
    total = per_token * max_seq_len * max_batch_size
    return total / (1024 ** 3)

# Llama-3 8B: 32 layers, 8 KV heads, 128 head dim.
# Fully pre-allocating 256 sequences of 8K context would need:
kv_gb = calculate_kv_cache_memory(32, 8, 128, 8192, 256)
print(f"KV cache: {kv_gb:.1f} GB")  # 256.0 GB: exactly why PagedAttention allocates on demand

Multi-LoRA Serving

# Serve multiple task-specific LoRA adapters
# from a single base model instance

# vLLM approach
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    enable_lora=True,
    max_loras=4,       # Max concurrent LoRA adapters
    max_lora_rank=32,
)

# Request with a specific LoRA
output = llm.generate(
    "Translate to French: Hello world",
    SamplingParams(max_tokens=50),
    lora_request=LoRARequest("translator", 1, "./lora-translate"),
)

# Different request with a different LoRA
output = llm.generate(
    "Summarize this document...",
    SamplingParams(max_tokens=200),
    lora_request=LoRARequest("summarizer", 2, "./lora-summarize"),
)
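When the same adapters are registered at launch with --lora-modules, clients can select one through the OpenAI-compatible HTTP API by passing the adapter's registered name as the model. A minimal sketch; the localhost URL and the "task1" adapter name are assumptions matching the Quick Start example:

```python
import json
import urllib.request

def build_lora_request(adapter_name: str, prompt: str, max_tokens: int = 50) -> dict:
    # vLLM routes the request to whichever LoRA adapter matches "model"
    return {
        "model": adapter_name,
        "prompt": prompt,
        "max_tokens": max_tokens,
    }

payload = build_lora_request("task1", "Translate to French: Hello world")

# Sending it (uncomment against a running server):
# req = urllib.request.Request(
#     "http://localhost:8000/v1/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read())
```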

Cost Optimization

class CostTracker:
    def __init__(self):
        self.gpu_cost_per_hour = 2.50  # e.g., A100 spot pricing
        self.requests_served = 0
        self.total_tokens = 0
        self.gpu_hours = 0

    def log_request(self, input_tokens, output_tokens, latency_ms):
        self.requests_served += 1
        self.total_tokens += input_tokens + output_tokens
        # Note: summing per-request latency over-counts GPU time when requests
        # run batched concurrently, so treat the derived costs as upper bounds.
        self.gpu_hours += latency_ms / (1000 * 3600)

    @property
    def cost_per_1k_tokens(self):
        if self.total_tokens == 0:
            return 0
        return (self.gpu_hours * self.gpu_cost_per_hour) / (self.total_tokens / 1000)

    @property
    def cost_per_request(self):
        if self.requests_served == 0:
            return 0
        return (self.gpu_hours * self.gpu_cost_per_hour) / self.requests_served
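The same model collapses to a one-line sanity check: cost per million tokens is just the hourly GPU price divided by tokens produced per hour. The $2.50/hr and 2,500 tok/s figures below are illustrative assumptions:

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    # Tokens produced in one paid GPU-hour, then scale the hourly price
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

# Illustrative: a $2.50/hr GPU sustaining 2,500 tok/s aggregate throughput
print(f"${cost_per_million_tokens(2.50, 2500):.3f} per 1M tokens")  # $0.278 per 1M tokens
```

Running the numbers like this before and after each optimization in the table below is the fastest way to confirm a change actually moved cost, not just raw throughput.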

Performance Tuning Reference

| Optimization         | Impact                   | Effort             | Trade-off          |
|----------------------|--------------------------|--------------------|--------------------|
| Continuous Batching  | 5-10x throughput         | Built-in           | None               |
| Prefix Caching       | 2-5x for shared prefixes | Config flag        | Memory             |
| AWQ Quantization     | 2x throughput            | One-time quant     | <1% quality        |
| Tensor Parallelism   | Linear scale per GPU     | Config flag        | Inter-GPU comm     |
| Chunked Prefill      | Better tail latency      | Config flag        | Slight throughput  |
| Multi-LoRA           | Serve N tasks, 1 GPU     | Config flag        | Memory per adapter |
| Speculative Decoding | 2-3x faster generation   | Draft model needed | Memory for draft   |
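Speculative decoding is the one row that needs an extra artifact: a small draft model proposes several tokens, and the large model verifies them in a single forward pass. As a sketch only — speculative-decoding flag names have changed across vLLM releases, so check the docs for your installed version — the older flag style looks roughly like this:

```shell
# Assumed flags: draft model proposes, target model verifies in one pass
vllm serve meta-llama/Llama-3-70B-Instruct \
  --speculative-model meta-llama/Llama-3-8B-Instruct \
  --num-speculative-tokens 5 \
  --tensor-parallel-size 4
```

The draft model's weights and KV cache consume GPU memory, which is the trade-off listed in the table above.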

Best Practices

  1. Measure everything — Track TTFT, TPS, P50/P95/P99 latency, GPU utilization
  2. Use prefix caching for chatbots — System prompts are repeated across all requests
  3. Right-size your context window — Don't set max_model_len higher than needed
  4. Quantize aggressively — AWQ 4-bit gives 2x throughput with minimal quality impact
  5. Use continuous batching — Never use static batching in production
  6. Implement request timeouts — Kill requests that exceed SLA
  7. Monitor queue depth — Growing queue = need more replicas
  8. Use streaming — Lower perceived latency for users
  9. Pre-warm the model — Send dummy requests after startup
  10. Track cost per request — Optimize for cost-per-useful-token, not raw throughput
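Practice 9 above can be sketched in a few lines against a vLLM OpenAI-compatible endpoint. The localhost URL and model name are assumptions; the payload is built separately so it can be inspected before anything is sent:

```python
import json
import urllib.request

WARMUP_URL = "http://localhost:8000/v1/completions"  # assumed local vLLM server

def build_warmup_payload(model: str) -> dict:
    # A tiny dummy completion exercises the full serving path so the first
    # real request does not pay one-time initialization cost
    return {"model": model, "prompt": "warmup", "max_tokens": 1}

def prewarm(model: str, n_requests: int = 4) -> None:
    for _ in range(n_requests):
        req = urllib.request.Request(
            WARMUP_URL,
            data=json.dumps(build_warmup_payload(model)).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=30).read()

payload = build_warmup_payload("meta-llama/Llama-3-8B-Instruct")
# prewarm("meta-llama/Llama-3-8B-Instruct")  # call once after server startup
```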

Troubleshooting

P99 latency spikes

# Enable chunked prefill to prevent long-prompt preemption
--enable-chunked-prefill

# Reduce max batch size
--max-num-seqs 128

# Check for GC pauses in long-running server processes
import gc; gc.disable()

GPU utilization drops during inference

# Increase max concurrent sequences
--max-num-seqs 512

# Check the data pipeline — are requests arriving fast enough?

# Enable prefix caching to reduce redundant compute
--enable-prefix-caching
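Both symptoms are easier to diagnose from the server's Prometheus metrics, which vLLM exposes at /metrics. A minimal parse of the text exposition format, shown against a synthetic sample — the metric names can differ between vLLM versions, so verify them against your deployment's /metrics output:

```python
def parse_gauge(metrics_text: str, name: str) -> float:
    # Pull a single gauge value out of Prometheus text exposition format
    for line in metrics_text.splitlines():
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[1])
    raise KeyError(name)

# Synthetic sample; in production fetch http://localhost:8000/metrics instead
sample = """\
vllm:num_requests_running 12.0
vllm:num_requests_waiting 3.0
"""

waiting = parse_gauge(sample, "vllm:num_requests_waiting")
print(waiting)  # 3.0
```

A queue depth that grows steadily rather than oscillating around zero is the "need more replicas" signal from Best Practices item 7.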