Inference Serving Elite
Production-ready skill for high-throughput LLM serving. Includes structured workflows, validation checks, and reusable patterns for AI research.
Elite LLM Inference Optimization
Overview
A comprehensive skill for maximizing LLM inference performance — covering advanced optimization techniques including KV cache optimization, speculative decoding integration, continuous batching tuning, multi-LoRA serving, disaggregated prefill/decode, and cost-per-token analysis. Designed for production environments serving millions of requests with strict latency and cost requirements.
When to Use
- Optimizing production LLM serving for cost and latency
- Targeting sub-100ms time-to-first-token (TTFT)
- Serving multiple LoRA adapters efficiently
- Implementing disaggregated prefill and decode
- Cost optimization for high-volume inference
- A/B testing model variants in production
Quick Start
```bash
# vLLM with optimizations
vllm serve meta-llama/Llama-3-8B-Instruct \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-seqs 512 \
  --gpu-memory-utilization 0.92 \
  --quantization awq

# Multi-LoRA serving
vllm serve meta-llama/Llama-3-8B-Instruct \
  --enable-lora \
  --lora-modules task1=./lora-task1 task2=./lora-task2 \
  --max-loras 4
```
KV Cache Optimization
```bash
# PagedAttention — dynamic KV cache allocation.
# Instead of pre-allocating max_seq_len per request,
# allocate pages on demand (like virtual memory).
# vLLM handles this automatically. Key parameters to tune:

# Block size — smaller = less waste, more overhead (default: 16 tokens per block)
--block-size 16

# Cache utilization — how much GPU memory for KV cache (use 92% of VRAM)
--gpu-memory-utilization 0.92

# Prefix caching — reuse KV cache for shared prefixes
--enable-prefix-caching
```
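Prefix caching can be pictured as a block store keyed by the hash of the entire prefix up to each block, so requests that share a system prompt map to the same cached blocks. A toy illustration of the idea — not vLLM's actual implementation; the block size and hashing scheme here are simplified stand-ins:

```python
# Toy prefix cache: each fixed-size token block is keyed by the hash of the
# full prefix through that block, so two requests sharing a system prompt
# reuse the same cached entries (stand-ins for real KV blocks).
BLOCK_SIZE = 4

def cached_blocks(tokens, cache):
    """Count blocks reused from the cache; insert any blocks not yet cached."""
    reused = 0
    for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
        key = hash(tuple(tokens[:i + BLOCK_SIZE]))  # prefix-dependent key
        if key in cache:
            reused += 1
        else:
            cache[key] = True  # stand-in for storing the real KV block
    return reused

cache = {}
system_prompt = list(range(8))              # shared 8-token "system prompt"
req1 = system_prompt + [100, 101, 102, 103]
req2 = system_prompt + [200, 201, 202, 203]

first = cached_blocks(req1, cache)   # cold cache: 0 blocks reused
second = cached_blocks(req2, cache)  # shared-prefix blocks hit: 2 reused
```

The second request reuses both system-prompt blocks, which is why the benefit scales with how much of each request is a shared prefix.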
KV Cache Memory Calculator
```python
def calculate_kv_cache_memory(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    max_seq_len: int,
    max_batch_size: int,
    dtype_bytes: int = 2,  # bf16
) -> float:
    """Calculate KV cache memory in GB."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # 2 for K+V
    total = per_token * max_seq_len * max_batch_size
    return total / (1024 ** 3)

# Llama-3 8B: 32 layers, 8 KV heads, 128 head dim
kv_gb = calculate_kv_cache_memory(32, 8, 128, 8192, 256)
print(f"KV cache: {kv_gb:.1f} GB")  # ~256 GB worst case (1 GiB per full 8K sequence);
                                    # PagedAttention allocates far less in practice
```
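The same per-token formula shows why grouped-query attention (GQA) matters for serving: KV memory scales with the number of KV heads, and Llama-3 8B uses 8 KV heads against 32 query heads. A quick self-contained check (the full-MHA variant is hypothetical, for comparison only):

```python
# KV cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes.
# GQA shrinks the cache by num_query_heads / num_kv_heads.
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

mha = kv_bytes_per_token(32, 32, 128)  # hypothetical full multi-head cache
gqa = kv_bytes_per_token(32, 8, 128)   # Llama-3 8B's actual GQA cache
print(gqa, mha // gqa)  # 131072 bytes/token, 4x smaller than full MHA
```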
Multi-LoRA Serving
```python
# Serve multiple task-specific LoRA adapters
# from a single base model instance

# vLLM approach
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    enable_lora=True,
    max_loras=4,        # Max concurrent LoRA adapters
    max_lora_rank=32,
)

# Request with a specific LoRA
output = llm.generate(
    "Translate to French: Hello world",
    SamplingParams(max_tokens=50),
    lora_request=LoRARequest("translator", 1, "./lora-translate"),
)

# Different request with a different LoRA
output = llm.generate(
    "Summarize this document...",
    SamplingParams(max_tokens=200),
    lora_request=LoRARequest("summarizer", 2, "./lora-summarize"),
)
```
Cost Optimization
```python
class CostTracker:
    def __init__(self):
        self.gpu_cost_per_hour = 2.50  # e.g., A100 spot pricing
        self.requests_served = 0
        self.total_tokens = 0
        self.gpu_hours = 0.0

    def log_request(self, input_tokens, output_tokens, latency_ms):
        self.requests_served += 1
        self.total_tokens += input_tokens + output_tokens
        self.gpu_hours += latency_ms / (1000 * 3600)

    @property
    def cost_per_1k_tokens(self):
        if self.total_tokens == 0:
            return 0
        return (self.gpu_hours * self.gpu_cost_per_hour) / (self.total_tokens / 1000)

    @property
    def cost_per_request(self):
        if self.requests_served == 0:
            return 0
        return (self.gpu_hours * self.gpu_cost_per_hour) / self.requests_served
```
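For capacity planning, the same arithmetic inverts into a one-line break-even formula: cost per million tokens is the hourly GPU price divided by tokens generated per hour. The GPU price and throughput below are illustrative assumptions, not benchmarks:

```python
# Cost per 1M tokens = hourly GPU price / tokens generated per hour.
def cost_per_million_tokens(gpu_cost_per_hour, tokens_per_second):
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

# Illustrative: $2.50/hr GPU sustaining 2500 tok/s across the whole batch
c = cost_per_million_tokens(2.50, 2500)
print(f"${c:.3f} per 1M tokens")  # $0.278 per 1M tokens
```

Note that throughput here is aggregate across the batch, which is why continuous batching improvements translate directly into lower cost per token.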
Performance Tuning Reference
| Optimization | Impact | Effort | Trade-off |
|---|---|---|---|
| Continuous Batching | 5-10x throughput | Built-in | None |
| Prefix Caching | 2-5x for shared prefixes | Config flag | Memory |
| AWQ Quantization | 2x throughput | One-time quant | <1% quality |
| Tensor Parallelism | Linear scale per GPU | Config flag | Inter-GPU comm |
| Chunked Prefill | Better tail latency | Config flag | Slight throughput |
| Multi-LoRA | Serve N tasks, 1 GPU | Config flag | Memory per adapter |
| Speculative Decoding | 2-3x faster generation | Draft model needed | Memory for draft |
Best Practices
- Measure everything — Track TTFT, TPS, P50/P95/P99 latency, GPU utilization
- Use prefix caching for chatbots — System prompts are repeated across all requests
- Right-size your context window — Don't set max_model_len higher than needed
- Quantize aggressively — AWQ 4-bit gives 2x throughput with minimal quality impact
- Use continuous batching — Never use static batching in production
- Implement request timeouts — Kill requests that exceed SLA
- Monitor queue depth — Growing queue = need more replicas
- Use streaming — Lower perceived latency for users
- Pre-warm the model — Send dummy requests after startup
- Track cost per request — Optimize for cost-per-useful-token, not raw throughput
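"Measure everything" above implies tracking percentile latency, not just averages. A minimal sliding-window P50/P95/P99 tracker — the window size and nearest-rank index method are arbitrary illustrative choices:

```python
from collections import deque

class LatencyTracker:
    """Sliding-window latency percentiles for TTFT or end-to-end latency."""

    def __init__(self, window=1000):
        self.samples = deque(maxlen=window)  # oldest samples drop off the window

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def percentile(self, p):
        if not self.samples:
            return 0.0
        data = sorted(self.samples)
        idx = min(len(data) - 1, int(p / 100 * len(data)))  # nearest-rank index
        return data[idx]

tracker = LatencyTracker()
for ms in range(1, 101):  # synthetic samples: 1..100 ms
    tracker.record(ms)
p50, p99 = tracker.percentile(50), tracker.percentile(99)
```

Sorting per query is fine at dashboard refresh rates; for per-request hot paths, a streaming sketch such as t-digest would be the usual swap-in.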
Troubleshooting
P99 latency spikes
```bash
# Enable chunked prefill to prevent long-prompt preemption
--enable-chunked-prefill

# Reduce max batch size
--max-num-seqs 128
```

```python
# Check for GC pauses in long-running server processes
import gc
gc.disable()
```
GPU utilization drops during inference
```bash
# Increase max concurrent sequences
--max-num-seqs 512

# Check the data pipeline — are requests arriving fast enough?

# Enable prefix caching to reduce redundant compute
--enable-prefix-caching
```