LLM Inference Serving Platform
Overview
A comprehensive skill for deploying and serving LLMs in production — covering high-throughput inference engines (vLLM, TGI, SGLang), API gateway design, load balancing, auto-scaling, model routing, and cost optimization. Enables serving LLMs at scale with sub-100ms latency, continuous batching, and efficient GPU utilization.
When to Use
- Deploying LLMs to production
- Need high-throughput inference (1000+ req/sec)
- Serving multiple models with intelligent routing
- Auto-scaling GPU instances based on demand
- Optimizing inference costs (GPU, tokens, latency)
- Building OpenAI-compatible API endpoints
Quick Start
```bash
# vLLM (most popular)
pip install vllm
vllm serve meta-llama/Llama-3-8B-Instruct --port 8000

# Text Generation Inference (TGI)
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3-8B-Instruct

# SGLang (fastest for complex prompts)
pip install sglang
python -m sglang.launch_server --model-path meta-llama/Llama-3-8B-Instruct --port 30000
```
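All three servers expose an OpenAI-compatible `/v1/chat/completions` endpoint, so a single client works against any of them. A minimal sketch using only the standard library, assuming the vLLM server from the Quick Start is running on port 8000 (model name and port are taken from the commands above):

```python
import json
import urllib.request

# Request body in the OpenAI chat-completions format accepted by vLLM, TGI, and SGLang.
payload = {
    "model": "meta-llama/Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain continuous batching in one sentence."}],
    "max_tokens": 128,
    "temperature": 0.7,
}

def chat(base_url: str, body: dict) -> dict:
    """POST the payload to /v1/chat/completions and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Requires a running server, so the call itself is left commented out:
# result = chat("http://localhost:8000", payload)
# print(result["choices"][0]["message"]["content"])
```

Because the request format is standardized, swapping engines later only means changing the base URL.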
Engine Comparison
| Feature | vLLM | TGI | SGLang | llama.cpp |
|---|---|---|---|---|
| Throughput | Very high | High | Highest | Medium |
| Latency | Low | Low | Lowest | Medium |
| Continuous Batching | Yes | Yes | Yes | Limited |
| PagedAttention | Yes | Yes | Yes (RadixAttention) | No |
| Tensor Parallelism | Yes | Yes | Yes | No |
| Quantization | AWQ, GPTQ | AWQ, GPTQ, EETQ | AWQ, GPTQ | GGUF (all) |
| CPU Support | No | No | No | Yes |
| OpenAI API | Built-in | Compatible | Built-in | Built-in |
| Structured Output | Yes | Yes | Yes | Via grammar |
vLLM Production Setup
```python
# production_server.py
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Production configuration
engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3-8B-Instruct",
    tensor_parallel_size=2,         # Split across 2 GPUs
    gpu_memory_utilization=0.9,     # Use 90% of VRAM
    max_model_len=8192,             # Maximum context
    enable_prefix_caching=True,     # Cache common prefixes
    max_num_seqs=256,               # Maximum concurrent sequences
    max_num_batched_tokens=32768,   # Maximum tokens per batch
    quantization="awq",             # Use AWQ quantization
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```
```bash
# Launch vLLM server with production settings
vllm serve meta-llama/Llama-3-8B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --enable-prefix-caching \
  --max-num-seqs 256 \
  --quantization awq \
  --port 8000
```
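The server streams tokens as server-sent events in the OpenAI chunk format (`data: {...}` lines ending with `data: [DONE]`). A minimal parser sketch for those chunks; the sample lines below are fabricated to illustrate the shape, not captured from a live server:

```python
import json
from typing import Iterable, Iterator

def iter_stream_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield content deltas from OpenAI-style SSE chat-completion chunks."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue                       # skip comments and blank keep-alives
        data = line[len("data:"):].strip()
        if data == "[DONE]":               # end-of-stream sentinel
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Fabricated chunks following the OpenAI streaming format:
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    'data: [DONE]',
]
print("".join(iter_stream_tokens(sample)))  # → Hello
```

In production the `lines` iterable would come from the HTTP response body with `stream: true` set in the request.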
Multi-Model Routing
```python
from typing import Dict

import aiohttp

class ModelRouter:
    def __init__(self):
        self.models: Dict[str, str] = {
            "fast": "http://localhost:8001",      # Small model
            "balanced": "http://localhost:8002",  # Medium model
            "quality": "http://localhost:8003",   # Large model
        }

    def select_model(self, request) -> str:
        prompt_length = len(request.get("messages", []))
        max_tokens = request.get("max_tokens", 256)
        # Route based on complexity
        if max_tokens <= 50 and prompt_length <= 2:
            return "fast"
        elif max_tokens <= 500:
            return "balanced"
        else:
            return "quality"

    async def route(self, request):
        tier = self.select_model(request)
        endpoint = self.models[tier]
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"{endpoint}/v1/chat/completions",
                json=request,
            ) as resp:
                return await resp.json()
```
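The tier-selection heuristic can be sanity-checked without any live backends. A standalone sketch of the same rules (request counts and thresholds match the router above; the sample requests are illustrative):

```python
def select_tier(request: dict) -> str:
    """Standalone mirror of the router's heuristic: route on message count and max_tokens."""
    prompt_length = len(request.get("messages", []))
    max_tokens = request.get("max_tokens", 256)
    if max_tokens <= 50 and prompt_length <= 2:
        return "fast"
    elif max_tokens <= 500:
        return "balanced"
    return "quality"

# Short single-turn request with a small output budget → smallest model
print(select_tier({"messages": [{"role": "user", "content": "hi"}], "max_tokens": 20}))  # → fast

# Default max_tokens (256) falls in the middle band
print(select_tier({"messages": [{"role": "user", "content": "hi"}]}))  # → balanced

# Large output budget → largest model
print(select_tier({"messages": [], "max_tokens": 2000}))  # → quality
```

Keeping the selection logic in a pure function like this makes the routing policy unit-testable separately from the HTTP plumbing.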
Auto-Scaling Configuration
```yaml
# Kubernetes HPA for GPU inference
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "80"
    - type: Pods
      pods:
        metric:
          name: request_queue_size
        target:
          type: AverageValue
          averageValue: "10"
```
Best Practices
- Use vLLM for most production deployments — Best balance of features and performance
- Enable continuous batching — Dramatic throughput improvement over static batching
- Use prefix caching — Reduces latency for repeated system prompts
- Right-size GPU memory — Set `gpu_memory_utilization=0.9` to maximize throughput
- Use tensor parallelism — Split large models across GPUs for lower latency
- Quantize for throughput — AWQ quantization gives 2x throughput with minimal quality loss
- Implement request queuing — Handle bursts without overloading the GPU
- Monitor tail latency — P99 latency matters more than average
- Use streaming responses — Better user experience and lower perceived latency
- Cache model weights — Use shared memory or pre-loaded containers for fast startup
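The request-queuing practice above can be sketched with an asyncio semaphore that caps in-flight requests while excess requests wait in line. The burst size (20) and concurrency limit (4) are illustrative, and the sleep stands in for the actual model call:

```python
import asyncio

async def handle(request_id: int, slots: asyncio.Semaphore, stats: dict) -> str:
    """Wait for a free slot, then 'run inference'."""
    async with slots:                  # blocks here while all slots are busy
        stats["inflight"] += 1
        stats["peak"] = max(stats["peak"], stats["inflight"])
        await asyncio.sleep(0.01)      # stand-in for the actual model call
        stats["inflight"] -= 1
    return f"done-{request_id}"

async def serve_burst(n_requests: int, max_inflight: int) -> tuple[list, int]:
    """Fire a burst of requests; only max_inflight run at once, the rest queue."""
    slots = asyncio.Semaphore(max_inflight)
    stats = {"inflight": 0, "peak": 0}
    results = await asyncio.gather(
        *(handle(i, slots, stats) for i in range(n_requests))
    )
    return results, stats["peak"]

results, peak = asyncio.run(serve_burst(20, 4))
print(f"completed={len(results)} peak_concurrency={peak}")
```

The same shape applies in front of a GPU server: the semaphore absorbs bursts so the engine's own batch queue never sees more concurrency than it was sized for.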
Troubleshooting
High latency under load
```bash
# Check if continuous batching is working
# Monitor queue depth — if growing, add more replicas
# Reduce max_model_len to free VRAM for more concurrent requests
--max-model-len 4096  # Instead of 8192
```
GPU utilization low
```bash
# Increase max concurrent sequences
--max-num-seqs 512
# Enable prefix caching for repeated prompts
--enable-prefix-caching
# Check if data loading is the bottleneck
```
Model loading OOM
```bash
# Use quantization
--quantization awq
# Reduce GPU memory fraction
--gpu-memory-utilization 0.8
# Use tensor parallelism to split across GPUs
--tensor-parallel-size 2
```