LLM Inference Serving Platform
Overview
A comprehensive skill for deploying and serving LLMs in production — covering high-throughput inference engines (vLLM, TGI, SGLang), API gateway design, load balancing, auto-scaling, model routing, and cost optimization. Enables serving LLMs at scale with sub-100ms latency, continuous batching, and efficient GPU utilization.
When to Use
- Deploying LLMs to production
- Need high-throughput inference (1000+ req/sec)
- Serving multiple models with intelligent routing
- Auto-scaling GPU instances based on demand
- Optimizing inference costs (GPU, tokens, latency)
- Building OpenAI-compatible API endpoints
Quick Start
```bash
# vLLM (most popular)
pip install vllm
vllm serve meta-llama/Llama-3-8B-Instruct --port 8000

# Text Generation Inference (TGI)
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3-8B-Instruct

# SGLang (fastest for complex prompts)
pip install sglang
python -m sglang.launch_server --model-path meta-llama/Llama-3-8B-Instruct --port 30000
```
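All three servers expose an OpenAI-compatible `/v1/chat/completions` endpoint, so a single client works against any of them. A minimal sketch using only the standard library, assuming the vLLM server from the Quick Start is running on port 8000 (model name and port are taken from the commands above):

```python
import json
import urllib.request

# Request body in the OpenAI chat-completions format accepted by vLLM, TGI, and SGLang.
payload = {
    "model": "meta-llama/Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain continuous batching in one sentence."}],
    "max_tokens": 128,
    "temperature": 0.7,
}

def chat(base_url: str, body: dict) -> dict:
    """POST the payload to /v1/chat/completions and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Requires a running server, so the call itself is left commented out:
# result = chat("http://localhost:8000", payload)
# print(result["choices"][0]["message"]["content"])
```

Because the request format is standardized, swapping engines later only means changing the base URL.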
Engine Comparison
| Feature | vLLM | TGI | SGLang | llama.cpp |
|---|---|---|---|---|
| Throughput | Very high | High | Highest | Medium |
| Latency | Low | Low | Lowest | Medium |
| Continuous Batching | Yes | Yes | Yes | Limited |
| PagedAttention | Yes | Yes | Yes (RadixAttention) | No |
| Tensor Parallelism | Yes | Yes | Yes | No |
| Quantization | AWQ, GPTQ | AWQ, GPTQ, EETQ | AWQ, GPTQ | GGUF (all) |
| CPU Support | No | No | No | Yes |
| OpenAI API | Built-in | Compatible | Built-in | Built-in |
| Structured Output | Yes | Yes | Yes | Via grammar |
vLLM Production Setup
```python
# production_server.py
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Production configuration
engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3-8B-Instruct",
    tensor_parallel_size=2,         # Split across 2 GPUs
    gpu_memory_utilization=0.9,     # Use 90% of VRAM
    max_model_len=8192,             # Maximum context
    enable_prefix_caching=True,     # Cache common prefixes
    max_num_seqs=256,               # Maximum concurrent sequences
    max_num_batched_tokens=32768,   # Maximum tokens per batch
    quantization="awq",             # Use AWQ quantization
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```
```bash
# Launch vLLM server with production settings
vllm serve meta-llama/Llama-3-8B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --enable-prefix-caching \
  --max-num-seqs 256 \
  --quantization awq \
  --port 8000
```
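The server streams tokens as server-sent events in the OpenAI chunk format (`data: {...}` lines ending with `data: [DONE]`). A minimal parser sketch for those chunks; the sample lines below are fabricated to illustrate the shape, not captured from a live server:

```python
import json
from typing import Iterable, Iterator

def iter_stream_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield content deltas from OpenAI-style SSE chat-completion chunks."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue                       # skip comments and blank keep-alives
        data = line[len("data:"):].strip()
        if data == "[DONE]":               # end-of-stream sentinel
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Fabricated chunks following the OpenAI streaming format:
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    'data: [DONE]',
]
print("".join(iter_stream_tokens(sample)))  # → Hello
```

In production the `lines` iterable would come from the HTTP response body with `stream: true` set in the request.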
Multi-Model Routing
```python
from typing import Dict

import aiohttp

class ModelRouter:
    def __init__(self):
        self.models: Dict[str, str] = {
            "fast": "http://localhost:8001",      # Small model
            "balanced": "http://localhost:8002",  # Medium model
            "quality": "http://localhost:8003",   # Large model
        }

    def select_model(self, request) -> str:
        prompt_length = len(request.get("messages", []))
        max_tokens = request.get("max_tokens", 256)
        # Route based on complexity
        if max_tokens <= 50 and prompt_length <= 2:
            return "fast"
        elif max_tokens <= 500:
            return "balanced"
        else:
            return "quality"

    async def route(self, request):
        tier = self.select_model(request)
        endpoint = self.models[tier]
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"{endpoint}/v1/chat/completions",
                json=request,
            ) as resp:
                return await resp.json()
```
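The tier-selection heuristic can be sanity-checked without any live backends. A standalone sketch of the same rules (request counts and thresholds match the router above; the sample requests are illustrative):

```python
def select_tier(request: dict) -> str:
    """Standalone mirror of the router's heuristic: route on message count and max_tokens."""
    prompt_length = len(request.get("messages", []))
    max_tokens = request.get("max_tokens", 256)
    if max_tokens <= 50 and prompt_length <= 2:
        return "fast"
    elif max_tokens <= 500:
        return "balanced"
    return "quality"

# Short single-turn request with a small output budget → smallest model
print(select_tier({"messages": [{"role": "user", "content": "hi"}], "max_tokens": 20}))  # → fast

# Default max_tokens (256) falls in the middle band
print(select_tier({"messages": [{"role": "user", "content": "hi"}]}))  # → balanced

# Large output budget → largest model
print(select_tier({"messages": [], "max_tokens": 2000}))  # → quality
```

Keeping the selection logic in a pure function like this makes the routing policy unit-testable separately from the HTTP plumbing.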
Auto-Scaling Configuration
```yaml
# Kubernetes HPA for GPU inference
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "80"
    - type: Pods
      pods:
        metric:
          name: request_queue_size
        target:
          type: AverageValue
          averageValue: "10"
```
Best Practices
- Use vLLM for most production deployments — Best balance of features and performance
- Enable continuous batching — Dramatic throughput improvement over static batching
- Use prefix caching — Reduces latency for repeated system prompts
- Right-size GPU memory — Set `gpu_memory_utilization=0.9` to maximize throughput
- Use tensor parallelism — Split large models across GPUs for lower latency
- Quantize for throughput — AWQ quantization gives 2x throughput with minimal quality loss
- Implement request queuing — Handle bursts without overloading the GPU
- Monitor tail latency — P99 latency matters more than average
- Use streaming responses — Better user experience and lower perceived latency
- Cache model weights — Use shared memory or pre-loaded containers for fast startup
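The request-queuing practice above can be sketched with an asyncio semaphore that caps in-flight requests while excess requests wait in line. The burst size (20) and concurrency limit (4) are illustrative, and the sleep stands in for the actual model call:

```python
import asyncio

async def handle(request_id: int, slots: asyncio.Semaphore, stats: dict) -> str:
    """Wait for a free slot, then 'run inference'."""
    async with slots:                  # blocks here while all slots are busy
        stats["inflight"] += 1
        stats["peak"] = max(stats["peak"], stats["inflight"])
        await asyncio.sleep(0.01)      # stand-in for the actual model call
        stats["inflight"] -= 1
    return f"done-{request_id}"

async def serve_burst(n_requests: int, max_inflight: int) -> tuple[list, int]:
    """Fire a burst of requests; only max_inflight run at once, the rest queue."""
    slots = asyncio.Semaphore(max_inflight)
    stats = {"inflight": 0, "peak": 0}
    results = await asyncio.gather(
        *(handle(i, slots, stats) for i in range(n_requests))
    )
    return results, stats["peak"]

results, peak = asyncio.run(serve_burst(20, 4))
print(f"completed={len(results)} peak_concurrency={peak}")
```

The same shape applies in front of a GPU server: the semaphore absorbs bursts so the engine's own batch queue never sees more concurrency than it was sized for.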
Troubleshooting
High latency under load
```bash
# Check if continuous batching is working
# Monitor queue depth — if growing, add more replicas
# Reduce max_model_len to free VRAM for more concurrent requests
--max-model-len 4096  # Instead of 8192
```
GPU utilization low
```bash
# Increase max concurrent sequences
--max-num-seqs 512
# Enable prefix caching for repeated prompts
--enable-prefix-caching
# Check if data loading is the bottleneck
```
Model loading OOM
```bash
# Use quantization
--quantization awq
# Reduce GPU memory fraction
--gpu-memory-utilization 0.8
# Use tensor parallelism to split across GPUs
--tensor-parallel-size 2
```