
Advanced Inference Platform

Production-ready skill for fast, structured LLM generation and serving. Includes structured workflows, validation checks, and reusable patterns for AI research.


LLM Inference Serving Platform

Overview

A comprehensive skill for deploying and serving LLMs in production — covering high-throughput inference engines (vLLM, TGI, SGLang), API gateway design, load balancing, auto-scaling, model routing, and cost optimization. Enables serving LLMs at scale with sub-100ms latency, continuous batching, and efficient GPU utilization.

When to Use

  • Deploying LLMs to production
  • Need high-throughput inference (1000+ req/sec)
  • Serving multiple models with intelligent routing
  • Auto-scaling GPU instances based on demand
  • Optimizing inference costs (GPU, tokens, latency)
  • Building OpenAI-compatible API endpoints

Quick Start

```shell
# vLLM (most popular)
pip install vllm
vllm serve meta-llama/Llama-3-8B-Instruct --port 8000

# Text Generation Inference (TGI)
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3-8B-Instruct

# SGLang (fastest for complex prompts)
pip install sglang
python -m sglang.launch_server --model-path meta-llama/Llama-3-8B-Instruct --port 30000
```

Engine Comparison

| Feature | vLLM | TGI | SGLang | llama.cpp |
|---|---|---|---|---|
| Throughput | Very high | High | Highest | Medium |
| Latency | Low | Low | Lowest | Medium |
| Continuous Batching | Yes | Yes | Yes | Limited |
| PagedAttention | Yes | Yes | Yes (RadixAttention) | No |
| Tensor Parallelism | Yes | Yes | Yes | No |
| Quantization | AWQ, GPTQ | AWQ, GPTQ, EETQ | AWQ, GPTQ | GGUF (all) |
| CPU Support | No | No | No | Yes |
| OpenAI API | Built-in | Compatible | Built-in | Built-in |
| Structured Output | Yes | Yes | Yes | Via grammar |
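The trade-offs in the table can be folded into a small selection helper. This is an illustrative sketch only (the function and its rules are not part of any engine's API); it encodes just the rows above:

```python
def pick_engine(cpu_only: bool = False,
                gguf_weights: bool = False,
                complex_prompts: bool = False) -> str:
    """Pick an inference engine using the trade-offs from the comparison table."""
    if cpu_only or gguf_weights:
        return "llama.cpp"   # Only engine with CPU support and GGUF quantization
    if complex_prompts:
        return "sglang"      # Highest throughput; RadixAttention reuses shared prefixes
    return "vllm"            # Best default for most production deployments
```

For example, `pick_engine(cpu_only=True)` returns `"llama.cpp"`, while the no-argument default lands on vLLM, matching best practice 1 below.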

vLLM Production Setup

```python
# production_server.py
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Production configuration
engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3-8B-Instruct",
    tensor_parallel_size=2,        # Split across 2 GPUs
    gpu_memory_utilization=0.9,    # Use 90% of VRAM
    max_model_len=8192,            # Maximum context length
    enable_prefix_caching=True,    # Cache common prefixes
    max_num_seqs=256,              # Maximum concurrent sequences
    max_num_batched_tokens=32768,  # Maximum tokens per batch
    quantization="awq",            # Use AWQ quantization
)

engine = AsyncLLMEngine.from_engine_args(engine_args)
```
```shell
# Launch vLLM server with production settings
vllm serve meta-llama/Llama-3-8B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --enable-prefix-caching \
  --max-num-seqs 256 \
  --quantization awq \
  --port 8000
```
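Once the server is up, any OpenAI-compatible client can talk to it. The sketch below builds a chat-completion payload and shows, commented out, how it would be posted; the endpoint URL and model name are assumptions matching the launch command above:

```python
import json


def build_chat_request(user_msg: str,
                       model: str = "meta-llama/Llama-3-8B-Instruct",
                       max_tokens: int = 256,
                       stream: bool = True) -> dict:
    """Build an OpenAI-style /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_msg},
        ],
        "max_tokens": max_tokens,
        "stream": stream,  # Streaming lowers perceived latency
    }


payload = build_chat_request("Summarize PagedAttention in one sentence.")

# To actually send it (requires the running server from above):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```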

Multi-Model Routing

```python
import aiohttp
from typing import Dict


class ModelRouter:
    def __init__(self):
        self.models: Dict[str, str] = {
            "fast": "http://localhost:8001",      # Small model
            "balanced": "http://localhost:8002",  # Medium model
            "quality": "http://localhost:8003",   # Large model
        }

    def select_model(self, request) -> str:
        num_messages = len(request.get("messages", []))
        max_tokens = request.get("max_tokens", 256)
        # Route based on expected complexity
        if max_tokens <= 50 and num_messages <= 2:
            return "fast"
        elif max_tokens <= 500:
            return "balanced"
        else:
            return "quality"

    async def route(self, request):
        tier = self.select_model(request)
        endpoint = self.models[tier]
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"{endpoint}/v1/chat/completions",
                json=request,
            ) as resp:
                return await resp.json()
```
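The routing thresholds can be exercised without any backend servers. This standalone sketch duplicates the decision logic of select_model so the tiers can be unit-tested in isolation (the thresholds themselves are the illustrative values from the router above, not tuned numbers):

```python
def select_tier(request: dict) -> str:
    """Standalone copy of the router's tier-selection rules, for testing."""
    num_messages = len(request.get("messages", []))
    max_tokens = request.get("max_tokens", 256)
    if max_tokens <= 50 and num_messages <= 2:
        return "fast"       # Short answer, short conversation: small model
    elif max_tokens <= 500:
        return "balanced"
    else:
        return "quality"    # Long generations go to the large model


print(select_tier({"messages": [{"role": "user", "content": "hi"}],
                   "max_tokens": 20}))
```

A quick one-word question routes to "fast"; a 1000-token generation request would route to "quality" regardless of conversation length.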

Auto-Scaling Configuration

```yaml
# Kubernetes HPA for GPU inference
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "80"
    - type: Pods
      pods:
        metric:
          name: request_queue_size
        target:
          type: AverageValue
          averageValue: "10"
```
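The HPA's core scaling rule is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to [minReplicas, maxReplicas]. A sketch of that arithmetic for the queue-size metric above (the example values are illustrative):

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_r: int = 1, max_r: int = 8) -> int:
    """Kubernetes HPA scaling rule: ceil(current * metric / target), clamped."""
    desired = math.ceil(current * metric / target)
    return max(min_r, min(max_r, desired))


# 3 replicas with an average queue of 25 requests against a target of 10
# per pod: ceil(3 * 25 / 10) = 8, so the HPA scales out to 8 replicas.
print(desired_replicas(3, 25, 10))
```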

Best Practices

  1. Use vLLM for most production deployments — Best balance of features and performance
  2. Enable continuous batching — Dramatic throughput improvement over static batching
  3. Use prefix caching — Reduces latency for repeated system prompts
  4. Right-size GPU memory — Set gpu_memory_utilization=0.9 to maximize throughput
  5. Use tensor parallelism — Split large models across GPUs for lower latency
  6. Quantize for throughput — AWQ quantization can roughly double throughput with minimal quality loss
  7. Implement request queuing — Handle bursts without overloading the GPU
  8. Monitor tail latency — P99 latency matters more than average
  9. Use streaming responses — Better user experience and lower perceived latency
  10. Cache model weights — Use shared memory or pre-loaded containers for fast startup
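Practice 7 (request queuing) can be sketched with an asyncio semaphore: requests beyond the concurrency cap wait in the event loop instead of piling onto the GPU. Everything here is illustrative; a real gateway would also bound queue length and time out stale requests:

```python
import asyncio

MAX_CONCURRENT = 4   # Tune to the engine's max_num_seqs budget
in_flight = 0
peak = 0


async def handle(request_id: int, sem: asyncio.Semaphore) -> int:
    """Admit the request only when a concurrency slot is free."""
    global in_flight, peak
    async with sem:
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)   # Stand-in for model inference
        in_flight -= 1
    return request_id


async def main() -> None:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    # A burst of 20 requests: only 4 run at once, the rest queue
    await asyncio.gather(*(handle(i, sem) for i in range(20)))
    print(f"peak concurrency: {peak}")


asyncio.run(main())
```

The burst of 20 requests completes, but peak concurrency never exceeds the cap, which is exactly the back-pressure behavior practice 7 asks for.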

Troubleshooting

High latency under load

```shell
# Check if continuous batching is working
# Monitor queue depth: if it keeps growing, add more replicas
# Reduce max_model_len to free VRAM for more concurrent requests
--max-model-len 4096  # Instead of 8192
```

GPU utilization low

```shell
# Increase max concurrent sequences
--max-num-seqs 512

# Enable prefix caching for repeated prompts
--enable-prefix-caching

# Check whether data loading is the bottleneck
```

Model loading OOM

```shell
# Use quantization
--quantization awq

# Reduce GPU memory fraction
--gpu-memory-utilization 0.8

# Use tensor parallelism to split across GPUs
--tensor-parallel-size 2
```
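These OOM fixes trade precision for memory or spread weights across GPUs. A back-of-the-envelope sketch of weight memory helps pick between them; it ignores KV cache, activations, and CUDA overhead, and the 8B figure is Llama-3-8B's parameter count:

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only: 1e9 params * bits/8 bytes = GB."""
    return params_billion * bits_per_param / 8


# Llama-3-8B, fp16 vs. AWQ 4-bit (weights only, per full model copy)
print(weight_memory_gb(8, 16))  # 16.0 GB of weights alone
print(weight_memory_gb(8, 4))   # 4.0 GB after 4-bit quantization
# With --tensor-parallel-size 2, each GPU holds roughly half the weights
```

This is why an fp16 8B model OOMs on a 16 GB card once KV cache is added, while the AWQ version leaves most of the VRAM free for concurrent sequences.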