Complete LLM Inference Serving with SGLang
Overview
A comprehensive skill for high-performance LLM serving using SGLang — the inference engine that achieves the highest throughput through RadixAttention (automatic KV cache reuse), continuous batching, and optimized CUDA kernels. SGLang provides an OpenAI-compatible API plus a unique programming interface for complex multi-step LLM programs with branching, forking, and constrained generation.
When to Use
- Need highest possible inference throughput
- Serving structured generation (JSON, regex-constrained)
- Building complex multi-step LLM programs
- Need automatic KV cache reuse across requests
- Serving models with shared system prompts
- Need efficient prefix/suffix caching
Quick Start
```bash
# Install
pip install "sglang[all]"

# Launch server
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3-8B-Instruct \
  --port 30000

# Use as an OpenAI-compatible API
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello"}]}'
```
RadixAttention
Standard KV cache (without automatic prefix caching):

```
Request 1: [System Prompt | User Message A] → compute all KV
Request 2: [System Prompt | User Message B] → recompute system prompt KV
```

RadixAttention (SGLang):

```
Request 1: [System Prompt | User Message A] → compute all KV
Request 2: [System Prompt | User Message B] → reuse system prompt KV!
```

Typical speedup: 2-5x for requests sharing a long common prefix.
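The reuse pattern above can be sketched with a toy prefix cache (illustrative only, not SGLang's actual radix-tree implementation): track previously served token sequences and report how many KV entries a new request could reuse from its longest cached prefix.

```python
def longest_shared_prefix(a, b):
    """Length of the common token-id prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyPrefixCache:
    """Illustrative stand-in for a radix cache: remembers served
    sequences and reports how many prefix tokens a new request
    could reuse (i.e. KV entries it would not recompute)."""

    def __init__(self):
        self.seen = []

    def reusable_tokens(self, tokens):
        best = max(
            (longest_shared_prefix(tokens, s) for s in self.seen),
            default=0,
        )
        self.seen.append(list(tokens))
        return best


system = list(range(100))            # 100-token shared system prompt
req_a = system + [201, 202, 203]     # user message A
req_b = system + [301, 302]          # user message B

cache = ToyPrefixCache()
print(cache.reusable_tokens(req_a))  # 0   (cold cache: compute all KV)
print(cache.reusable_tokens(req_b))  # 100 (system-prompt KV reused)
```

The second request recomputes only its 2 user-message tokens instead of all 102, which is where the shared-prefix speedup comes from.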
SGLang Programming Interface
```python
import sglang as sgl


@sgl.function
def multi_step_analysis(s, text):
    s += sgl.user(f"Analyze this text: {text}")
    s += sgl.assistant(sgl.gen("analysis", max_tokens=200))
    s += sgl.user("Based on your analysis, what are the key themes?")
    s += sgl.assistant(sgl.gen("themes", max_tokens=100))
    s += sgl.user("Rate the sentiment from 1-10")
    # Regex constrains the answer to 1-10
    s += sgl.assistant(sgl.gen("sentiment", max_tokens=5, regex=r"10|[1-9]"))


# Run
state = multi_step_analysis.run(text="Sample text for analysis")
print(state["analysis"])
print(state["themes"])
print(state["sentiment"])
```
Branching and Forking
```python
@sgl.function
def compare_perspectives(s, topic):
    s += sgl.system("You are a balanced analyst.")
    s += sgl.user(f"Discuss: {topic}")

    # Fork into two branches (parallel execution)
    fork = s.fork(2)
    fork[0] += sgl.user("Argue FOR this position")
    fork[0] += sgl.assistant(sgl.gen("for_argument", max_tokens=200))
    fork[1] += sgl.user("Argue AGAINST this position")
    fork[1] += sgl.assistant(sgl.gen("against_argument", max_tokens=200))

    # Rejoin with both perspectives
    s += sgl.user(f"""
FOR: {fork[0]['for_argument']}
AGAINST: {fork[1]['against_argument']}
Now give a balanced conclusion.
""")
    s += sgl.assistant(sgl.gen("conclusion", max_tokens=200))
```
Constrained Generation
```python
@sgl.function
def structured_output(s, question):
    s += sgl.user(question)
    s += sgl.assistant(
        sgl.gen(
            "answer",
            max_tokens=500,
            # JSON schema constraint
            json_schema={
                "type": "object",
                "properties": {
                    "answer": {"type": "string"},
                    "confidence": {"type": "number", "minimum": 0, "maximum": 1},
                    "sources": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["answer", "confidence"],
            },
        )
    )
```
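Even with schema-constrained decoding, it is cheap to validate the decoded string client-side before trusting it. Below is a minimal stdlib-only sketch mirroring the schema above; a production app would typically use the `jsonschema` package instead, and `check_answer_payload` is our own helper name, not an SGLang API.

```python
import json


def check_answer_payload(raw: str) -> dict:
    """Hand-rolled check mirroring the JSON schema used above."""
    obj = json.loads(raw)
    assert isinstance(obj, dict)
    assert isinstance(obj.get("answer"), str)  # required string
    conf = obj.get("confidence")
    assert isinstance(conf, (int, float)) and 0 <= conf <= 1  # required number in [0, 1]
    if "sources" in obj:  # optional array of strings
        assert all(isinstance(x, str) for x in obj["sources"])
    return obj


out = check_answer_payload('{"answer": "42", "confidence": 0.9, "sources": ["doc1"]}')
print(out["confidence"])  # 0.9
```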
Server Configuration
```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3-8B-Instruct \
  --port 30000 \
  --tp-size 2 \
  --mem-fraction-static 0.85 \
  --max-running-requests 256 \
  --context-length 8192 \
  --quantization awq \
  --enable-torch-compile \
  --chunked-prefill-size 8192
```

(Inline comments after a trailing `\` break shell line continuation; each flag is documented in the reference table below.)
Configuration Reference
| Parameter | Default | Description |
|---|---|---|
| `--tp-size` | 1 | Tensor parallelism degree |
| `--mem-fraction-static` | 0.85 | Fraction of GPU memory for weights and KV cache |
| `--max-running-requests` | 256 | Maximum concurrent requests |
| `--context-length` | Model default | Maximum context length |
| `--quantization` | None | `awq`, `gptq`, `fp8`, etc. |
| `--enable-torch-compile` | False | Torch compilation for speed |
| `--schedule-policy` | `lpm` | Scheduling: `lpm`, `random`, `fcfs` |
| `--chunked-prefill-size` | 8192 | Prefill chunk size (tokens) |
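When launching several configurations from a sweep script, the flags above can be assembled programmatically. A small sketch; the `launch_command` helper is ours, not part of SGLang:

```python
import shlex


def launch_command(model: str, port: int = 30000, **flags) -> str:
    """Build a `sglang.launch_server` command line from keyword flags.
    Underscores become dashes; a True value renders as a bare switch."""
    parts = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model,
        "--port", str(port),
    ]
    for key, val in flags.items():
        flag = "--" + key.replace("_", "-")
        if val is True:
            parts.append(flag)
        else:
            parts += [flag, str(val)]
    return shlex.join(parts)


cmd = launch_command(
    "meta-llama/Llama-3-8B-Instruct",
    tp_size=2,
    mem_fraction_static=0.85,
    enable_torch_compile=True,
)
print(cmd)
```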
Best Practices
- Use SGLang for shared-prefix workloads — RadixAttention gives biggest gains with shared system prompts
- Use the SGLang programming interface — More efficient than chaining API calls
- Enable torch compile — 10-20% speedup after warmup
- Use constrained generation — JSON schema or regex for reliable structured output
- Fork for parallel analysis — Use `s.fork()` for multiple perspectives or translations
- Set `--mem-fraction-static` to 0.85 — Reserve some memory for activations
- Monitor the RadixAttention hit rate — A higher hit rate means more prefix cache reuse
- Batch similar requests — Similar prefixes maximize cache efficiency
- Use chunked prefill — Better latency for long prompts
- Profile with the built-in monitoring — Metrics are exposed at `/metrics`
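The last two practices can be acted on by scraping the Prometheus-style text that metrics endpoints expose. A stdlib-only sketch with a made-up snapshot; the exact metric names vary by SGLang version, so `sglang_cache_hit_rate` below is a placeholder:

```python
def parse_prometheus(text: str) -> dict:
    """Parse simple `name value` lines from Prometheus text format,
    skipping comments and labeled series for brevity."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition(" ")
        if "{" not in name:  # ignore labeled series in this sketch
            metrics[name] = float(value)
    return metrics


# Hypothetical /metrics snapshot (metric names are placeholders)
sample = """
# HELP sglang_cache_hit_rate Prefix cache hit rate
sglang_cache_hit_rate 0.73
sglang_num_running_requests 12
"""

m = parse_prometheus(sample)
print(m["sglang_cache_hit_rate"])  # 0.73
```

In practice you would fetch `http://localhost:30000/metrics` on a timer and alert when the hit rate drops, which usually signals drifting system prompts or poorly batched requests.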
Troubleshooting
Low cache hit rate
```bash
# Ensure shared system prompts are identical (byte-for-byte)
# Sort requests by prefix similarity
# Use longer system prompts to maximize the shared prefix
```
High memory usage
```bash
# Reduce static memory fraction
--mem-fraction-static 0.75

# Reduce max running requests
--max-running-requests 128

# Use quantization
--quantization awq
```
Slow first request
```bash
# Enable torch compile (slow first request, faster subsequent)
--enable-torch-compile

# Or pre-warm with a dummy request
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "warmup"}]}'
```