
Complete LLM Inference Serving with SGLang

Overview

A comprehensive skill for high-performance LLM serving using SGLang, an inference engine that targets high throughput through RadixAttention (automatic KV cache reuse across requests), continuous batching, and optimized CUDA kernels. SGLang exposes an OpenAI-compatible API plus a frontend programming interface for complex multi-step LLM programs with branching, forking, and constrained generation.

When to Use

  • Need highest possible inference throughput
  • Serving structured generation (JSON, regex-constrained)
  • Building complex multi-step LLM programs
  • Need automatic KV cache reuse across requests
  • Serving models with shared system prompts
  • Need efficient prefix/suffix caching

Quick Start

```bash
# Install
pip install "sglang[all]"

# Launch server
python -m sglang.launch_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --port 30000

# Use as an OpenAI-compatible API
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello"}]}'
```
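The same request can be built programmatically from any language, since the server speaks the OpenAI chat-completions protocol. A minimal sketch of the JSON body the curl command above sends (the `max_tokens` and `temperature` fields are optional additions for illustration):

```python
import json

# Request body for POST http://localhost:30000/v1/chat/completions.
# "default" routes to whatever model the server was launched with.
body = {
    "model": "default",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64,    # optional: cap the response length
    "temperature": 0.7,  # optional: sampling temperature
}
payload = json.dumps(body)
print(payload)
```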

RadixAttention

Standard KV Cache (vLLM):
Request 1: [System Prompt | User Message A] → compute all KV
Request 2: [System Prompt | User Message B] → recompute system prompt KV

RadixAttention (SGLang):
Request 1: [System Prompt | User Message A] → compute all KV
Request 2: [System Prompt | User Message B] → reuse system prompt KV!

Speedup: 2-5x for requests sharing common prefixes
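The mechanism can be illustrated with a toy prefix tree over token sequences: tokens already present in the tree stand in for cached KV entries, so only the novel suffix must be computed. This is an illustrative sketch of the idea, not SGLang's actual data structure:

```python
class PrefixTree:
    """Toy stand-in for RadixAttention's cache: each stored token
    represents one KV-cache entry that later requests can reuse."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Insert a token sequence; return (reused, computed) counts."""
        node, reused, computed = self.root, 0, 0
        for tok in tokens:
            if tok in node:
                reused += 1      # KV already cached for this prefix
            else:
                node[tok] = {}
                computed += 1    # new KV entry must be computed
            node = node[tok]
        return reused, computed

system = ["You", "are", "a", "helpful", "assistant", "."]
tree = PrefixTree()
r1 = tree.insert(system + ["User", "message", "A"])
r2 = tree.insert(system + ["User", "message", "B"])
print(r1)  # (0, 9): first request computes every KV entry
print(r2)  # (8, 1): second request reuses the shared prefix
```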

SGLang Programming Interface

```python
import sglang as sgl

@sgl.function
def multi_step_analysis(s, text):
    s += sgl.user(f"Analyze this text: {text}")
    s += sgl.assistant(sgl.gen("analysis", max_tokens=200))
    s += sgl.user("Based on your analysis, what are the key themes?")
    s += sgl.assistant(sgl.gen("themes", max_tokens=100))
    s += sgl.user("Rate the sentiment from 1-10")
    s += sgl.assistant(sgl.gen("sentiment", max_tokens=5, regex=r"[0-9]|10"))

# Point the frontend at a running server, then run
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = multi_step_analysis.run(text="Sample text for analysis")
print(state["analysis"])
print(state["themes"])
print(state["sentiment"])
```
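The `regex` argument constrains decoding so the sentiment field can only be a string matching the pattern, i.e. a value from 0 to 10. The pattern itself can be sanity-checked offline with Python's `re`:

```python
import re

pattern = r"[0-9]|10"
assert re.fullmatch(pattern, "7")          # single digits match
assert re.fullmatch(pattern, "10")         # so does "10"
assert re.fullmatch(pattern, "42") is None # out-of-range values do not
print("pattern accepts 0-10 only")
```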

Branching and Forking

```python
@sgl.function
def compare_perspectives(s, topic):
    s += sgl.system("You are a balanced analyst.")
    s += sgl.user(f"Discuss: {topic}")

    # Fork into two branches (parallel execution)
    fork = s.fork(2)
    fork[0] += sgl.user("Argue FOR this position")
    fork[0] += sgl.assistant(sgl.gen("for_argument", max_tokens=200))
    fork[1] += sgl.user("Argue AGAINST this position")
    fork[1] += sgl.assistant(sgl.gen("against_argument", max_tokens=200))

    # Rejoin with both perspectives
    s += sgl.user(f"""
FOR: {fork[0]["for_argument"]}
AGAINST: {fork[1]["against_argument"]}
Now give a balanced conclusion.
""")
    s += sgl.assistant(sgl.gen("conclusion", max_tokens=200))
```

Constrained Generation

```python
import json

@sgl.function
def structured_output(s, question):
    s += sgl.user(question)
    s += sgl.assistant(
        sgl.gen(
            "answer",
            max_tokens=500,
            # JSON schema constraint (passed as a JSON string)
            json_schema=json.dumps({
                "type": "object",
                "properties": {
                    "answer": {"type": "string"},
                    "confidence": {"type": "number", "minimum": 0, "maximum": 1},
                    "sources": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["answer", "confidence"],
            }),
        )
    )
```
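Because decoding is constrained to the schema, the returned string should always parse as JSON with the required keys present and bounds honored. A hedged offline check of that contract (the sample response below is illustrative, not real model output):

```python
import json

sample = '{"answer": "Paris", "confidence": 0.92, "sources": ["atlas"]}'
obj = json.loads(sample)             # schema-constrained output is valid JSON
for key in ("answer", "confidence"):
    assert key in obj                # "required" fields are always present
assert 0 <= obj["confidence"] <= 1   # honors the minimum/maximum bounds
print(obj["answer"])
```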

Server Configuration

```bash
# See the table below for what each flag does
python -m sglang.launch_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --port 30000 \
  --tp-size 2 \
  --mem-fraction-static 0.85 \
  --max-running-requests 256 \
  --context-length 8192 \
  --quantization awq \
  --enable-torch-compile \
  --chunked-prefill-size 8192
```

Configuration Reference

| Parameter | Default | Description |
|---|---|---|
| `--tp-size` | 1 | Tensor parallelism degree |
| `--mem-fraction-static` | 0.85 | GPU memory fraction for weights and KV cache |
| `--max-running-requests` | 256 | Maximum concurrent requests |
| `--context-length` | Model default | Maximum context length |
| `--quantization` | None | awq, gptq, fp8, etc. |
| `--enable-torch-compile` | False | Torch compilation for speed |
| `--schedule-policy` | lpm | Scheduling: lpm (longest prefix match), random, fcfs |
| `--chunked-prefill-size` | 8192 | Prefill chunk size |
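`--mem-fraction-static` bounds how much GPU memory the server pre-allocates for model weights plus KV cache; whatever the weights do not use becomes KV cache capacity. A rough back-of-envelope budget (the 80 GB GPU and 16 GB weight figures are illustrative assumptions, e.g. an A100 running an 8B model in fp16):

```python
gpu_mem_gb = 80.0           # assumed: 80 GB GPU
weights_gb = 16.0           # assumed: 8B params * 2 bytes (fp16)
mem_fraction_static = 0.85  # the launch flag above

static_pool_gb = gpu_mem_gb * mem_fraction_static  # pre-allocated pool
kv_cache_gb = static_pool_gb - weights_gb          # what remains for KV
print(f"KV cache budget: {kv_cache_gb:.1f} GB")    # 52.0 GB
```

Lowering the fraction (e.g. to 0.75) trades KV cache capacity for headroom against out-of-memory errors during activation spikes.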

Best Practices

  1. Use SGLang for shared-prefix workloads — RadixAttention gives biggest gains with shared system prompts
  2. Use the SGLang programming interface — More efficient than chaining API calls
  3. Enable torch compile — 10-20% speedup after warmup
  4. Use constrained generation — JSON schema or regex for reliable structured output
  5. Fork for parallel analysis — Use s.fork() for multiple perspectives or translations
  6. Set mem-fraction-static to 0.85 — Reserve some memory for activations
  7. Monitor RadixAttention hit rate — Higher hit rate = more prefix cache reuse
  8. Batch similar requests — Similar prefixes maximize cache efficiency
  9. Use chunked prefill — Better latency for long prompts
  10. Profile via the /metrics endpoint — Built-in server metrics for monitoring
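Practices 1 and 8 can be approximated client-side by simply sorting requests before dispatch, so requests sharing a system prompt arrive back-to-back and hit the warm prefix. A minimal sketch (the prompts are illustrative):

```python
requests = [
    ("You are a translator.", "Translate: hello"),
    ("You are a summarizer.", "Summarize: the report"),
    ("You are a translator.", "Translate: goodbye"),
    ("You are a summarizer.", "Summarize: the memo"),
]

# Lexicographic sort groups identical system prompts together, so
# consecutive requests share the longest possible cached prefix.
batched = sorted(requests, key=lambda r: r[0])
print([sys for sys, _ in batched])  # summarizers first, then translators
```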

Troubleshooting

Low cache hit rate

  • Ensure shared system prompts are identical (byte-for-byte)
  • Sort requests by prefix similarity
  • Use longer system prompts to maximize the shared prefix
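Any byte-level difference, even a stray space, splits the radix tree at the first differing byte, and everything after it must be recomputed. A quick illustration of why prompts must match byte-for-byte:

```python
import os

a = "You are a helpful assistant. Answer concisely."
b = "You are a  helpful assistant. Answer concisely."  # stray double space

shared = os.path.commonprefix([a, b])
print(repr(shared))  # only "You are a " is shared; the cache forks
# at the first differing byte, so the rest of both prompts is recomputed.
```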

High memory usage

```bash
# Reduce static memory fraction
--mem-fraction-static 0.75

# Reduce max running requests
--max-running-requests 128

# Use quantization
--quantization awq
```

Slow first request

```bash
# Enable torch compile (slow first request, faster subsequent)
--enable-torch-compile

# Or pre-warm with a dummy request
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "warmup"}]}'
```