
Complete LLM Inference Serving with SGLang

Overview

A comprehensive skill for high-performance LLM serving using SGLang, an inference engine that targets high throughput through RadixAttention (automatic KV cache reuse across requests), continuous batching, and optimized CUDA kernels. SGLang exposes an OpenAI-compatible API plus a frontend programming interface for complex multi-step LLM programs with branching, forking, and constrained generation.

When to Use

  • Need highest possible inference throughput
  • Serving structured generation (JSON, regex-constrained)
  • Building complex multi-step LLM programs
  • Need automatic KV cache reuse across requests
  • Serving models with shared system prompts
  • Need efficient prefix/suffix caching

Quick Start

```bash
# Install
pip install "sglang[all]"

# Launch server
python -m sglang.launch_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --port 30000

# Use as an OpenAI-compatible API
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello"}]}'
```
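The same request can be built programmatically from any language, since the server speaks the OpenAI chat-completions protocol. A minimal sketch of the JSON body the curl command above sends (the `max_tokens` and `temperature` fields are optional additions for illustration):

```python
import json

# Request body for POST http://localhost:30000/v1/chat/completions.
# "default" routes to whatever model the server was launched with.
body = {
    "model": "default",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64,    # optional: cap the response length
    "temperature": 0.7,  # optional: sampling temperature
}
payload = json.dumps(body)
print(payload)
```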

RadixAttention

Standard KV Cache (vLLM):
Request 1: [System Prompt | User Message A] → compute all KV
Request 2: [System Prompt | User Message B] → recompute system prompt KV

RadixAttention (SGLang):
Request 1: [System Prompt | User Message A] → compute all KV
Request 2: [System Prompt | User Message B] → reuse system prompt KV!

Speedup: 2-5x for requests sharing common prefixes
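The mechanism can be illustrated with a toy prefix tree over token sequences: tokens already present in the tree stand in for cached KV entries, so only the novel suffix must be computed. This is an illustrative sketch of the idea, not SGLang's actual data structure:

```python
class PrefixTree:
    """Toy stand-in for RadixAttention's cache: each stored token
    represents one KV-cache entry that later requests can reuse."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Insert a token sequence; return (reused, computed) counts."""
        node, reused, computed = self.root, 0, 0
        for tok in tokens:
            if tok in node:
                reused += 1      # KV already cached for this prefix
            else:
                node[tok] = {}
                computed += 1    # new KV entry must be computed
            node = node[tok]
        return reused, computed

system = ["You", "are", "a", "helpful", "assistant", "."]
tree = PrefixTree()
r1 = tree.insert(system + ["User", "message", "A"])
r2 = tree.insert(system + ["User", "message", "B"])
print(r1)  # (0, 9): first request computes every KV entry
print(r2)  # (8, 1): second request reuses the shared prefix
```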

SGLang Programming Interface

```python
import sglang as sgl

@sgl.function
def multi_step_analysis(s, text):
    s += sgl.user(f"Analyze this text: {text}")
    s += sgl.assistant(sgl.gen("analysis", max_tokens=200))
    s += sgl.user("Based on your analysis, what are the key themes?")
    s += sgl.assistant(sgl.gen("themes", max_tokens=100))
    s += sgl.user("Rate the sentiment from 1-10")
    s += sgl.assistant(sgl.gen("sentiment", max_tokens=5, regex=r"[0-9]|10"))

# Point the frontend at a running server, then run
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = multi_step_analysis.run(text="Sample text for analysis")
print(state["analysis"])
print(state["themes"])
print(state["sentiment"])
```
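The `regex` argument constrains decoding so the sentiment field can only be a string matching the pattern, i.e. a value from 0 to 10. The pattern itself can be sanity-checked offline with Python's `re`:

```python
import re

pattern = r"[0-9]|10"
assert re.fullmatch(pattern, "7")          # single digits match
assert re.fullmatch(pattern, "10")         # so does "10"
assert re.fullmatch(pattern, "42") is None # out-of-range values do not
print("pattern accepts 0-10 only")
```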

Branching and Forking

```python
@sgl.function
def compare_perspectives(s, topic):
    s += sgl.system("You are a balanced analyst.")
    s += sgl.user(f"Discuss: {topic}")

    # Fork into two branches (parallel execution)
    fork = s.fork(2)
    fork[0] += sgl.user("Argue FOR this position")
    fork[0] += sgl.assistant(sgl.gen("for_argument", max_tokens=200))
    fork[1] += sgl.user("Argue AGAINST this position")
    fork[1] += sgl.assistant(sgl.gen("against_argument", max_tokens=200))

    # Rejoin with both perspectives
    s += sgl.user(f"""
FOR: {fork[0]["for_argument"]}
AGAINST: {fork[1]["against_argument"]}
Now give a balanced conclusion.
""")
    s += sgl.assistant(sgl.gen("conclusion", max_tokens=200))
```

Constrained Generation

```python
import json

@sgl.function
def structured_output(s, question):
    s += sgl.user(question)
    s += sgl.assistant(
        sgl.gen(
            "answer",
            max_tokens=500,
            # JSON schema constraint (passed as a JSON string)
            json_schema=json.dumps({
                "type": "object",
                "properties": {
                    "answer": {"type": "string"},
                    "confidence": {"type": "number", "minimum": 0, "maximum": 1},
                    "sources": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["answer", "confidence"],
            }),
        )
    )
```
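Because decoding is constrained to the schema, the returned string should always parse as JSON with the required keys present and bounds honored. A hedged offline check of that contract (the sample response below is illustrative, not real model output):

```python
import json

sample = '{"answer": "Paris", "confidence": 0.92, "sources": ["atlas"]}'
obj = json.loads(sample)             # schema-constrained output is valid JSON
for key in ("answer", "confidence"):
    assert key in obj                # "required" fields are always present
assert 0 <= obj["confidence"] <= 1   # honors the minimum/maximum bounds
print(obj["answer"])
```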

Server Configuration

```bash
# See the table below for what each flag does
python -m sglang.launch_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --port 30000 \
  --tp-size 2 \
  --mem-fraction-static 0.85 \
  --max-running-requests 256 \
  --context-length 8192 \
  --quantization awq \
  --enable-torch-compile \
  --chunked-prefill-size 8192
```

Configuration Reference

| Parameter | Default | Description |
|---|---|---|
| `--tp-size` | 1 | Tensor parallelism degree |
| `--mem-fraction-static` | 0.85 | GPU memory fraction for weights and KV cache |
| `--max-running-requests` | 256 | Maximum concurrent requests |
| `--context-length` | Model default | Maximum context length |
| `--quantization` | None | awq, gptq, fp8, etc. |
| `--enable-torch-compile` | False | Torch compilation for speed |
| `--schedule-policy` | lpm | Scheduling: lpm (longest prefix match), random, fcfs |
| `--chunked-prefill-size` | 8192 | Prefill chunk size |
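`--mem-fraction-static` bounds how much GPU memory the server pre-allocates for model weights plus KV cache; whatever the weights do not use becomes KV cache capacity. A rough back-of-envelope budget (the 80 GB GPU and 16 GB weight figures are illustrative assumptions, e.g. an A100 running an 8B model in fp16):

```python
gpu_mem_gb = 80.0           # assumed: 80 GB GPU
weights_gb = 16.0           # assumed: 8B params * 2 bytes (fp16)
mem_fraction_static = 0.85  # the launch flag above

static_pool_gb = gpu_mem_gb * mem_fraction_static  # pre-allocated pool
kv_cache_gb = static_pool_gb - weights_gb          # what remains for KV
print(f"KV cache budget: {kv_cache_gb:.1f} GB")    # 52.0 GB
```

Lowering the fraction (e.g. to 0.75) trades KV cache capacity for headroom against out-of-memory errors during activation spikes.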

Best Practices

  1. Use SGLang for shared-prefix workloads — RadixAttention gives biggest gains with shared system prompts
  2. Use the SGLang programming interface — More efficient than chaining API calls
  3. Enable torch compile — 10-20% speedup after warmup
  4. Use constrained generation — JSON schema or regex for reliable structured output
  5. Fork for parallel analysis — Use s.fork() for multiple perspectives or translations
  6. Set mem-fraction-static to 0.85 — Reserve some memory for activations
  7. Monitor RadixAttention hit rate — Higher hit rate = more prefix cache reuse
  8. Batch similar requests — Similar prefixes maximize cache efficiency
  9. Use chunked prefill — Better latency for long prompts
  10. Profile via the /metrics endpoint — Built-in server metrics for monitoring
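Practices 1 and 8 can be approximated client-side by simply sorting requests before dispatch, so requests sharing a system prompt arrive back-to-back and hit the warm prefix. A minimal sketch (the prompts are illustrative):

```python
requests = [
    ("You are a translator.", "Translate: hello"),
    ("You are a summarizer.", "Summarize: the report"),
    ("You are a translator.", "Translate: goodbye"),
    ("You are a summarizer.", "Summarize: the memo"),
]

# Lexicographic sort groups identical system prompts together, so
# consecutive requests share the longest possible cached prefix.
batched = sorted(requests, key=lambda r: r[0])
print([sys for sys, _ in batched])  # summarizers first, then translators
```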

Troubleshooting

Low cache hit rate

  • Ensure shared system prompts are identical (byte-for-byte)
  • Sort requests by prefix similarity
  • Use longer system prompts to maximize the shared prefix
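Any byte-level difference, even a stray space, splits the radix tree at the first differing byte, and everything after it must be recomputed. A quick illustration of why prompts must match byte-for-byte:

```python
import os

a = "You are a helpful assistant. Answer concisely."
b = "You are a  helpful assistant. Answer concisely."  # stray double space

shared = os.path.commonprefix([a, b])
print(repr(shared))  # only "You are a " is shared; the cache forks
# at the first differing byte, so the rest of both prompts is recomputed.
```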

High memory usage

```bash
# Reduce static memory fraction
--mem-fraction-static 0.75

# Reduce max running requests
--max-running-requests 128

# Use quantization
--quantization awq
```

Slow first request

```bash
# Enable torch compile (slow first request, faster subsequent)
--enable-torch-compile

# Or pre-warm with a dummy request
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "warmup"}]}'
```