LLM Optimization Platform -- Inference Speed, Memory, and Cost
Overview
A comprehensive skill for optimizing large language model inference across speed, memory efficiency, and deployment cost. This covers the full optimization stack: quantization (GPTQ, AWQ, bitsandbytes, GGUF), attention optimization (Flash Attention, PagedAttention), batching strategies (continuous batching, dynamic batching), speculative decoding, model pruning and distillation, and production inference frameworks (vLLM, TensorRT-LLM, SGLang). It also covers NVIDIA Model Optimizer for unified quantization and deployment workflows. This skill enables deploying LLMs on hardware ranging from consumer laptops to multi-GPU production clusters while maximizing throughput and minimizing latency.
When to Use
- Reducing GPU memory requirements to run larger models on available hardware
- Improving inference throughput for production LLM serving
- Reducing per-token latency for real-time applications
- Cutting inference costs through quantization or efficient batching
- Deploying models on edge devices, laptops, or consumer GPUs
- Selecting the right inference framework for your deployment target
- Optimizing transformer attention for long-context workloads
Quick Start
vLLM -- Production Inference Server
```bash
pip install vllm
```

```python
from vllm import LLM, SamplingParams

# Load model with automatic optimization
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dtype="auto",                  # Auto-select precision
    gpu_memory_utilization=0.9,    # Use 90% of GPU memory
    max_model_len=8192,            # Context window
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# Batch inference with continuous batching
prompts = ["Explain quantum computing.", "Write a haiku about Python."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
vLLM OpenAI-Compatible Server
```bash
# Start server
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --dtype auto \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --port 8000
```

```python
# Use with OpenAI client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
```
Core Concepts
Optimization Techniques Overview
| Technique | Speedup | Memory Savings | Quality Impact | Complexity |
|---|---|---|---|---|
| FP16/BF16 | 1.5-2x | 50% | None | Low |
| INT8 Quantization | 1.5-2x | 50% | <1% loss | Low |
| INT4 Quantization | 2-3x | 75% | 1-3% loss | Low |
| Flash Attention | 2-4x | 5-20x (attention) | None | Low |
| PagedAttention | 1x | 2-4x (KV cache) | None | Low |
| Continuous Batching | 2-10x throughput | Shared | None | Medium |
| Speculative Decoding | 2-3x latency | Small overhead | None | Medium |
| Pruning + Distillation | 2-4x | 50-75% | 2-5% loss | High |
| TensorRT Compilation | 2-5x | Variable | None | High |
Quantization Methods
```python
# Method 1: bitsandbytes (simplest, load-time quantization)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=config,
    device_map="auto",
)
# 8B model: 16 GB FP16 -> 4 GB INT4
```
```python
# Method 2: AWQ (calibration-based, better quality)
from awq import AutoAWQForCausalLM

model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-AWQ",
    fuse_layers=True,  # Fuse layers for faster inference
    device_map="auto",
)
```
```python
# Method 3: GPTQ (calibration-based, established)
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto",
)
```
Quantization Method Comparison
| Method | Calibration | GPU Required | Speed | Quality | Best For |
|---|---|---|---|---|---|
| bitsandbytes | No | Load-time | Good | Good | Quick experimentation |
| AWQ | Yes | Quantize-time | Best | Better | Production GPU serving |
| GPTQ | Yes | Quantize-time | Good | Better | Production GPU serving |
| GGUF | Optional (imatrix) | No | Good | Good | CPU/Apple Silicon |
| FP8 | No | H100+ | Best | Excellent | H100 deployments |
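The quality numbers above reflect how well weight distributions tolerate coarse rounding. As a toy, library-free illustration of the idea (symmetric per-tensor INT8 quantization with NumPy; real methods like AWQ and GPTQ are per-group and calibration-aware, so this understates their quality), the 4x memory reduction and the scale of reconstruction error can be seen directly:

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8: store one FP32 scale plus int8 values."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(4096, 4096)).astype(np.float32)  # typical weight scale

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

rel_error = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"Memory: {w.nbytes / 1e6:.0f} MB -> {q.nbytes / 1e6:.0f} MB")
print(f"Relative reconstruction error: {rel_error:.4%}")
```

Per-tensor scaling like this is what breaks down at INT4, which is why the lower-bit methods in the table switch to per-group scales and activation-aware calibration.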
Inference Frameworks
vLLM Features
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,       # Multi-GPU parallelism
    quantization="awq",           # Use quantized model
    enforce_eager=False,          # Enable CUDA graphs
    enable_prefix_caching=True,   # Cache common prefixes
)

# PagedAttention: manages KV cache like virtual memory
# - No fragmentation from variable-length sequences
# - Supports beam search with shared prefixes
# - 2-4x memory efficiency for KV cache

# Continuous batching: add/remove requests dynamically
# - No waiting for batch completion
# - Near-optimal GPU utilization
```
TensorRT-LLM
```bash
# Install
pip install tensorrt-llm

# Convert and optimize model
trtllm-build \
  --checkpoint_dir ./llama-checkpoint \
  --output_dir ./llama-engine \
  --gemm_plugin float16 \
  --max_batch_size 64 \
  --max_input_len 2048 \
  --max_seq_len 4096
```
```python
from tensorrt_llm.runtime import ModelRunner

runner = ModelRunner.from_dir("./llama-engine")
outputs = runner.generate(
    batch_input_ids=input_ids,
    max_new_tokens=256,
    temperature=0.7,
)
```
SGLang -- Fast Structured Generation
```bash
pip install "sglang[all]"

# Start server
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 8000
```
```python
import sglang as sgl

# Point the SDK at the running server
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:8000"))

@sgl.function
def qa_pipeline(s, question):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))

state = qa_pipeline.run(question="What is RAG?")
print(state["answer"])
```
Speculative Decoding
Use a small draft model to propose tokens, verified by the large model in parallel:
```python
from vllm import LLM, SamplingParams

# vLLM with speculative decoding
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",
    num_speculative_tokens=5,  # Draft 5 tokens at a time
    tensor_parallel_size=4,
)

# ~2-3x latency reduction for code and structured output
outputs = llm.generate(
    ["Write a Python function to sort a list."],
    SamplingParams(temperature=0.0, max_tokens=256),
)
```
NVIDIA Model Optimizer
```python
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)

# Quantize to INT4 AWQ. `calibrate` is a user-supplied forward loop
# that runs a small calibration dataset through the model.
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop=calibrate)

# Export for vLLM / TensorRT-LLM deployment
mtq.export(model, "quantized-model")
```
Memory Estimation
```python
def estimate_memory_gb(params_billions: float, precision: str, overhead: float = 1.2) -> float:
    """Estimate GPU memory for model loading."""
    bytes_per_param = {
        "fp32": 4, "fp16": 2, "bf16": 2,
        "int8": 1, "int4": 0.5, "fp8": 1,
    }
    base = params_billions * bytes_per_param[precision]
    return base * overhead  # 20% overhead for activations, KV cache

# Examples
print(f"Llama 3.1 8B FP16:  {estimate_memory_gb(8, 'fp16'):.1f} GB")   # 19.2 GB
print(f"Llama 3.1 8B INT4:  {estimate_memory_gb(8, 'int4'):.1f} GB")   # 4.8 GB
print(f"Llama 3.1 70B FP16: {estimate_memory_gb(70, 'fp16'):.1f} GB")  # 168.0 GB
print(f"Llama 3.1 70B INT4: {estimate_memory_gb(70, 'int4'):.1f} GB")  # 42.0 GB
```
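Weight memory is only part of the budget: the KV cache grows linearly with sequence length and batch size, and often dominates at high concurrency. A rough companion estimator, using the standard 2 (K and V) x layers x KV-heads x head-dim x tokens x bytes formula; the Llama 3.1 8B figures below are its published GQA configuration (32 layers, 8 KV heads, head dim 128):

```python
def estimate_kv_cache_gb(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_value: int = 2,  # FP16/BF16 cache
) -> float:
    """KV cache size: 2 tensors (K, V) per layer, per KV head, per token."""
    total = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value
    return total / 1e9

# Llama 3.1 8B (GQA: 8 KV heads instead of 32 query heads)
print(f"8B, 8k ctx, batch 1:  {estimate_kv_cache_gb(32, 8, 128, 8192, 1):.2f} GB")
print(f"8B, 8k ctx, batch 32: {estimate_kv_cache_gb(32, 8, 128, 8192, 32):.2f} GB")
```

At batch 32 the cache alone exceeds the INT4 weight footprint, which is why PagedAttention's cache management matters so much for serving throughput.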
Hardware Sizing Guide
| Model | FP16 VRAM | INT4 VRAM | Example GPU (INT4) |
|---|---|---|---|
| 3B | 6 GB | 2 GB | RTX 3060 (12 GB) |
| 7-8B | 16 GB | 4 GB | RTX 4070 (12 GB) |
| 13B | 26 GB | 7 GB | RTX 4090 (24 GB) |
| 34B | 68 GB | 17 GB | A100 40 GB |
| 70B | 140 GB | 35 GB | 2x A100 80 GB |
| 405B | 810 GB | 203 GB | 8x A100 80 GB |
Configuration Reference
vLLM Server Parameters
| Parameter | Default | Description |
|---|---|---|
| `--model` | Required | HuggingFace model ID or path |
| `--dtype` | auto | Data type: auto, float16, bfloat16, float32 |
| `--quantization` | None | Quantization method: awq, gptq, squeezellm |
| `--tensor-parallel-size` | 1 | Number of GPUs for tensor parallelism |
| `--gpu-memory-utilization` | 0.9 | Fraction of GPU memory to use |
| `--max-model-len` | Model default | Maximum sequence length |
| `--max-num-seqs` | 256 | Maximum concurrent sequences |
| `--enable-prefix-caching` | False | Cache common prompt prefixes |
| `--speculative-model` | None | Draft model for speculative decoding |
| `--num-speculative-tokens` | None | Tokens per draft step |
Best Practices
- Start with quantization -- INT4 quantization (AWQ or bitsandbytes NF4) gives the highest impact with the least effort, reducing memory by 75% with minimal quality loss.
- Use vLLM or SGLang for production serving -- These frameworks provide continuous batching, PagedAttention, and prefix caching out of the box, dramatically outperforming naive inference.
- Match quantization method to deployment target -- Use AWQ/GPTQ for GPU serving, GGUF for CPU/Apple Silicon, bitsandbytes for quick experimentation, FP8 for H100.
- Enable Flash Attention everywhere -- PyTorch 2.2+ includes Flash Attention via `scaled_dot_product_attention`. Ensure your framework uses it for 2-4x attention speedup.
- Profile before optimizing -- Use `torch.profiler` or `nsys` to identify actual bottlenecks (memory, compute, I/O) before applying optimization techniques.
- Size your hardware to the quantized model -- Calculate INT4 memory requirements and add 20-30% overhead for KV cache and activations when selecting GPU instances.
- Use speculative decoding for latency-sensitive tasks -- For single-request latency (not throughput), speculative decoding with a small draft model can provide 2-3x speedup.
- Enable prefix caching for repeated system prompts -- If many requests share the same system prompt or few-shot examples, prefix caching avoids redundant computation.
- Benchmark with realistic workloads -- Test with production-representative prompt lengths, batch sizes, and concurrency. Synthetic benchmarks often overestimate throughput.
- Consider distillation for maximum efficiency -- If you control the training pipeline, distilling a large model into a smaller one often outperforms quantization of the large model.
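For the benchmarking advice above, a minimal framework-agnostic sketch: record per-request timestamps however your client exposes them, then derive mean time-to-first-token (TTFT) and aggregate throughput. `RequestTiming` and `summarize` are illustrative names, not part of vLLM or any other library:

```python
from dataclasses import dataclass

@dataclass
class RequestTiming:
    start: float         # request submitted (seconds)
    first_token: float   # first token received
    end: float           # last token received
    output_tokens: int

def summarize(timings: list[RequestTiming]) -> dict[str, float]:
    """Aggregate mean TTFT and end-to-end tokens/sec across concurrent requests."""
    ttfts = [t.first_token - t.start for t in timings]
    total_tokens = sum(t.output_tokens for t in timings)
    wall = max(t.end for t in timings) - min(t.start for t in timings)
    return {
        "mean_ttft_s": sum(ttfts) / len(ttfts),
        "throughput_tok_s": total_tokens / wall,
    }

# Two overlapping requests of 256 output tokens each
timings = [
    RequestTiming(start=0.0, first_token=0.12, end=4.0, output_tokens=256),
    RequestTiming(start=0.5, first_token=0.68, end=4.5, output_tokens=256),
]
print(summarize(timings))
```

Measuring over wall-clock time with overlapping requests is what makes the number honest: summing per-request throughput would double-count time the GPU spent serving both.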
Troubleshooting
CUDA out of memory when loading model:
Reduce gpu_memory_utilization, use quantization (INT4/INT8), enable tensor parallelism across multiple GPUs, or reduce max_model_len to shrink KV cache allocation.
vLLM throughput lower than expected:
Ensure enforce_eager=False to enable CUDA graphs. Increase max_num_seqs to allow more concurrent batching. Check that GPU utilization is >90% with nvidia-smi.
Quantized model produces worse output quality:
Switch from INT4 to INT8 quantization. Use AWQ with calibration data instead of bitsandbytes NF4. For GGUF, use importance matrix (imatrix) quantization for better low-bit quality.
Speculative decoding not improving latency: Speculative decoding works best when the draft model's acceptance rate is high. Use a draft model from the same family. For creative/high-temperature generation, acceptance rates drop and benefits diminish.
TensorRT-LLM build fails: Verify CUDA toolkit version matches TensorRT requirements. Ensure sufficient disk space for the compiled engine (can be 2-3x model size). Check that the model architecture is supported.