
Advanced Optimization Platform

Enterprise-grade skill for activation-aware weight quantization and LLM inference optimization. Includes structured workflows, validation checks, and reusable patterns for AI research.


LLM Optimization Platform -- Inference Speed, Memory, and Cost

Overview

A comprehensive skill for optimizing large language model inference across speed, memory efficiency, and deployment cost. This covers the full optimization stack: quantization (GPTQ, AWQ, bitsandbytes, GGUF), attention optimization (Flash Attention, PagedAttention), batching strategies (continuous batching, dynamic batching), speculative decoding, model pruning and distillation, and production inference frameworks (vLLM, TensorRT-LLM, SGLang). It also covers NVIDIA Model Optimizer for unified quantization and deployment workflows. This skill enables deploying LLMs on hardware ranging from consumer laptops to multi-GPU production clusters while maximizing throughput and minimizing latency.

When to Use

  • Reducing GPU memory requirements to run larger models on available hardware
  • Improving inference throughput for production LLM serving
  • Reducing per-token latency for real-time applications
  • Cutting inference costs through quantization or efficient batching
  • Deploying models on edge devices, laptops, or consumer GPUs
  • Selecting the right inference framework for your deployment target
  • Optimizing transformer attention for long-context workloads

Quick Start

vLLM -- Production Inference Server

```bash
pip install vllm
```

```python
from vllm import LLM, SamplingParams

# Load model with automatic optimization
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dtype="auto",                # Auto-select precision
    gpu_memory_utilization=0.9,  # Use 90% of GPU memory
    max_model_len=8192,          # Context window
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# Batch inference with continuous batching
prompts = ["Explain quantum computing.", "Write a haiku about Python."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

vLLM OpenAI-Compatible Server

```bash
# Start server
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --dtype auto \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9 \
    --port 8000
```

```python
# Use with the OpenAI client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
```

Core Concepts

Optimization Techniques Overview

| Technique | Speedup | Memory Savings | Quality Impact | Complexity |
|---|---|---|---|---|
| FP16/BF16 | 1.5-2x | 50% | None | Low |
| INT8 Quantization | 1.5-2x | 50% | <1% loss | Low |
| INT4 Quantization | 2-3x | 75% | 1-3% loss | Low |
| Flash Attention | 2-4x | 5-20x (attention) | None | Low |
| PagedAttention | 1x | 2-4x (KV cache) | None | Low |
| Continuous Batching | 2-10x throughput | Shared | None | Medium |
| Speculative Decoding | 2-3x latency | Small overhead | None | Medium |
| Pruning + Distillation | 2-4x | 50-75% | 2-5% loss | High |
| TensorRT Compilation | 2-5x | Variable | None | High |

Quantization Methods

```python
# Method 1: bitsandbytes (simplest, load-time quantization)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=config,
    device_map="auto",
)
# 8B model: 16 GB FP16 -> ~4 GB INT4
```

```python
# Method 2: AWQ (calibration-based, better quality)
from awq import AutoAWQForCausalLM

model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-AWQ",
    fuse_layers=True,  # Fuse layers for faster inference
    device_map="auto",
)
```

```python
# Method 3: GPTQ (calibration-based, established)
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto",
)
```

Quantization Method Comparison

| Method | Calibration | GPU Required | Speed | Quality | Best For |
|---|---|---|---|---|---|
| bitsandbytes | No | At load time | Good | Good | Quick experimentation |
| AWQ | Yes | At quantize time | Best | Better | Production GPU serving |
| GPTQ | Yes | At quantize time | Good | Better | Production GPU serving |
| GGUF | Optional (imatrix) | No | Good | Good | CPU/Apple Silicon |
| FP8 | No | H100+ | Best | Excellent | H100 deployments |
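The selection logic in the table above can be captured as a small reusable helper. This is an illustrative sketch, not part of any library; the function name and target categories are assumptions.

```python
def pick_quant_method(target: str, has_h100: bool = False) -> str:
    """Map a deployment target to a quantization method, per the table above."""
    if has_h100:
        return "fp8"  # Best speed and quality on H100-class GPUs
    table = {
        "experiment": "bitsandbytes",  # Load-time quantization, no calibration
        "gpu-serving": "awq",          # Calibrated, fast GPU inference
        "cpu": "gguf",                 # llama.cpp ecosystem, CPU/Apple Silicon
    }
    return table.get(target, "awq")

print(pick_quant_method("cpu"))                          # gguf
print(pick_quant_method("gpu-serving", has_h100=True))   # fp8
```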

Inference Frameworks

vLLM Features

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,      # Multi-GPU parallelism
    quantization="awq",          # Use a quantized model
    enforce_eager=False,         # Enable CUDA graphs
    enable_prefix_caching=True,  # Cache common prefixes
)

# PagedAttention: manages the KV cache like virtual memory
# - No fragmentation from variable-length sequences
# - Supports beam search with shared prefixes
# - 2-4x memory efficiency for the KV cache

# Continuous batching: add/remove requests dynamically
# - No waiting for batch completion
# - Near-optimal GPU utilization
```
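PagedAttention's savings are easier to reason about with the KV-cache arithmetic spelled out. A minimal sketch, assuming Llama 3.1 8B's published shape (32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16 cache):

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache per token: keys + values, across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Llama 3.1 8B (GQA: 8 KV heads), FP16 cache
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
print(per_token)                 # 131072 bytes = 128 KiB per token
print(per_token * 8192 / 2**30)  # 1.0 GiB for one full 8192-token sequence
```

At this rate, a handful of long concurrent sequences consume several gigabytes of cache, which is why PagedAttention's fragmentation-free allocation matters at high concurrency.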

TensorRT-LLM

```bash
# Install
pip install tensorrt-llm

# Convert and optimize the model
trtllm-build \
    --checkpoint_dir ./llama-checkpoint \
    --output_dir ./llama-engine \
    --gemm_plugin float16 \
    --max_batch_size 64 \
    --max_input_len 2048 \
    --max_seq_len 4096
```

```python
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner

runner = ModelRunner.from_dir("./llama-engine")
outputs = runner.generate(
    batch_input_ids=input_ids,
    max_new_tokens=256,
    temperature=0.7,
)
```

SGLang -- Fast Structured Generation

```bash
pip install sglang[all]

# Start server
python -m sglang.launch_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --port 8000
```

```python
import sglang as sgl

@sgl.function
def qa_pipeline(s, question):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))

state = qa_pipeline.run(question="What is RAG?")
print(state["answer"])
```

Speculative Decoding

Use a small draft model to propose tokens, verified by the large model in parallel:

```python
from vllm import LLM, SamplingParams

# vLLM with speculative decoding
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",
    num_speculative_tokens=5,  # Draft 5 tokens at a time
    tensor_parallel_size=4,
)

# ~2-3x latency reduction for code and structured output
outputs = llm.generate(
    ["Write a Python function to sort a list."],
    SamplingParams(temperature=0.0, max_tokens=256),
)
```
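The speedup can be estimated from the draft model's acceptance rate before committing GPUs to it. A sketch of the standard expected-tokens-per-step analysis from the speculative sampling literature; the function and the illustrative acceptance rates are assumptions, not a vLLM API:

```python
def expected_tokens_per_step(accept_rate: float, k: int) -> float:
    """Expected tokens committed per target-model forward pass,
    with k draft tokens and per-token acceptance probability accept_rate
    (treated as i.i.d.): (1 - a^(k+1)) / (1 - a)."""
    a = accept_rate
    if a >= 1.0:
        return k + 1.0
    return (1 - a ** (k + 1)) / (1 - a)

# With 5 draft tokens, as configured above:
print(round(expected_tokens_per_step(0.8, 5), 2))  # 3.69 tokens per pass
print(round(expected_tokens_per_step(0.5, 5), 2))  # 1.97 tokens per pass
```

This is why acceptance rate dominates: dropping from 0.8 to 0.5 roughly halves the benefit, and below that the draft model's overhead can erase it entirely.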

NVIDIA Model Optimizer

```python
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)

# Quantize to INT4 AWQ; `calibrate` is a user-supplied forward loop
# that runs the model over a small calibration dataset
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop=calibrate)

# Export for vLLM / TensorRT-LLM deployment
mtq.export(model, "quantized-model")
```

Memory Estimation

```python
def estimate_memory_gb(params_billions: float, precision: str,
                       overhead: float = 1.2) -> float:
    """Estimate GPU memory for model loading."""
    bytes_per_param = {
        "fp32": 4, "fp16": 2, "bf16": 2,
        "int8": 1, "int4": 0.5, "fp8": 1,
    }
    base = params_billions * bytes_per_param[precision]
    return base * overhead  # 20% overhead for activations, KV cache

# Examples
print(f"Llama 3.1 8B FP16:  {estimate_memory_gb(8, 'fp16'):.1f} GB")   # 19.2 GB
print(f"Llama 3.1 8B INT4:  {estimate_memory_gb(8, 'int4'):.1f} GB")   # 4.8 GB
print(f"Llama 3.1 70B FP16: {estimate_memory_gb(70, 'fp16'):.1f} GB")  # 168.0 GB
print(f"Llama 3.1 70B INT4: {estimate_memory_gb(70, 'int4'):.1f} GB")  # 42.0 GB
```

Hardware Sizing Guide

| Model | FP16 VRAM | INT4 VRAM | Recommended GPU |
|---|---|---|---|
| 3B | 6 GB | 2 GB | RTX 3060 (12 GB) |
| 7-8B | 16 GB | 4 GB | RTX 4070 (12 GB) |
| 13B | 26 GB | 7 GB | RTX 4090 (24 GB) |
| 34B | 68 GB | 17 GB | A100 40 GB |
| 70B | 140 GB | 35 GB | 2x A100 80 GB |
| 405B | 810 GB | 203 GB | 8x A100 80 GB |

Recommended GPUs assume INT4 quantization wherever the FP16 footprint exceeds the listed card's VRAM.
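The sizing table can be checked programmatically using the same bytes-per-parameter arithmetic as estimate_memory_gb above. A sketch with a slightly larger 25% headroom for KV cache and activations; the function name and thresholds are illustrative:

```python
def fits(params_billions: float, precision: str, vram_gb: float,
         overhead: float = 1.25) -> bool:
    """Check whether a model fits a GPU, with 25% headroom
    for KV cache and activations."""
    bytes_per_param = {"fp32": 4, "fp16": 2, "bf16": 2,
                       "int8": 1, "int4": 0.5, "fp8": 1}
    return params_billions * bytes_per_param[precision] * overhead <= vram_gb

print(fits(8, "fp16", 24))  # True  -- 8B FP16 fits a 24 GB RTX 4090
print(fits(8, "fp16", 12))  # False -- too big for a 12 GB card
print(fits(8, "int4", 12))  # True  -- INT4 brings it within budget
```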

Configuration Reference

vLLM Server Parameters

| Parameter | Default | Description |
|---|---|---|
| --model | Required | HuggingFace model ID or path |
| --dtype | auto | Data type: auto, float16, bfloat16, float32 |
| --quantization | None | Quantization method: awq, gptq, squeezellm |
| --tensor-parallel-size | 1 | Number of GPUs for tensor parallelism |
| --gpu-memory-utilization | 0.9 | Fraction of GPU memory to use |
| --max-model-len | Model default | Maximum sequence length |
| --max-num-seqs | 256 | Maximum concurrent sequences |
| --enable-prefix-caching | False | Cache common prompt prefixes |
| --speculative-model | None | Draft model for speculative decoding |
| --num-speculative-tokens | None | Tokens per draft step |

Best Practices

  1. Start with quantization -- INT4 quantization (AWQ or bitsandbytes NF4) gives the highest impact with the least effort, reducing memory by 75% with minimal quality loss.
  2. Use vLLM or SGLang for production serving -- These frameworks provide continuous batching, PagedAttention, and prefix caching out of the box, dramatically outperforming naive inference.
  3. Match quantization method to deployment target -- Use AWQ/GPTQ for GPU serving, GGUF for CPU/Apple Silicon, bitsandbytes for quick experimentation, FP8 for H100.
  4. Enable Flash Attention everywhere -- PyTorch 2.2+ includes Flash Attention via scaled_dot_product_attention. Ensure your framework uses it for 2-4x attention speedup.
  5. Profile before optimizing -- Use torch.profiler or nsys to identify actual bottlenecks (memory, compute, I/O) before applying optimization techniques.
  6. Size your hardware to the quantized model -- Calculate INT4 memory requirements and add 20-30% overhead for KV cache and activations when selecting GPU instances.
  7. Use speculative decoding for latency-sensitive tasks -- For single-request latency (not throughput), speculative decoding with a small draft model can provide 2-3x speedup.
  8. Enable prefix caching for repeated system prompts -- If many requests share the same system prompt or few-shot examples, prefix caching avoids redundant computation.
  9. Benchmark with realistic workloads -- Test with production-representative prompt lengths, batch sizes, and concurrency. Synthetic benchmarks often overestimate throughput.
  10. Consider distillation for maximum efficiency -- If you control the training pipeline, distilling a large model into a smaller one often outperforms quantization of the large model.
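The benchmarking advice above reduces to a few aggregate numbers: total throughput, mean request latency, and mean time per output token. A minimal sketch of that bookkeeping; the record format and field names are illustrative, not from any benchmarking tool:

```python
def summarize_run(records, wall_clock_s):
    """Aggregate a benchmark run. records: (prompt_tokens, output_tokens,
    latency_s) tuples; wall_clock_s: total elapsed time for the batch."""
    total_out = sum(r[1] for r in records)
    return {
        "throughput_tok_s": total_out / wall_clock_s,
        "mean_latency_s": sum(r[2] for r in records) / len(records),
        "mean_time_per_output_token_ms":
            1000 * sum(r[2] / r[1] for r in records) / len(records),
    }

# Two concurrent requests finishing within a 4-second window
stats = summarize_run([(512, 128, 2.0), (1024, 256, 3.5)], wall_clock_s=4.0)
print(stats["throughput_tok_s"])  # 96.0 tokens/s
```

Note that throughput uses wall-clock time for the whole batch, not the sum of per-request latencies; with continuous batching those differ substantially.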

Troubleshooting

CUDA out of memory when loading model: Reduce gpu_memory_utilization, use quantization (INT4/INT8), enable tensor parallelism across multiple GPUs, or reduce max_model_len to shrink KV cache allocation.

vLLM throughput lower than expected: Ensure enforce_eager=False to enable CUDA graphs. Increase max_num_seqs to allow more concurrent batching. Check that GPU utilization is >90% with nvidia-smi.

Quantized model produces worse output quality: Switch from INT4 to INT8 quantization. Use AWQ with calibration data instead of bitsandbytes NF4. For GGUF, use importance matrix (imatrix) quantization for better low-bit quality.

Speculative decoding not improving latency: Speculative decoding works best when the draft model's acceptance rate is high. Use a draft model from the same family. For creative/high-temperature generation, acceptance rates drop and benefits diminish.

TensorRT-LLM build fails: Verify CUDA toolkit version matches TensorRT requirements. Ensure sufficient disk space for the compiled engine (can be 2-3x model size). Check that the model architecture is supported.
