LLM Optimization Platform -- Inference Speed, Memory, and Cost
Overview
A comprehensive skill for optimizing large language model inference across speed, memory efficiency, and deployment cost. This covers the full optimization stack: quantization (GPTQ, AWQ, bitsandbytes, GGUF), attention optimization (Flash Attention, PagedAttention), batching strategies (continuous batching, dynamic batching), speculative decoding, model pruning and distillation, and production inference frameworks (vLLM, TensorRT-LLM, SGLang). It also covers NVIDIA Model Optimizer for unified quantization and deployment workflows. This skill enables deploying LLMs on hardware ranging from consumer laptops to multi-GPU production clusters while maximizing throughput and minimizing latency.
When to Use
- Reducing GPU memory requirements to run larger models on available hardware
- Improving inference throughput for production LLM serving
- Reducing per-token latency for real-time applications
- Cutting inference costs through quantization or efficient batching
- Deploying models on edge devices, laptops, or consumer GPUs
- Selecting the right inference framework for your deployment target
- Optimizing transformer attention for long-context workloads
Quick Start
vLLM -- Production Inference Server
```bash
pip install vllm
```

```python
from vllm import LLM, SamplingParams

# Load model with automatic optimization
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dtype="auto",                  # Auto-select precision
    gpu_memory_utilization=0.9,    # Use 90% of GPU memory
    max_model_len=8192,            # Context window
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# Batch inference with continuous batching
prompts = ["Explain quantum computing.", "Write a haiku about Python."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
vLLM OpenAI-Compatible Server
```bash
# Start server
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --dtype auto \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --port 8000
```

```python
# Use with OpenAI client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
```
Core Concepts
Optimization Techniques Overview
| Technique | Speedup | Memory Savings | Quality Impact | Complexity |
|---|---|---|---|---|
| FP16/BF16 | 1.5-2x | 50% | None | Low |
| INT8 Quantization | 1.5-2x | 50% | <1% loss | Low |
| INT4 Quantization | 2-3x | 75% | 1-3% loss | Low |
| Flash Attention | 2-4x | 5-20x (attention) | None | Low |
| PagedAttention | 1x | 2-4x (KV cache) | None | Low |
| Continuous Batching | 2-10x throughput | Shared | None | Medium |
| Speculative Decoding | 2-3x latency | Small overhead | None | Medium |
| Pruning + Distillation | 2-4x | 50-75% | 2-5% loss | High |
| TensorRT Compilation | 2-5x | Variable | None | High |
Quantization Methods
```python
# Method 1: bitsandbytes (simplest, load-time quantization)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=config,
    device_map="auto",
)
# 8B model: 16 GB FP16 -> 4 GB INT4
```
```python
# Method 2: AWQ (calibration-based, better quality)
from awq import AutoAWQForCausalLM

model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-AWQ",
    fuse_layers=True,  # Fuse layers for faster inference
    device_map="auto",
)
```
```python
# Method 3: GPTQ (calibration-based, established)
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto",
)
```
Quantization Method Comparison
| Method | Calibration | GPU Required | Speed | Quality | Best For |
|---|---|---|---|---|---|
| bitsandbytes | No | Load-time | Good | Good | Quick experimentation |
| AWQ | Yes | Quantize-time | Best | Better | Production GPU serving |
| GPTQ | Yes | Quantize-time | Good | Better | Production GPU serving |
| GGUF | Optional (imatrix) | No | Good | Good | CPU/Apple Silicon |
| FP8 | No | H100+ | Best | Excellent | H100 deployments |
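The quality numbers above reflect how well weight distributions tolerate coarse rounding. As a toy, library-free illustration of the idea (symmetric per-tensor INT8 quantization with NumPy; real methods like AWQ and GPTQ are per-group and calibration-aware, so this understates their quality), the 4x memory reduction and the scale of reconstruction error can be seen directly:

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8: store one FP32 scale plus int8 values."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(4096, 4096)).astype(np.float32)  # typical weight scale

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

rel_error = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"Memory: {w.nbytes / 1e6:.0f} MB -> {q.nbytes / 1e6:.0f} MB")
print(f"Relative reconstruction error: {rel_error:.4%}")
```

Per-tensor scaling like this is what breaks down at INT4, which is why the lower-bit methods in the table switch to per-group scales and activation-aware calibration.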
Inference Frameworks
vLLM Features
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,       # Multi-GPU parallelism
    quantization="awq",           # Use quantized model
    enforce_eager=False,          # Enable CUDA graphs
    enable_prefix_caching=True,   # Cache common prefixes
)

# PagedAttention: manages KV cache like virtual memory
# - No fragmentation from variable-length sequences
# - Supports beam search with shared prefixes
# - 2-4x memory efficiency for KV cache

# Continuous batching: add/remove requests dynamically
# - No waiting for batch completion
# - Near-optimal GPU utilization
```
TensorRT-LLM
```bash
# Install
pip install tensorrt-llm

# Convert and optimize model
trtllm-build \
  --checkpoint_dir ./llama-checkpoint \
  --output_dir ./llama-engine \
  --gemm_plugin float16 \
  --max_batch_size 64 \
  --max_input_len 2048 \
  --max_seq_len 4096
```
```python
from tensorrt_llm.runtime import ModelRunner

runner = ModelRunner.from_dir("./llama-engine")
outputs = runner.generate(
    batch_input_ids=input_ids,
    max_new_tokens=256,
    temperature=0.7,
)
```
SGLang -- Fast Structured Generation
```bash
pip install "sglang[all]"

# Start server
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 8000
```
```python
import sglang as sgl

# Point the SDK at the running server
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:8000"))

@sgl.function
def qa_pipeline(s, question):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))

state = qa_pipeline.run(question="What is RAG?")
print(state["answer"])
```
Speculative Decoding
Use a small draft model to propose tokens, verified by the large model in parallel:
```python
from vllm import LLM, SamplingParams

# vLLM with speculative decoding
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",
    num_speculative_tokens=5,  # Draft 5 tokens at a time
    tensor_parallel_size=4,
)

# ~2-3x latency reduction for code and structured output
outputs = llm.generate(
    ["Write a Python function to sort a list."],
    SamplingParams(temperature=0.0, max_tokens=256),
)
```
NVIDIA Model Optimizer
```python
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)

# Quantize to INT4 AWQ. `calibrate` is a user-supplied forward loop
# that runs a small calibration dataset through the model.
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop=calibrate)

# Export for vLLM / TensorRT-LLM deployment
mtq.export(model, "quantized-model")
```
Memory Estimation
```python
def estimate_memory_gb(params_billions: float, precision: str, overhead: float = 1.2) -> float:
    """Estimate GPU memory for model loading."""
    bytes_per_param = {
        "fp32": 4, "fp16": 2, "bf16": 2,
        "int8": 1, "int4": 0.5, "fp8": 1,
    }
    base = params_billions * bytes_per_param[precision]
    return base * overhead  # 20% overhead for activations, KV cache

# Examples
print(f"Llama 3.1 8B FP16:  {estimate_memory_gb(8, 'fp16'):.1f} GB")   # 19.2 GB
print(f"Llama 3.1 8B INT4:  {estimate_memory_gb(8, 'int4'):.1f} GB")   # 4.8 GB
print(f"Llama 3.1 70B FP16: {estimate_memory_gb(70, 'fp16'):.1f} GB")  # 168.0 GB
print(f"Llama 3.1 70B INT4: {estimate_memory_gb(70, 'int4'):.1f} GB")  # 42.0 GB
```
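Weight memory is only part of the budget: the KV cache grows linearly with sequence length and batch size, and often dominates at high concurrency. A rough companion estimator, using the standard 2 (K and V) x layers x KV-heads x head-dim x tokens x bytes formula; the Llama 3.1 8B figures below are its published GQA configuration (32 layers, 8 KV heads, head dim 128):

```python
def estimate_kv_cache_gb(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_value: int = 2,  # FP16/BF16 cache
) -> float:
    """KV cache size: 2 tensors (K, V) per layer, per KV head, per token."""
    total = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value
    return total / 1e9

# Llama 3.1 8B (GQA: 8 KV heads instead of 32 query heads)
print(f"8B, 8k ctx, batch 1:  {estimate_kv_cache_gb(32, 8, 128, 8192, 1):.2f} GB")
print(f"8B, 8k ctx, batch 32: {estimate_kv_cache_gb(32, 8, 128, 8192, 32):.2f} GB")
```

At batch 32 the cache alone exceeds the INT4 weight footprint, which is why PagedAttention's cache management matters so much for serving throughput.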
Hardware Sizing Guide
| Model | FP16 VRAM | INT4 VRAM | Example GPU (INT4) |
|---|---|---|---|
| 3B | 6 GB | 2 GB | RTX 3060 (12 GB) |
| 7-8B | 16 GB | 4 GB | RTX 4070 (12 GB) |
| 13B | 26 GB | 7 GB | RTX 4090 (24 GB) |
| 34B | 68 GB | 17 GB | A100 40 GB |
| 70B | 140 GB | 35 GB | 2x A100 80 GB |
| 405B | 810 GB | 203 GB | 8x A100 80 GB |
Configuration Reference
vLLM Server Parameters
| Parameter | Default | Description |
|---|---|---|
| `--model` | Required | HuggingFace model ID or path |
| `--dtype` | auto | Data type: auto, float16, bfloat16, float32 |
| `--quantization` | None | Quantization method: awq, gptq, squeezellm |
| `--tensor-parallel-size` | 1 | Number of GPUs for tensor parallelism |
| `--gpu-memory-utilization` | 0.9 | Fraction of GPU memory to use |
| `--max-model-len` | Model default | Maximum sequence length |
| `--max-num-seqs` | 256 | Maximum concurrent sequences |
| `--enable-prefix-caching` | False | Cache common prompt prefixes |
| `--speculative-model` | None | Draft model for speculative decoding |
| `--num-speculative-tokens` | None | Tokens per draft step |
Best Practices
- Start with quantization -- INT4 quantization (AWQ or bitsandbytes NF4) gives the highest impact with the least effort, reducing memory by 75% with minimal quality loss.
- Use vLLM or SGLang for production serving -- These frameworks provide continuous batching, PagedAttention, and prefix caching out of the box, dramatically outperforming naive inference.
- Match quantization method to deployment target -- Use AWQ/GPTQ for GPU serving, GGUF for CPU/Apple Silicon, bitsandbytes for quick experimentation, FP8 for H100.
- Enable Flash Attention everywhere -- PyTorch 2.2+ includes Flash Attention via `scaled_dot_product_attention`. Ensure your framework uses it for 2-4x attention speedup.
- Profile before optimizing -- Use `torch.profiler` or `nsys` to identify actual bottlenecks (memory, compute, I/O) before applying optimization techniques.
- Size your hardware to the quantized model -- Calculate INT4 memory requirements and add 20-30% overhead for KV cache and activations when selecting GPU instances.
- Use speculative decoding for latency-sensitive tasks -- For single-request latency (not throughput), speculative decoding with a small draft model can provide 2-3x speedup.
- Enable prefix caching for repeated system prompts -- If many requests share the same system prompt or few-shot examples, prefix caching avoids redundant computation.
- Benchmark with realistic workloads -- Test with production-representative prompt lengths, batch sizes, and concurrency. Synthetic benchmarks often overestimate throughput.
- Consider distillation for maximum efficiency -- If you control the training pipeline, distilling a large model into a smaller one often outperforms quantization of the large model.
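For the benchmarking advice above, a minimal framework-agnostic sketch: record per-request timestamps however your client exposes them, then derive mean time-to-first-token (TTFT) and aggregate throughput. `RequestTiming` and `summarize` are illustrative names, not part of vLLM or any other library:

```python
from dataclasses import dataclass

@dataclass
class RequestTiming:
    start: float         # request submitted (seconds)
    first_token: float   # first token received
    end: float           # last token received
    output_tokens: int

def summarize(timings: list[RequestTiming]) -> dict[str, float]:
    """Aggregate mean TTFT and end-to-end tokens/sec across concurrent requests."""
    ttfts = [t.first_token - t.start for t in timings]
    total_tokens = sum(t.output_tokens for t in timings)
    wall = max(t.end for t in timings) - min(t.start for t in timings)
    return {
        "mean_ttft_s": sum(ttfts) / len(ttfts),
        "throughput_tok_s": total_tokens / wall,
    }

# Two overlapping requests of 256 output tokens each
timings = [
    RequestTiming(start=0.0, first_token=0.12, end=4.0, output_tokens=256),
    RequestTiming(start=0.5, first_token=0.68, end=4.5, output_tokens=256),
]
print(summarize(timings))
```

Measuring over wall-clock time with overlapping requests is what makes the number honest: summing per-request throughput would double-count time the GPU spent serving both.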
Troubleshooting
CUDA out of memory when loading model:
Reduce gpu_memory_utilization, use quantization (INT4/INT8), enable tensor parallelism across multiple GPUs, or reduce max_model_len to shrink KV cache allocation.
vLLM throughput lower than expected:
Ensure enforce_eager=False to enable CUDA graphs. Increase max_num_seqs to allow more concurrent batching. Check that GPU utilization is >90% with nvidia-smi.
Quantized model produces worse output quality:
Switch from INT4 to INT8 quantization. Use AWQ with calibration data instead of bitsandbytes NF4. For GGUF, use importance matrix (imatrix) quantization for better low-bit quality.
Speculative decoding not improving latency: Speculative decoding works best when the draft model's acceptance rate is high. Use a draft model from the same family. For creative/high-temperature generation, acceptance rates drop and benefits diminish.
TensorRT-LLM build fails: Verify CUDA toolkit version matches TensorRT requirements. Ensure sufficient disk space for the compiled engine (can be 2-3x model size). Check that the model architecture is supported.