GPTQ Quantization Kit
Post-training quantization method that compresses large language models to 4-bit precision with minimal accuracy loss, enabling deployment of 70B+ models on consumer GPUs.
When to Use
Choose GPTQ when you need:
- 4x memory reduction for large models (70B+) on limited GPU memory
- Consumer GPU deployment (RTX 3090, 4090) for models that normally require multi-GPU setups
- Fast inference with 3-4x speedup over FP16
- Pre-quantized models from HuggingFace (1000+ available via TheBloke)
Consider alternatives when:
- You need calibration-free quantization → use HQQ
- You want simpler integration without custom kernels → use bitsandbytes
- You're targeting CPU or Apple Silicon → use GGUF/llama.cpp
- You need slightly better accuracy with newer GPUs → use AWQ
Quick Start
Installation
```shell
# Core installation
pip install auto-gptq transformers accelerate

# With Triton backend (Linux, faster)
pip install auto-gptq[triton]
```
Load a Pre-Quantized Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# HuggingFace auto-detects the GPTQ format from the model's config
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-Chat-GPTQ",
    device_map="auto",
    trust_remote_code=False
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-GPTQ")

inputs = tokenizer("Explain quantum computing", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Quantize Your Own Model
```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
from datasets import load_dataset

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
    damp_percent=0.01
)

model = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config=quantize_config
)

# Prepare calibration data (128 samples recommended); auto-gptq expects
# dicts with input_ids / attention_mask tensors
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
calibration_data = [
    tokenizer(example["text"], return_tensors="pt", truncation=True, max_length=512)
    for example in dataset.take(128)
]

model.quantize(calibration_data)
model.save_quantized("llama-2-7b-gptq")
tokenizer.save_pretrained("llama-2-7b-gptq")
```
Core Concepts
Group-Wise Quantization
GPTQ divides weight matrices into groups (typically 128 elements) and quantizes each group with its own scale and zero-point. This preserves accuracy far better than per-tensor quantization.
Weight matrix: [1024, 4096] = 4.2M elements
Group size = 128 → 32,768 groups
Each group: independent scale + zero-point for its 4-bit weights
Result: ~1.5% perplexity increase vs FP16
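The grouping above can be sketched numerically. This is a minimal NumPy illustration of per-group asymmetric quantization only; GPTQ itself additionally uses Hessian-based error compensation when rounding, so treat this as a sketch of the grouping idea, not the full algorithm.

```python
import numpy as np

# Simulate group-wise 4-bit quantization of a [1024, 4096] weight matrix
rng = np.random.default_rng(0)
weights = rng.standard_normal((1024, 4096)).astype(np.float32)

group_size = 128
qmax = 2**4 - 1  # 4-bit -> integer codes 0..15
groups = weights.reshape(-1, group_size)
print("groups:", groups.shape[0])  # 32768, matching the figures above

# Each group gets its own scale and zero-point (asymmetric quantization)
mins = groups.min(axis=1, keepdims=True)
maxs = groups.max(axis=1, keepdims=True)
scales = (maxs - mins) / qmax
zero_points = np.round(-mins / scales)

q = np.clip(np.round(groups / scales + zero_points), 0, qmax)  # integer codes
recon = (q - zero_points) * scales                             # dequantize

err = np.abs(recon - groups).mean()
print(f"mean abs reconstruction error: {err:.4f}")
```

Even with plain round-to-nearest, per-group scales keep the reconstruction error small; per-tensor quantization would use a single scale for all 4.2M elements and lose far more precision in the tails.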
Kernel Backends
| Backend | Speed | GPU Requirement | Notes |
|---|---|---|---|
| ExLlamaV2 | Very fast (4.2x vs FP16) | Any CUDA GPU | Default, recommended |
| Marlin | Fastest (4.8x vs FP16) | Ampere+ (A100, RTX 40xx) | Best for newer GPUs |
| Triton | Fast (3.5x vs FP16) | Linux only | Good alternative |
| CUDA | Good (3.4x vs FP16) | Any CUDA GPU | Fallback |
```python
# Use ExLlamaV2 (default, fastest for most GPUs)
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_exllama=True,
    exllama_config={"version": 2}
)

# Use Marlin (best on Ampere+ GPUs)
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_marlin=True
)
```
Quantization Configurations
| Config | Bits | Group Size | desc_act | Memory Reduction | Accuracy Loss |
|---|---|---|---|---|---|
| Standard | 4 | 128 | False | 4x | ~1.5% |
| High accuracy | 4 | 32 | True | 3.5x | ~0.8% |
| Aggressive | 3 | 128 | True | 5x | ~3% |
Configuration
| Parameter | Default | Description |
|---|---|---|
| bits | 4 | Quantization bits (3, 4, or 8) |
| group_size | 128 | Weights per quantization group |
| desc_act | False | Activation reordering (better accuracy, slower kernel) |
| damp_percent | 0.01 | Dampening factor for the Hessian |
| use_exllama | True | Use the ExLlamaV2 kernel |
| use_marlin | False | Use the Marlin kernel (Ampere+) |
| use_triton | False | Use the Triton kernel (Linux) |
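The group_size trade-off in the tables above follows from simple accounting: each group carries its own scale and zero-point, so smaller groups mean more metadata per weight. A rough sketch, assuming about 4 bytes of per-group overhead (an assumption; the exact packing varies by kernel):

```python
def compression_ratio(bits=4, group_size=128, overhead_bytes=4):
    """Compression vs FP16 (2 bytes/weight), counting per-group
    scale + zero-point overhead (assumed ~4 bytes per group)."""
    bytes_per_weight = bits / 8 + overhead_bytes / group_size
    return 2.0 / bytes_per_weight

for g in (128, 32):
    print(f"group_size={g}: {compression_ratio(group_size=g):.2f}x compression")
```

The results land near the table's 4x (group_size=128) and 3.5x (group_size=32) figures, and show why shrinking groups buys accuracy at the cost of model size.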
Memory Benchmarks
| Model | FP16 | GPTQ 4-bit | GPU Required |
|---|---|---|---|
| Llama 2-7B | 14 GB | 3.5 GB | RTX 3060 12GB |
| Llama 2-13B | 26 GB | 6.5 GB | RTX 4090 24GB |
| Llama 2-70B | 140 GB | 35 GB | A100 80GB (single) |
| Llama 3.1-405B | 810 GB | 203 GB | 3x A100 80GB |
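The table's figures can be estimated from parameter counts alone. A back-of-envelope sketch (weights only; it ignores embeddings kept in FP16, activations, and the KV cache, so real usage runs somewhat higher):

```python
def gptq_weight_memory_gb(n_params, bits=4, group_size=128):
    """Rough weight-only memory estimate for a GPTQ model, in decimal GB.

    Adds ~4 bytes per group for the FP16 scale + zero-point; real usage
    is higher once activations and the KV cache are counted.
    """
    packed = n_params * bits / 8              # packed integer weights
    overhead = (n_params / group_size) * 4    # per-group scale + zero-point
    return (packed + overhead) / 1e9

for name, n in [("Llama 2-7B", 7e9), ("Llama 2-13B", 13e9), ("Llama 2-70B", 70e9)]:
    fp16_gb = n * 2 / 1e9
    print(f"{name}: FP16 ~{fp16_gb:.0f} GB, GPTQ 4-bit ~{gptq_weight_memory_gb(n):.1f} GB")
```

The estimates come out slightly above the table (e.g. ~37 GB vs 35 GB for 70B) because the table rounds away the per-group overhead.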
Best Practices
- Start with pre-quantized models from TheBloke on HuggingFace before quantizing your own
- Use group_size=128 as the default — smaller groups improve accuracy but increase model size
- Set desc_act=False for CUDA/ExLlama kernel compatibility (set True only when accuracy is critical)
- Calibration data matters — use 128+ samples representative of your target domain
- Choose the right kernel — ExLlamaV2 for general use, Marlin for Ampere+ GPUs
- Verify quality after quantization by comparing perplexity against the FP16 baseline
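For the last point, perplexity is just the exponentiated mean per-token negative log-likelihood, which transformers exposes as the model loss. A minimal sketch (the model-evaluation part is shown in comments because it requires a GPU and model downloads; the model names used elsewhere in this page are illustrative):

```python
import math

def perplexity(nll_losses):
    """Perplexity from a list of mean per-token negative log-likelihoods."""
    return math.exp(sum(nll_losses) / len(nll_losses))

# Usage with transformers (illustrative):
#   loss = model(input_ids, labels=input_ids).loss.item()  # mean NLL per token
# Collect `loss` over the same held-out batches for both the FP16 and the
# GPTQ model, then compare perplexity(fp16_losses) vs perplexity(gptq_losses);
# a well-quantized 4-bit model typically lands within a few percent.

print(perplexity([2.0, 2.0, 2.0]))  # e**2 ~= 7.389
```

Keeping the evaluation batches identical for both models is what makes the comparison meaningful; the ~1.5% perplexity increase quoted above is measured this way.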
Common Issues
CUDA out of memory during quantization:
Use device_map="auto" to spread across multiple GPUs, or quantize on a machine with more VRAM and then deploy the quantized model on smaller hardware.
Slow inference despite quantization: Check which kernel backend is active. Switch to ExLlamaV2 or Marlin for best performance. Also ensure you're not running in desc_act=True mode unnecessarily.
Poor generation quality: Try a smaller group_size (64 or 32) for better accuracy. Ensure calibration data is representative of your use case. Consider using AWQ if GPTQ quality is insufficient.
Model loading errors:
Ensure the auto-gptq version matches the format of the quantized model. Models quantized with older versions may need inject_fused_attention=False.