GPTQ Quantization Kit
Post-training quantization method that compresses large language models to 4-bit precision with minimal accuracy loss, enabling deployment of 70B+ models on consumer GPUs.
When to Use
Choose GPTQ when you need:
- 4x memory reduction for large models (70B+) on limited GPU memory
- Consumer GPU deployment (RTX 3090, 4090) for models that normally require multi-GPU setups
- Fast inference with 3-4x speedup over FP16
- Pre-quantized models from HuggingFace (1000+ available via TheBloke)
Consider alternatives when:
- You need calibration-free quantization → use HQQ
- You want simpler integration without custom kernels → use bitsandbytes
- You're targeting CPU or Apple Silicon → use GGUF/llama.cpp
- You need slightly better accuracy with newer GPUs → use AWQ
Quick Start
Installation
```shell
# Core installation
pip install auto-gptq transformers accelerate

# With Triton backend (Linux, faster)
pip install auto-gptq[triton]
```
Load a Pre-Quantized Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# HuggingFace auto-detects the GPTQ format from the model's config
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-Chat-GPTQ",
    device_map="auto",
    trust_remote_code=False
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-GPTQ")

inputs = tokenizer("Explain quantum computing", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Quantize Your Own Model
```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
from datasets import load_dataset

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
    damp_percent=0.01
)

model = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config=quantize_config
)

# Prepare calibration data (128 samples recommended); auto-gptq expects
# dicts with input_ids / attention_mask tensors
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
calibration_data = [
    tokenizer(example["text"], return_tensors="pt", truncation=True, max_length=512)
    for example in dataset.take(128)
]

model.quantize(calibration_data)
model.save_quantized("llama-2-7b-gptq")
tokenizer.save_pretrained("llama-2-7b-gptq")
```
Core Concepts
Group-Wise Quantization
GPTQ divides weight matrices into groups (typically 128 elements) and quantizes each group with its own scale and zero-point. This preserves accuracy far better than per-tensor quantization.
Weight matrix: [1024, 4096] = 4.2M elements
Group size = 128 → 32,768 groups
Each group: independent scale + zero-point for its 4-bit weights
Result: ~1.5% perplexity increase vs FP16
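The grouping above can be sketched numerically. This is a minimal NumPy illustration of per-group asymmetric quantization only; GPTQ itself additionally uses Hessian-based error compensation when rounding, so treat this as a sketch of the grouping idea, not the full algorithm.

```python
import numpy as np

# Simulate group-wise 4-bit quantization of a [1024, 4096] weight matrix
rng = np.random.default_rng(0)
weights = rng.standard_normal((1024, 4096)).astype(np.float32)

group_size = 128
qmax = 2**4 - 1  # 4-bit -> integer codes 0..15
groups = weights.reshape(-1, group_size)
print("groups:", groups.shape[0])  # 32768, matching the figures above

# Each group gets its own scale and zero-point (asymmetric quantization)
mins = groups.min(axis=1, keepdims=True)
maxs = groups.max(axis=1, keepdims=True)
scales = (maxs - mins) / qmax
zero_points = np.round(-mins / scales)

q = np.clip(np.round(groups / scales + zero_points), 0, qmax)  # integer codes
recon = (q - zero_points) * scales                             # dequantize

err = np.abs(recon - groups).mean()
print(f"mean abs reconstruction error: {err:.4f}")
```

Even with plain round-to-nearest, per-group scales keep the reconstruction error small; per-tensor quantization would use a single scale for all 4.2M elements and lose far more precision in the tails.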
Kernel Backends
| Backend | Speed | GPU Requirement | Notes |
|---|---|---|---|
| ExLlamaV2 | Very fast (4.2x vs FP16) | Any CUDA GPU | Default, recommended |
| Marlin | Fastest (4.8x vs FP16) | Ampere+ (A100, RTX 40xx) | Best for newer GPUs |
| Triton | Fast (3.5x vs FP16) | Linux only | Good alternative |
| CUDA | Good (3.4x vs FP16) | Any CUDA GPU | Fallback |
```python
# Use ExLlamaV2 (default, fastest for most GPUs)
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_exllama=True,
    exllama_config={"version": 2}
)

# Use Marlin (best on Ampere+ GPUs)
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_marlin=True
)
```
Quantization Configurations
| Config | Bits | Group Size | desc_act | Memory Reduction | Accuracy Loss |
|---|---|---|---|---|---|
| Standard | 4 | 128 | False | 4x | ~1.5% |
| High accuracy | 4 | 32 | True | 3.5x | ~0.8% |
| Aggressive | 3 | 128 | True | 5x | ~3% |
Configuration
| Parameter | Default | Description |
|---|---|---|
| bits | 4 | Quantization bits (3, 4, or 8) |
| group_size | 128 | Weights per quantization group |
| desc_act | False | Activation reordering (better accuracy, slower kernel) |
| damp_percent | 0.01 | Dampening factor for the Hessian |
| use_exllama | True | Use the ExLlamaV2 kernel |
| use_marlin | False | Use the Marlin kernel (Ampere+) |
| use_triton | False | Use the Triton kernel (Linux) |
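The group_size trade-off in the tables above follows from simple accounting: each group carries its own scale and zero-point, so smaller groups mean more metadata per weight. A rough sketch, assuming about 4 bytes of per-group overhead (an assumption; the exact packing varies by kernel):

```python
def compression_ratio(bits=4, group_size=128, overhead_bytes=4):
    """Compression vs FP16 (2 bytes/weight), counting per-group
    scale + zero-point overhead (assumed ~4 bytes per group)."""
    bytes_per_weight = bits / 8 + overhead_bytes / group_size
    return 2.0 / bytes_per_weight

for g in (128, 32):
    print(f"group_size={g}: {compression_ratio(group_size=g):.2f}x compression")
```

The results land near the table's 4x (group_size=128) and 3.5x (group_size=32) figures, and show why shrinking groups buys accuracy at the cost of model size.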
Memory Benchmarks
| Model | FP16 | GPTQ 4-bit | GPU Required |
|---|---|---|---|
| Llama 2-7B | 14 GB | 3.5 GB | RTX 3060 12GB |
| Llama 2-13B | 26 GB | 6.5 GB | RTX 4090 24GB |
| Llama 2-70B | 140 GB | 35 GB | A100 80GB (single) |
| Llama 3.1-405B | 810 GB | 203 GB | 3x A100 80GB |
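The table's figures can be estimated from parameter counts alone. A back-of-envelope sketch (weights only; it ignores embeddings kept in FP16, activations, and the KV cache, so real usage runs somewhat higher):

```python
def gptq_weight_memory_gb(n_params, bits=4, group_size=128):
    """Rough weight-only memory estimate for a GPTQ model, in decimal GB.

    Adds ~4 bytes per group for the FP16 scale + zero-point; real usage
    is higher once activations and the KV cache are counted.
    """
    packed = n_params * bits / 8              # packed integer weights
    overhead = (n_params / group_size) * 4    # per-group scale + zero-point
    return (packed + overhead) / 1e9

for name, n in [("Llama 2-7B", 7e9), ("Llama 2-13B", 13e9), ("Llama 2-70B", 70e9)]:
    fp16_gb = n * 2 / 1e9
    print(f"{name}: FP16 ~{fp16_gb:.0f} GB, GPTQ 4-bit ~{gptq_weight_memory_gb(n):.1f} GB")
```

The estimates come out slightly above the table (e.g. ~37 GB vs 35 GB for 70B) because the table rounds away the per-group overhead.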
Best Practices
- Start with pre-quantized models from TheBloke on HuggingFace before quantizing your own
- Use group_size=128 as the default — smaller groups improve accuracy but increase model size
- Set desc_act=False for CUDA/ExLlama kernel compatibility (set True only when accuracy is critical)
- Calibration data matters — use 128+ samples representative of your target domain
- Choose the right kernel — ExLlamaV2 for general use, Marlin for Ampere+ GPUs
- Verify quality after quantization by comparing perplexity against the FP16 baseline
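For the last point, perplexity is just the exponentiated mean per-token negative log-likelihood, which transformers exposes as the model loss. A minimal sketch (the model-evaluation part is shown in comments because it requires a GPU and model downloads; the model names used elsewhere in this page are illustrative):

```python
import math

def perplexity(nll_losses):
    """Perplexity from a list of mean per-token negative log-likelihoods."""
    return math.exp(sum(nll_losses) / len(nll_losses))

# Usage with transformers (illustrative):
#   loss = model(input_ids, labels=input_ids).loss.item()  # mean NLL per token
# Collect `loss` over the same held-out batches for both the FP16 and the
# GPTQ model, then compare perplexity(fp16_losses) vs perplexity(gptq_losses);
# a well-quantized 4-bit model typically lands within a few percent.

print(perplexity([2.0, 2.0, 2.0]))  # e**2 ~= 7.389
```

Keeping the evaluation batches identical for both models is what makes the comparison meaningful; the ~1.5% perplexity increase quoted above is measured this way.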
Common Issues
CUDA out of memory during quantization:
Use device_map="auto" to spread across multiple GPUs, or quantize on a machine with more VRAM and then deploy the quantized model on smaller hardware.
Slow inference despite quantization: Check which kernel backend is active. Switch to ExLlamaV2 or Marlin for best performance. Also ensure you're not running in desc_act=True mode unnecessarily.
Poor generation quality: Try a smaller group_size (64 or 32) for better accuracy. Ensure calibration data is representative of your use case. Consider using AWQ if GPTQ quality is insufficient.
Model loading errors:
Ensure the auto-gptq version matches the format of the quantized model. Models quantized with older versions may need inject_fused_attention=False.