
GPTQ Optimization Kit

Enterprise-grade skill for post-training quantization of LLMs. Includes structured workflows, validation checks, and reusable patterns for AI research.


GPTQ Quantization Kit

Post-training quantization method that compresses large language models to 4-bit precision with minimal accuracy loss, enabling deployment of 70B+ models on consumer GPUs.

When to Use

Choose GPTQ when you need:

  • 4x memory reduction for large models (70B+) on limited GPU memory
  • Consumer GPU deployment (RTX 3090, 4090) for models that normally require multi-GPU setups
  • Fast inference with 3-4x speedup over FP16
  • Pre-quantized models from HuggingFace (1000+ available via TheBloke)

Consider alternatives when:

  • You need calibration-free quantization → use HQQ
  • You want simpler integration without custom kernels → use bitsandbytes
  • You're targeting CPU or Apple Silicon → use GGUF/llama.cpp
  • You need slightly better accuracy with newer GPUs → use AWQ

Quick Start

Installation

```bash
# Core installation
pip install auto-gptq transformers accelerate

# With Triton backend (Linux, faster)
pip install "auto-gptq[triton]"
```

Load a Pre-Quantized Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# HuggingFace auto-detects the GPTQ format
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-Chat-GPTQ",
    device_map="auto",
    trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-GPTQ")

inputs = tokenizer("Explain quantum computing", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Quantize Your Own Model

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
from datasets import load_dataset

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
    damp_percent=0.01,
)
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config=quantize_config)

# Prepare calibration data (128 samples recommended);
# each example is a dict with input_ids and attention_mask tensors
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
calibration_data = [
    tokenizer(example["text"], return_tensors="pt", truncation=True, max_length=512)
    for example in dataset.take(128)
]

model.quantize(calibration_data)
model.save_quantized("llama-2-7b-gptq")
tokenizer.save_pretrained("llama-2-7b-gptq")
```

Core Concepts

Group-Wise Quantization

GPTQ divides weight matrices into groups (typically 128 elements) and quantizes each group with its own scale and zero-point. This preserves accuracy far better than per-tensor quantization.

```
Weight matrix: [1024, 4096] = ~4.2M elements
Group size = 128 → 32,768 groups
Each group: 4-bit weights with their own scale + zero-point
Result: ~1.5% perplexity increase vs FP16
```
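The per-group arithmetic above can be sketched in plain Python: each group gets its own scale and zero-point, weights are rounded to 4-bit integers, and dequantization reconstructs an approximation. This is an illustrative round-to-nearest sketch only — the actual GPTQ algorithm additionally applies Hessian-based error compensation during rounding:

```python
import random

def quantize_group(w, bits=4):
    """Asymmetric round-to-nearest quantization of one weight group."""
    qmax = (1 << bits) - 1
    lo, hi = min(w), max(w)
    scale = (hi - lo) / qmax or 1.0      # step size for this group
    zero_point = round(-lo / scale)      # integer offset so lo maps near 0
    q = [max(0, min(qmax, round(x / scale + zero_point))) for x in w]
    return q, scale, zero_point

def dequantize_group(q, scale, zero_point):
    return [(v - zero_point) * scale for v in q]

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1024)]

group_size = 128
recon = []
for i in range(0, len(weights), group_size):
    q, s, z = quantize_group(weights[i:i + group_size])
    recon.extend(dequantize_group(q, s, z))

# Per-element error is bounded by half the group's scale
max_err = max(abs(a - b) for a, b in zip(weights, recon))
```

Because each group's scale tracks only that group's value range, outliers in one group do not inflate the quantization step for the rest of the matrix — the key advantage over per-tensor quantization.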

Kernel Backends

| Backend | Speed | GPU Requirement | Notes |
|---|---|---|---|
| ExLlamaV2 | Fast (4.2x vs FP16) | Any CUDA GPU | Default, recommended |
| Marlin | Fastest (4.8x vs FP16) | Ampere+ (A100, RTX 40xx) | Best for newer GPUs |
| Triton | Fast (3.5x vs FP16) | Linux only | Good alternative |
| CUDA | Good (3.4x vs FP16) | Any CUDA GPU | Fallback |
```python
# Use ExLlamaV2 (default, fastest for most GPUs)
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_exllama=True,
    exllama_config={"version": 2},
)

# Use Marlin (best on Ampere+ GPUs)
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_marlin=True,
)
```

Quantization Configurations

| Config | Bits | Group Size | desc_act | Memory Reduction | Accuracy Loss |
|---|---|---|---|---|---|
| Standard | 4 | 128 | False | 4x | ~1.5% |
| High accuracy | 4 | 32 | True | 3.5x | ~0.8% |
| Aggressive | 3 | 128 | True | 5x | ~3% |

Configuration

| Parameter | Default | Description |
|---|---|---|
| `bits` | 4 | Quantization bits (3, 4, or 8) |
| `group_size` | 128 | Weights per quantization group |
| `desc_act` | False | Activation reordering (better accuracy, slower kernel) |
| `damp_percent` | 0.01 | Dampening factor for the Hessian |
| `use_exllama` | True | Use ExLlamaV2 kernel |
| `use_marlin` | False | Use Marlin kernel (Ampere+) |
| `use_triton` | False | Use Triton kernel (Linux) |

Memory Benchmarks

| Model | FP16 | GPTQ 4-bit | GPU Required |
|---|---|---|---|
| Llama 2-7B | 14 GB | 3.5 GB | RTX 3060 12GB |
| Llama 2-13B | 26 GB | 6.5 GB | RTX 4090 24GB |
| Llama 2-70B | 140 GB | 35 GB | A100 80GB (single) |
| Llama 3-405B | 810 GB | 203 GB | 3x A100 80GB |
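The numbers above follow from a back-of-the-envelope estimate: weight memory ≈ parameter count × bits / 8, plus per-group scale/zero-point overhead. A rough sketch (approximate — it ignores activations, KV cache, and framework overhead):

```python
def gptq_memory_gb(n_params_billion, bits=4, group_size=128, scale_bytes=2):
    """Rough weight-memory estimate for a GPTQ-quantized model.

    Adds an FP16 scale + zero-point per group of weights; ignores
    activations, KV cache, and framework overhead.
    """
    params = n_params_billion * 1e9
    weight_bytes = params * bits / 8
    overhead_bytes = (params / group_size) * 2 * scale_bytes
    return (weight_bytes + overhead_bytes) / 1e9

# e.g. a 7B model at 4-bit lands in the ~3.5-3.8 GB range,
# matching the Llama 2-7B row above
```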

Best Practices

  1. Start with pre-quantized models from TheBloke on HuggingFace before quantizing your own
  2. Use group_size=128 as the default — smaller groups improve accuracy but increase model size
  3. Set desc_act=False for CUDA/ExLlama kernel compatibility (set True only when accuracy is critical)
  4. Calibration data matters — use 128+ samples representative of your target domain
  5. Choose the right kernel — ExLlamaV2 for general use, Marlin for Ampere+ GPUs
  6. Verify quality after quantization by comparing perplexity against the FP16 baseline
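For point 6, perplexity is the exponential of the mean negative log-likelihood over tokens, so the FP16 and quantized models can be compared on the same held-out text. A minimal sketch with hypothetical per-token log-probabilities (in practice these come from running each model over the evaluation set):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical per-token log-probs from FP16 vs. GPTQ runs
fp16_logprobs = [-2.1, -1.8, -2.4, -1.9]
gptq_logprobs = [-2.15, -1.83, -2.44, -1.92]

rel_increase = perplexity(gptq_logprobs) / perplexity(fp16_logprobs) - 1
```

A relative perplexity increase in the low single digits (the ~1.5% cited above for the standard config) is typically acceptable; a large jump signals a calibration or configuration problem.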

Common Issues

CUDA out of memory during quantization: Use device_map="auto" to spread across multiple GPUs, or quantize on a machine with more VRAM and then deploy the quantized model on smaller hardware.

Slow inference despite quantization: Check which kernel backend is active. Switch to ExLlamaV2 or Marlin for best performance. Also ensure you're not running in desc_act=True mode unnecessarily.

Poor generation quality: Try a smaller group_size (64 or 32) for better accuracy. Ensure calibration data is representative of your use case. Consider using AWQ if GPTQ quality is insufficient.

Model loading errors: Ensure the auto-gptq version matches the format of the quantized model. Models quantized with older versions may need inject_fused_attention=False.
