
HQQ - Half-Quadratic Quantization Kit

Fast, calibration-free weight quantization supporting 8/4/3/2/1-bit precision with multiple optimized inference backends. Quantize any model instantly without sample data.

When to Use

Choose HQQ when you need:

  • Instant quantization without calibration data (no dataset preparation)
  • Fast quantization time (minutes vs hours for GPTQ/AWQ)
  • Extreme quantization experiments (2-bit, 1-bit)
  • Flexible backend selection for different hardware
  • Fine-tuning quantized models with LoRA/PEFT

Consider alternatives when:

  • Maximum accuracy with calibration data available → use GPTQ
  • Production serving with calibration-based accuracy → use AWQ
  • Simple 8-bit/4-bit without custom backends → use bitsandbytes
  • CPU inference or Apple Silicon deployment → use GGUF/llama.cpp

Quick Start

Installation

pip install hqq

# Optional backends
pip install hqq[torchao]   # TorchAO int4 backend
pip install hqq[marlin]    # Marlin backend (Ampere+)

Quantize with HuggingFace Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

quantization_config = HqqConfig(
    nbits=4,
    group_size=64,
    axis=1
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quantization_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Ready to use immediately — no calibration step
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Load Pre-Quantized Model

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
    device_map="auto"
)

Core Concepts

Calibration-Free Approach

Unlike GPTQ and AWQ, which require calibration datasets and hours of processing, HQQ uses half-quadratic optimization to find good quantization parameters directly from the weight statistics (a simplified sketch of the idea follows the list below). This means:

  • No dataset preparation needed
  • Quantization completes in minutes
  • Works on any model architecture without special handling
  • Reproducible results (no calibration data variance)
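
To make the approach concrete, here is a minimal, illustrative sketch of how a half-quadratic splitting loop can fit a zero-point from the weights alone, with the scale held fixed. This is not the library's internal code; the shrink_lp thresholding formula, the p and beta values, and the iteration count are assumptions chosen for readability.

import torch

def shrink_lp(x, beta, p=0.7):
    # Generalized soft-thresholding: proximal step for an l_p error norm with p < 1 (illustrative)
    return torch.sign(x) * torch.relu(x.abs() - (1.0 / beta) * x.abs().clamp_min(1e-8).pow(p - 1))

def optimize_zero_point(W, scale, zero, n_iter=20, beta=10.0):
    # W: a float weight group; scale/zero: initial quantization parameters that broadcast over W
    for _ in range(n_iter):
        W_q = torch.round(W / scale + zero).clamp(0, 15)    # 4-bit quantize
        W_deq = (W_q - zero) * scale                        # dequantize
        e = shrink_lp(W - W_deq, beta)                      # sparsified reconstruction error
        zero = torch.mean(W_q - (W - e) / scale, dim=0, keepdim=True)  # closed-form zero-point update
    return zero

Every quantity above comes from the weight tensor itself, so no activations or sample batches are involved, which is why quantization finishes in minutes.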

Backend Selection

HQQ supports multiple inference backends for different hardware profiles:

Backend           Best For               Speed      Requirement
pytorch           Compatibility          Baseline   Any GPU
pytorch_compile   Moderate speedup       1.3x       torch >= 2.0
aten              Good balance           1.5x       CUDA GPU
torchao_int4      4-bit inference        2x         torchao installed
marlin            Maximum 4-bit speed    2.5x       Ampere+ GPU

from hqq.core.quantize import HQQLinear

# Set backend globally
HQQLinear.set_backend("marlin")

# Or per layer
hqq_layer.set_backend("torchao_int4")

Mixed Precision Quantization

Apply different precision to different layer types for optimal quality/size tradeoff:

from transformers import AutoModelForCausalLM, HqqConfig

config = HqqConfig(
    nbits=4,
    group_size=64,
    dynamic_config={
        "attn": {"nbits": 4, "group_size": 64},  # Higher precision for attention
        "mlp": {"nbits": 2, "group_size": 32}    # More compression for MLP
    }
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto"
)

Configuration

Parameter        Default   Description
nbits            4         Bits per weight (1, 2, 3, 4, or 8)
group_size       64        Weights per quantization group
axis             1         Quantization axis (0 = input, 1 = output dimension)
dynamic_config   None      Per-layer-type precision settings

Group Size Guidelines

nbits   Recommended group_size   Notes
4       64                       Good balance of quality and compression
3       32-64                    Smaller groups help at lower bits
2       16-32                    Must use small groups for usable quality
1       8-16                     Experimental, significant quality loss
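
For example, an aggressive 2-bit configuration following the table above might look like this (the model name is simply the one used elsewhere in this guide):

from transformers import AutoModelForCausalLM, HqqConfig

# 2-bit weights with small groups, per the guideline above
config = HqqConfig(nbits=2, group_size=16)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto"
)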

LoRA Fine-Tuning

from transformers import AutoModelForCausalLM, HqqConfig
from peft import LoraConfig, get_peft_model

quant_config = HqqConfig(nbits=4, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quant_config,
    device_map="auto"
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

# Train with Trainer or a custom loop

vLLM Integration

from vllm import LLM, SamplingParams

llm = LLM(
    model="mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
    quantization="hqq",
    dtype="float16"
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(["What is machine learning?"], sampling_params)

Best Practices

  1. Start with 4-bit, group_size=64 — best quality/compression tradeoff for most models
  2. Use Marlin backend on Ampere+ GPUs for maximum inference speed
  3. Apply mixed precision — keep attention layers at 4-bit, compress MLP layers to 2-bit
  4. Verify generation quality after quantization before deploying
  5. Combine with torch.compile for additional inference speedup
  6. Use vLLM for production serving of HQQ-quantized models

Common Issues

Out of memory during quantization: Use device_map="sequential" to load and quantize layers one at a time instead of loading the full model.
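
A short sketch of that fix, reusing the Quick Start configuration with only the device_map changed:

from transformers import AutoModelForCausalLM, HqqConfig

# "sequential" fills devices in order, so layers are loaded and quantized incrementally
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=HqqConfig(nbits=4, group_size=64),
    device_map="sequential"
)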

Poor quality at 2-bit: Reduce group_size to 16. If quality is still insufficient, use mixed precision with 4-bit attention and 2-bit MLP layers.

Slow inference: Switch to an optimized backend (marlin for Ampere+, torchao_int4 for general CUDA). Compile the model with torch.compile(model, mode="reduce-overhead").
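
A minimal sketch of those two steps, assuming model is an already-quantized HQQ model from the earlier examples and that the backend names listed in the table above apply to your hqq version:

import torch
from hqq.core.quantize import HQQLinear

HQQLinear.set_backend("marlin")   # or "torchao_int4" on non-Ampere CUDA GPUs

# reduce-overhead mode uses CUDA graphs to cut per-call launch overhead during generation
model = torch.compile(model, mode="reduce-overhead")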
