HQQ - Half-Quadratic Quantization Kit
Fast, calibration-free weight quantization supporting 8/4/3/2/1-bit precision with multiple optimized inference backends. Quantize any model instantly without sample data.
When to Use
Choose HQQ when you need:
- Instant quantization without calibration data (no dataset preparation)
- Fast quantization time (minutes vs hours for GPTQ/AWQ)
- Extreme quantization experiments (2-bit, 1-bit)
- Flexible backend selection for different hardware
- Fine-tuning quantized models with LoRA/PEFT
Consider alternatives when:
- Maximum accuracy with calibration data available → use GPTQ
- Production serving with calibration-based accuracy → use AWQ
- Simple 8-bit/4-bit without custom backends → use bitsandbytes
- CPU inference or Apple Silicon deployment → use GGUF/llama.cpp
Quick Start
Installation
```bash
pip install hqq

# Optional backends
pip install hqq[torchao]   # TorchAO int4 backend
pip install hqq[marlin]    # Marlin backend (Ampere+)
```
Quantize with HuggingFace Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

quantization_config = HqqConfig(
    nbits=4,
    group_size=64,
    axis=1,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Ready to use immediately, no calibration step
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Load Pre-Quantized Model
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
    device_map="auto",
)
```
Core Concepts
Calibration-Free Approach
Unlike GPTQ and AWQ, which require calibration datasets and hours of processing, HQQ uses half-quadratic optimization (sketched below) to find optimal quantization parameters directly from weight statistics. This means:
- No dataset preparation needed
- Quantization completes in minutes
- Works on any model architecture without special handling
- Reproducible results (no calibration data variance)
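For intuition, the per-layer problem HQQ solves can be sketched as follows (paraphrased from the HQQ announcement post; the exact formulation in the code may differ). With scale s and zero-point z, quantization is Q_z(W) = round(W/s + z) and dequantization is Q_z⁻¹(W_q) = s·(W_q − z). HQQ fits z to minimize a sparsity-promoting loss φ (e.g. an ℓ_p norm with p < 1) on the dequantization error:

```math
\underset{z}{\arg\min}\;\phi\big(W - Q_z^{-1}(Q_z(W))\big)
```

The half-quadratic trick introduces an auxiliary error variable W_e and alternates two cheap updates, a generalized soft-threshold for W_e and a closed-form average for z:

```math
\underset{z,\,W_e}{\arg\min}\;\phi(W_e) + \frac{\beta}{2}\,\big\lVert W_e - \big(W - Q_z^{-1}(Q_z(W))\big)\big\rVert_2^2
```

No input activations or sample data appear anywhere in the objective, which is why quantization completes in minutes.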
Backend Selection
HQQ supports multiple inference backends for different hardware profiles:
| Backend | Best For | Speed | Requirement |
|---|---|---|---|
| `pytorch` | Compatibility | Baseline | Any GPU |
| `pytorch_compile` | Moderate speedup | 1.3x | torch >= 2.0 |
| `aten` | Good balance | 1.5x | CUDA GPU |
| `torchao_int4` | 4-bit inference | 2x | torchao installed |
| `marlin` | Maximum 4-bit speed | 2.5x | Ampere+ GPU |
```python
from hqq.core.quantize import HQQLinear

# Set backend globally
HQQLinear.set_backend("marlin")

# Or per layer (hqq_layer: an existing HQQLinear instance)
hqq_layer.set_backend("torchao_int4")
```
Mixed Precision Quantization
Apply different precision to different layer types for optimal quality/size tradeoff:
```python
from transformers import AutoModelForCausalLM, HqqConfig

config = HqqConfig(
    nbits=4,
    group_size=64,
    dynamic_config={
        "attn": {"nbits": 4, "group_size": 64},  # Higher precision for attention
        "mlp": {"nbits": 2, "group_size": 32},   # More compression for MLP
    },
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto",
)
```
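One caveat: in the transformers releases I'm aware of, dynamic_config keys are matched against linear-layer module names (e.g. self_attn.q_proj) rather than coarse tags like "attn". If the shorthand above doesn't take effect, a fully spelled-out sketch (layer names assume a Llama-style architecture):

```python
from transformers import AutoModelForCausalLM, HqqConfig

q4 = {"nbits": 4, "group_size": 64}  # attention projections
q2 = {"nbits": 2, "group_size": 32}  # MLP projections

config = HqqConfig(dynamic_config={
    "self_attn.q_proj": q4,
    "self_attn.k_proj": q4,
    "self_attn.v_proj": q4,
    "self_attn.o_proj": q4,
    "mlp.gate_proj": q2,
    "mlp.up_proj": q2,
    "mlp.down_proj": q2,
})

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto",
)
```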
Configuration
| Parameter | Default | Description |
|---|---|---|
| `nbits` | 4 | Bits per weight (1, 2, 3, 4, or 8) |
| `group_size` | 64 | Weights per quantization group |
| `axis` | 1 | Quantization axis (0 = input, 1 = output dimension) |
| `dynamic_config` | None | Per-layer-type precision settings |
Group Size Guidelines
| nbits | Recommended group_size | Notes |
|---|---|---|
| 4 | 64 | Good balance of quality and compression |
| 3 | 32-64 | Smaller groups help at lower bits |
| 2 | 16-32 | Must use small groups for usable quality |
| 1 | 8-16 | Experimental, significant quality loss |
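Keep in mind that each group stores its own scale and zero-point, so shrinking group_size raises the effective bits per weight. A back-of-envelope sketch (assuming unquantized fp16 metadata; HQQ can also quantize the scale/zero to cut this overhead):

```python
def effective_bits(nbits: int, group_size: int, meta_bits: int = 16) -> float:
    """Approximate stored bits per weight: payload plus one scale
    and one zero-point (meta_bits each) shared by every group."""
    return nbits + 2 * meta_bits / group_size

print(effective_bits(4, 64))  # 4.5 -> modest overhead
print(effective_bits(2, 16))  # 4.0 -> tiny groups erase much of the 2-bit size win
```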
LoRA Fine-Tuning
```python
from transformers import AutoModelForCausalLM, HqqConfig
from peft import LoraConfig, get_peft_model

quant_config = HqqConfig(nbits=4, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quant_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
# Train with Trainer or a custom loop
```
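Only the LoRA adapters receive gradients here; the HQQ-quantized base weights stay frozen. A quick sanity check before training (print_trainable_parameters comes with PEFT):

```python
# Adapters should account for well under 1% of total parameters
model.print_trainable_parameters()
```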
vLLM Integration
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
    quantization="hqq",
    dtype="float16",
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(["What is machine learning?"], sampling_params)
```
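Each element of outputs is a vLLM RequestOutput; a minimal sketch for reading back the generated text:

```python
for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)
```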
Best Practices
- Start with 4-bit, group_size=64 — best quality/compression tradeoff for most models
- Use Marlin backend on Ampere+ GPUs for maximum inference speed
- Apply mixed precision — keep attention layers at 4-bit, compress MLP layers to 2-bit
- Verify generation quality after quantization before deploying
- Combine with torch.compile for additional inference speedup
- Use vLLM for production serving of HQQ-quantized models
Common Issues
Out of memory during quantization:
Use device_map="sequential" to load and quantize layers one at a time instead of loading the full model.
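As a sketch, reusing the Quick Start configuration from above:

```python
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quantization_config,
    device_map="sequential",  # place and quantize layers in order instead of all at once
)
```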
Poor quality at 2-bit: Reduce group_size to 16. If quality is still insufficient, use mixed precision with 4-bit attention and 2-bit MLP layers.
Slow inference:
Switch to an optimized backend (marlin for Ampere+, torchao_int4 for general CUDA). Compile the model with torch.compile(model, mode="reduce-overhead").
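Putting both fixes together, a minimal sketch (the string-based set_backend call mirrors the usage shown earlier in this guide; treat the backend name as illustrative for your hardware):

```python
import torch
from hqq.core.quantize import HQQLinear

HQQLinear.set_backend("marlin")  # e.g. marlin on Ampere+; see the backend table above

# reduce-overhead mode uses CUDA graphs to cut per-call launch overhead
model = torch.compile(model, mode="reduce-overhead")
```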