bitsandbytes Optimization Expert
Boost productivity with this skill for LLM quantization and memory reduction. Includes structured workflows, validation checks, and reusable patterns for AI research.
bitsandbytes -- LLM Quantization and Memory Optimization
Overview
A comprehensive skill for reducing LLM memory requirements using bitsandbytes quantization. bitsandbytes enables loading and running large language models in 8-bit (INT8) or 4-bit (NF4/FP4) precision directly through the HuggingFace Transformers library, reducing GPU memory by 50-75% with less than 1% accuracy loss. It also provides 8-bit optimizers that reduce optimizer state memory by 75% during training. This skill covers inference quantization, QLoRA fine-tuning (4-bit base model + LoRA adapters), and 8-bit optimizer integration -- the three main workflows that make bitsandbytes essential for running and training large models on consumer and datacenter GPUs.
When to Use
- Loading a model that is too large for your GPU in full precision
- Running 7B-70B parameter models on consumer GPUs (8-24 GB VRAM)
- Fine-tuning large models with QLoRA on limited hardware
- Reducing optimizer memory during full-precision training
- Quick experimentation that does not require calibration data (unlike AWQ/GPTQ)
- Needing HuggingFace Transformers integration with minimal code changes
Quick Start
```bash
pip install bitsandbytes transformers accelerate torch
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit quantization (75% memory reduction)
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Inference (model uses ~4.5 GB instead of ~16 GB)
inputs = tokenizer("Explain quantum computing simply:", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Core Concepts
Quantization Levels
| Level | Memory Savings | Quality Impact | Config Key |
|---|---|---|---|
| 8-bit (INT8) | ~50% | <0.5% loss | load_in_8bit=True |
| 4-bit (NF4) | ~75% | <1% loss | load_in_4bit=True |
| 4-bit + Double Quant | ~78% | <1% loss | bnb_4bit_use_double_quant=True |
Memory Estimation
FP16: Parameters x 2 bytes
INT8: Parameters x 1 byte
INT4: Parameters x 0.5 bytes
Llama 3.1 8B:
FP16: 8B x 2 = 16 GB
INT8: 8B x 1 = 8 GB
INT4: 8B x 0.5 = 4 GB (+ overhead ~ 4.5 GB total)
Llama 3.1 70B:
FP16: 70B x 2 = 140 GB
INT8: 70B x 1 = 70 GB
INT4: 70B x 0.5 = 35 GB (+ overhead ~ 38 GB total)
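The arithmetic above can be wrapped in a small helper. This is a sketch: the overhead term for 4-bit (roughly 0.5 GB per 8B parameters, covering quantization constants and buffers) is an assumption chosen to match the figures above, not an exact formula.

```python
def estimate_model_memory_gb(params_billion: float, bits: int) -> float:
    """Rough weight-memory estimate for a dense LLM.

    bits: 16 (FP16/BF16), 8 (INT8), or 4 (NF4/FP4).
    Adds a small assumed overhead for 4-bit quantization constants.
    """
    base_gb = params_billion * (bits / 8)  # bytes per parameter
    overhead_gb = 0.0625 * params_billion if bits == 4 else 0.0
    return base_gb + overhead_gb

# Matches the figures above for Llama 3.1 8B:
print(estimate_model_memory_gb(8, 16))  # 16.0
print(estimate_model_memory_gb(8, 8))   # 8.0
print(estimate_model_memory_gb(8, 4))   # 4.5
```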
GPU Sizing Guide
| GPU VRAM | Model Size (4-bit) | Model Size (8-bit) |
|---|---|---|
| 8 GB | Up to 3B | Up to 3B |
| 12 GB | Up to 7-8B | Up to 3B |
| 16 GB | Up to 13B | Up to 7-8B |
| 24 GB | Up to 34B | Up to 13B |
| 40 GB | Up to 70B | Up to 34B |
| 80 GB | Up to 70B | Up to 70B |
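As a quick sanity check against this table, the 2/1/0.5 bytes-per-parameter rule can be turned into a weights-only feasibility test. The 20% headroom figure here is an assumption; real headroom grows with batch size and context length, which is why the table above is deliberately more conservative for 8-bit.

```python
def fits_on_gpu(params_billion: float, vram_gb: float, bits: int = 4,
                headroom: float = 0.2) -> bool:
    """Weights-only feasibility check: quantized weights plus a
    headroom fraction (for KV cache and activations) must fit in VRAM."""
    weights_gb = params_billion * (bits / 8)
    return weights_gb * (1 + headroom) <= vram_gb

print(fits_on_gpu(8, 12, bits=4))    # True: ~4.8 GB budget on a 12 GB card
print(fits_on_gpu(70, 24, bits=4))   # False: ~42 GB budget exceeds 24 GB
```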
8-bit Quantization Configuration
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,          # Outlier threshold for mixed-precision
    llm_int8_has_fp16_weight=False,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=config,
    device_map="auto",
)
```
4-bit Quantization Configuration
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # Compute in FP16 for speed
    bnb_4bit_quant_type="nf4",             # NormalFloat4 (better than FP4)
    bnb_4bit_use_double_quant=True,        # Quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=config,
    device_map="auto",
    torch_dtype=torch.float16,
)
```
Configuration Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| load_in_8bit | bool | False | Enable 8-bit quantization |
| load_in_4bit | bool | False | Enable 4-bit quantization |
| bnb_4bit_compute_dtype | dtype | float32 | Computation dtype (use float16 or bfloat16) |
| bnb_4bit_quant_type | str | "fp4" | Quantization type: "nf4" (recommended) or "fp4" |
| bnb_4bit_use_double_quant | bool | False | Nested quantization for extra savings |
| llm_int8_threshold | float | 6.0 | Outlier detection threshold for 8-bit |
| llm_int8_has_fp16_weight | bool | False | Store weights in FP16 (increases memory) |
| llm_int8_skip_modules | list | None | Modules to keep in full precision |
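As an example of the less common parameters, here is a sketch of an 8-bit config that keeps the output head in full precision via llm_int8_skip_modules. Module names are architecture-dependent; "lm_head" is a common choice but should be verified against your model.

```python
from transformers import BitsAndBytesConfig

# 8-bit load that leaves the output head unquantized
# (check model.named_modules() for the exact names in your architecture)
config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_skip_modules=["lm_head"],
)
```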
QLoRA Fine-Tuning
QLoRA combines a 4-bit quantized base model with trainable LoRA adapters for memory-efficient fine-tuning:
```bash
pip install bitsandbytes transformers peft accelerate datasets trl
```
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig

# Step 1: Load base model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token

# Step 2: Prepare model and add LoRA
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,           # LoRA rank
    lora_alpha=32,  # LoRA alpha scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~42M || all params: ~8B || trainable%: ~0.5%

# Step 3: Train with SFTTrainer
training_config = SFTConfig(
    output_dir="./qlora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    max_seq_length=2048,
)
trainer = SFTTrainer(
    model=model,
    args=training_config,
    train_dataset=train_dataset,  # your prepared dataset
    processing_class=tokenizer,
)
trainer.train()

# Step 4: Save adapters (only ~20-50 MB)
model.save_pretrained("./qlora-adapters")
tokenizer.save_pretrained("./qlora-adapters")
```
Loading QLoRA Adapters for Inference
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Reload base model in 4-bit
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_quant_type="nf4",
    ),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Load LoRA adapters on top
model = PeftModel.from_pretrained(base_model, "./qlora-adapters")

# Inference with the fine-tuned model
inputs = tokenizer("Your custom prompt:", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
8-bit Optimizers
Reduce optimizer state memory by 75% during training:
```python
import bitsandbytes as bnb
from transformers import Trainer, TrainingArguments

# Option 1: Via Trainer (simplest)
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=8,
    optim="paged_adamw_8bit",  # 8-bit paged optimizer
    learning_rate=5e-5,
    num_train_epochs=3,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()

# Option 2: Manual optimizer
optimizer = bnb.optim.AdamW8bit(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    eps=1e-8,
)

# Memory savings:
#   Standard AdamW: parameters x 8 bytes (2 FP32 states per param)
#   8-bit AdamW:    parameters x 2 bytes
#   Llama 3.1 8B:   64 GB -> 16 GB optimizer memory
```
Verifying Quantization
```python
import torch

# Check memory after loading
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"GPU memory reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# Check model dtype (just the first parameter)
for name, param in model.named_parameters():
    print(f"{name}: {param.dtype}, shape: {param.shape}")
    break

# Check quantization is active
print(f"Is quantized: {hasattr(model.config, 'quantization_config')}")
if hasattr(model.config, "quantization_config"):
    print(f"Quantization: {model.config.quantization_config.to_dict()}")
```
Best Practices
- Use NF4 over FP4 for 4-bit quantization -- NormalFloat4 is information-theoretically optimal for normally distributed weights and consistently outperforms FP4 in benchmarks.
- Always set `bnb_4bit_compute_dtype=torch.float16` -- The default float32 compute dtype negates much of the speed benefit. Use float16 or bfloat16 for computation.
- Enable double quantization for maximum memory savings -- `bnb_4bit_use_double_quant=True` quantizes the quantization constants themselves, saving an additional ~3% memory.
- Use `device_map="auto"` for automatic placement -- Let Accelerate handle multi-GPU and CPU offload placement rather than manually assigning devices.
- Prefer 4-bit for inference, 8-bit when quality is critical -- 4-bit is sufficient for most generative tasks. Use 8-bit when you need near-FP16 quality for sensitive classification or reasoning.
- Set `pad_token` before QLoRA training -- Many models lack a pad token. Set `tokenizer.pad_token = tokenizer.eos_token` to avoid training errors.
- Target all linear layers for QLoRA -- Include `q_proj`, `k_proj`, `v_proj`, `o_proj`, and the MLP projections (`gate_proj`, `up_proj`, `down_proj`) for best fine-tuning quality.
- Use paged optimizers for training -- `paged_adamw_8bit` handles GPU memory spikes gracefully by paging optimizer states to CPU when VRAM is exhausted.
- Do not quantize embedding or output layers -- Use `llm_int8_skip_modules` to keep embedding and `lm_head` layers in full precision for better quality.
- Benchmark with your specific task -- Quantization impact varies by task. Evaluate on your actual downstream task (not just perplexity) to confirm acceptable quality.
Troubleshooting
ImportError: libbitsandbytes not found:
Install with pip install bitsandbytes. On Linux, ensure CUDA toolkit is installed. Check CUDA version compatibility with python -m bitsandbytes.
Model loads but inference is slow:
Verify bnb_4bit_compute_dtype is set to torch.float16, not the default torch.float32. The default causes all computations to run in full precision.
CUDA out of memory despite quantization:
The KV cache and activations are still in FP16. Reduce max_new_tokens, batch size, or context length. Use torch.cuda.empty_cache() between batches.
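As a rough aid for this case, KV-cache size can be estimated from the model config. This is a sketch assuming an FP16 cache; pass the number of KV heads, which accounts for grouped-query attention. The Llama 3.1 8B figures in the example (32 layers, 8 KV heads, head dim 128) are taken from its published config.

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int = 1,
                bytes_per_elem: int = 2) -> float:
    """FP16 KV-cache estimate: 2 tensors (K and V) per layer."""
    elems = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elems * bytes_per_elem / 1e9

# Llama 3.1 8B at 8k context, batch size 1: ~1.07 GB on top of the weights
print(kv_cache_gb(32, 8, 128, 8192))
```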
QLoRA training loss is NaN or diverges:
Lower the learning rate to 1e-4 or 5e-5. Ensure gradient accumulation steps are set correctly. Check that pad_token is properly configured.
Quantized model gives different results on different GPUs:
bitsandbytes quantization is not deterministic across GPU architectures. Small differences are expected between, e.g., an A100 and an RTX 4090. This does not affect overall quality.