Pro Optimization Gguf

Production-ready skill covering the GGUF format, llama.cpp, and quantization. Includes structured workflows, validation checks, and reusable patterns for AI research.


GGUF -- Quantized Model Format for llama.cpp

Overview

A comprehensive skill for working with the GGUF (GPT-Generated Unified Format) file format, the standard for running quantized LLMs with llama.cpp. GGUF enables efficient inference on consumer hardware -- CPUs, Apple Silicon with Metal acceleration, NVIDIA GPUs with CUDA, and AMD GPUs with ROCm -- without requiring a Python runtime or deep learning framework. The format supports flexible quantization from 2-bit to 8-bit using K-quant methods, with optional importance matrix (imatrix) calibration for better low-bit quality. GGUF is the backbone of local AI applications including Ollama, LM Studio, text-generation-webui, and koboldcpp. This skill covers model conversion, quantization, inference with llama.cpp and llama-cpp-python, server deployment, and hardware optimization.

When to Use

  • Running LLMs on consumer laptops, desktops, or Mac hardware without a dedicated GPU
  • Deploying on Apple Silicon (M1/M2/M3/M4) with Metal GPU acceleration
  • Running CPU-only inference without Python or PyTorch dependencies
  • Trading quality against size with flexible quantization levels (Q2_K through Q8_0)
  • Using local AI tools (Ollama, LM Studio, text-generation-webui)
  • Deploying lightweight, self-contained inference servers with OpenAI-compatible APIs
  • Running offline or air-gapped inference without cloud dependencies

Quick Start

# Clone and build llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build for CPU
make

# Build for Apple Silicon (Metal)
make GGML_METAL=1

# Build for NVIDIA GPU (CUDA)
make GGML_CUDA=1

# Download a pre-quantized GGUF model
huggingface-cli download TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_K_M.gguf \
    --local-dir ./models

# Run inference
./llama-cli -m ./models/llama-2-7b.Q4_K_M.gguf \
    -p "Explain recursion in simple terms:" \
    -n 200

Core Concepts

Quantization Types (K-Quants)

K-quant methods use per-block scaling and mixed precision for optimal quality at each bit level:
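The per-block scaling idea can be illustrated with a toy symmetric quantizer. This is a simplification for intuition only: real K-quants use super-blocks with separate scale and min values per sub-block, packed into a custom layout.

```python
import numpy as np

def quantize_block(block, bits=4):
    """Symmetric per-block quantization: one floating-point scale per block."""
    qmax = 2 ** (bits - 1) - 1               # 7 for 4-bit signed integers
    scale = float(np.abs(block).max()) / qmax
    if scale == 0.0:
        scale = 1.0                          # all-zero block: any scale works
    q = np.clip(np.round(block / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_block(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=256).astype(np.float32)
blocks = w.reshape(-1, 32)                   # GGUF groups weights in blocks of 32
restored = np.concatenate(
    [dequantize_block(*quantize_block(b)) for b in blocks]
)
max_err = float(np.abs(restored - w).max())  # bounded by half a scale step per block
```

Because each block gets its own scale, a single outlier weight only degrades precision within its own block of 32 values, not across the whole tensor.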

| Type   | Avg Bits | Size (7B) | Size (70B) | Quality      | Recommended Use            |
|--------|----------|-----------|------------|--------------|----------------------------|
| Q2_K   | 2.5      | ~2.8 GB   | ~28 GB     | Low          | Extreme memory constraints |
| Q3_K_S | 3.0      | ~3.0 GB   | ~30 GB     | Low-Medium   | Tight memory budget        |
| Q3_K_M | 3.3      | ~3.3 GB   | ~33 GB     | Medium       | Memory-constrained balance |
| Q4_K_S | 4.0      | ~3.8 GB   | ~38 GB     | Medium-High  | Good balance, smaller      |
| Q4_K_M | 4.5      | ~4.1 GB   | ~41 GB     | High         | Recommended default        |
| Q5_K_S | 5.0      | ~4.6 GB   | ~46 GB     | High         | Quality-focused            |
| Q5_K_M | 5.5      | ~4.8 GB   | ~48 GB     | Very High    | High quality, more memory  |
| Q6_K   | 6.0      | ~5.5 GB   | ~55 GB     | Excellent    | Near-original quality      |
| Q8_0   | 8.0      | ~7.2 GB   | ~72 GB     | Near-Perfect | Maximum quality quant      |

Rule of thumb: Q4_K_M offers the best quality-to-size ratio for most use cases. Use Q5_K_M or Q6_K if you have the memory budget and need higher fidelity.
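The sizes in the table follow almost directly from bits-per-weight; a back-of-the-envelope sketch is below. Real files come out somewhat larger because embedding and output tensors are often kept at higher precision and the file carries metadata, which is why the table lists ~4.1 GB rather than ~3.9 GB for a 7B Q4_K_M.

```python
def quant_size_gb(n_params, bits_per_weight):
    """Rough GGUF file-size estimate: parameters x bits / 8, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9

# 7B model at Q4_K_M's ~4.5 bits/weight
size = quant_size_gb(7e9, 4.5)
print(f"{size:.1f} GB")  # → 3.9 GB
```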

Conversion Pipeline

HuggingFace Model (FP16/BF16 safetensors)
        │
        ▼
   convert_hf_to_gguf.py ──► GGUF (F16 or F32)
        │
        ▼
   llama-quantize ──► GGUF (Q4_K_M, Q5_K_M, etc.)
        │
        ├── Optional: llama-imatrix ──► importance matrix
        │   └── llama-quantize --imatrix ──► Better low-bit quality
        │
        ▼
   Deploy: llama-cli / llama-server / Ollama / LM Studio

Model Conversion

HuggingFace to GGUF

# Step 1: Download model
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
    --local-dir ./llama-3.1-8b

# Step 2: Convert to GGUF (FP16 baseline)
cd llama.cpp
pip install -r requirements.txt
python convert_hf_to_gguf.py ../llama-3.1-8b \
    --outfile llama-3.1-8b-f16.gguf \
    --outtype f16

# Step 3: Quantize to Q4_K_M
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M

# Step 4: Verify
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50

Importance Matrix Quantization (Better Quality)

The importance matrix measures which weights matter most, allowing the quantizer to allocate precision where it counts:

# Step 1: Create calibration dataset (diverse text samples)
cat > calibration.txt << 'EOF'
Machine learning is a subset of artificial intelligence that enables systems to learn from data.
The quick brown fox jumps over the lazy dog.
In quantum computing, qubits can exist in superposition of states.
Python is a popular programming language known for its readable syntax.
The mitochondria is the powerhouse of the cell.
EOF

# Step 2: Generate importance matrix
./llama-imatrix -m llama-3.1-8b-f16.gguf \
    -f calibration.txt \
    --chunk 512 \
    -o llama-3.1-8b.imatrix \
    -ngl 35  # GPU layers for faster processing

# Step 3: Quantize with importance matrix
./llama-quantize --imatrix llama-3.1-8b.imatrix \
    llama-3.1-8b-f16.gguf \
    llama-3.1-8b-q4_k_m-imat.gguf \
    Q4_K_M

Batch Quantization Script

#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"
PREFIX="llama-3.1-8b"

# Generate imatrix once
./llama-imatrix -m "$MODEL" -f wiki.txt -o "$IMATRIX" -ngl 35

# Create multiple quantization levels
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
    OUTPUT="${PREFIX}-${QUANT,,}.gguf"
    echo "Creating $OUTPUT..."
    ./llama-quantize --imatrix "$IMATRIX" "$MODEL" "$OUTPUT" "$QUANT"
    echo "  Size: $(du -h "$OUTPUT" | cut -f1)"
done

Python Integration (llama-cpp-python)

# Install with CUDA support
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

# Install with Metal support (macOS)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python

# CPU only
pip install llama-cpp-python

Basic Inference

from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.1-8b-q4_k_m.gguf",
    n_ctx=8192,        # Context window size
    n_gpu_layers=35,   # Layers to offload to GPU (-1 = all)
    n_threads=8,       # CPU threads (for non-offloaded layers)
    verbose=False,
)

# Text completion
output = llm(
    "Explain the difference between TCP and UDP:",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
    stop=["</s>", "\n\n\n"],
)
print(output["choices"][0]["text"])

Chat Completion

from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.1-8b-q4_k_m.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,        # Offload all layers
    chat_format="llama-3",  # Match model's chat template
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "What is a hash table?"},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])

Streaming Output

for chunk in llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a haiku about programming."}],
    max_tokens=100,
    stream=True,
):
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)

Server Deployment

llama.cpp Server (OpenAI-Compatible)

# Start server
./llama-server \
    -m models/llama-3.1-8b-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 8192 \
    --n-predict 512 \
    -t 8

# Or with Python bindings
python -m llama_cpp.server \
    --model models/llama-3.1-8b-q4_k_m.gguf \
    --n_gpu_layers 35 \
    --host 0.0.0.0 \
    --port 8080

Using with OpenAI Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Docker containers briefly."},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)

Ollama Integration

# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./llama-3.1-8b-q4_k_m.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
SYSTEM "You are a helpful technical assistant."
EOF

# Create and run model
ollama create my-llama -f Modelfile
ollama run my-llama "What is Kubernetes?"

Hardware Optimization

Apple Silicon (Metal)

# Build with Metal
make clean && make GGML_METAL=1

# Run with full GPU offload
./llama-cli -m model.gguf -ngl 99 -p "Hello!"
# Python with Metal
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,   # Offload all layers to Metal
    n_threads=1,       # Metal handles parallelism
    use_mmap=True,     # Memory-mapped file loading
)

NVIDIA GPU (CUDA)

# Build with CUDA
make clean && make GGML_CUDA=1

# Partial offload (when model exceeds VRAM)
./llama-cli -m model.gguf -ngl 20 -p "Hello!"  # First 20 layers on GPU, rest on CPU

CPU Optimization

# Detect optimal thread count
nproc              # Linux
sysctl -n hw.ncpu  # macOS

# Run with optimal threading
./llama-cli -m model.gguf -ngl 0 -t 8 -p "Hello!"
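The same detection can be done from Python before constructing a Llama instance. Note the caveat: os.cpu_count() reports logical cores, and the halving below is only a heuristic guess at the physical count on SMT machines.

```python
import os

logical = os.cpu_count() or 1
# SMT/Hyper-Threading usually exposes 2 logical cores per physical core;
# llama.cpp tends to run best with roughly one thread per physical core.
threads = max(1, logical // 2)
print(f"logical={logical}, using n_threads={threads}")
```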

Configuration Reference

llama-cli Parameters

| Parameter        | Default  | Description                         |
|------------------|----------|-------------------------------------|
| -m               | Required | Path to GGUF model file             |
| -p               | None     | Prompt text                         |
| -n               | 128      | Maximum tokens to generate          |
| -ngl             | 0        | Number of layers to offload to GPU  |
| -c               | 2048     | Context window size                 |
| -t               | Auto     | Number of CPU threads               |
| -b               | 512      | Batch size for prompt processing    |
| --temp           | 0.8      | Sampling temperature                |
| --top-k          | 40       | Top-K sampling                      |
| --top-p          | 0.9      | Top-P (nucleus) sampling            |
| --repeat-penalty | 1.1      | Repetition penalty                  |
| --mlock          | Off      | Lock model in RAM (prevent swap)    |
| --mmap           | On       | Memory-map model file               |

llama-server Parameters

| Parameter        | Default   | Description                          |
|------------------|-----------|--------------------------------------|
| --host           | 127.0.0.1 | Listen address                       |
| --port           | 8080      | Listen port                          |
| -ngl             | 0         | GPU layers to offload                |
| -c               | 2048      | Context window                       |
| --n-predict      | -1        | Default max tokens (-1 = unlimited)  |
| -t               | Auto      | CPU threads                          |
| --parallel       | 1         | Number of parallel request slots     |
| --embedding      | Off       | Enable embedding endpoint            |
| --cont-batching  | Off       | Enable continuous batching           |

Best Practices

  1. Default to Q4_K_M quantization -- It offers the best quality-to-size ratio for general use. Only go lower (Q3_K_M) if memory is truly constrained.
  2. Use importance matrix for low-bit quantization -- Below Q4, the quality difference between standard and imatrix quantization is significant. Always use imatrix for Q2_K and Q3_K variants.
  3. Offload as many layers to GPU as possible -- Each GPU-offloaded layer provides substantial speedup. Use -ngl -1 to offload everything, or find the maximum that fits in VRAM.
  4. Match chat format to the model -- Set chat_format correctly in llama-cpp-python (e.g., "llama-3", "chatml", "mistral"). Wrong format produces garbled or repetitive output.
  5. Use --mmap for fast model loading -- Memory-mapped loading is the default and dramatically reduces startup time compared to reading the entire file into memory.
  6. Set context window appropriately -- Larger context windows consume proportionally more memory for KV cache. Set -c to the maximum you actually need, not the model's limit.
  7. Pre-quantize to multiple levels -- Create Q4_K_M, Q5_K_M, and Q8_0 variants from one F16 base. Users can then select based on their hardware constraints.
  8. Use Ollama for deployment simplicity -- For production serving, Ollama handles model management, API serving, and hardware detection automatically.
  9. Monitor generation speed -- llama.cpp reports tokens/second. On Apple Silicon M2, expect 15-30 tok/s for Q4_K_M 7B. On CPU-only, expect 3-10 tok/s depending on cores.
  10. Keep the F16 GGUF as your source of truth -- Always archive the FP16 GGUF file. You can re-quantize to any level from it, but cannot recover precision from a quantized file.
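The KV-cache cost called out in practice 6 can be estimated directly. A sketch below uses assumed Llama-3.1-8B shape parameters (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and an f16 cache; check your model's config for the real values.

```python
# Assumed Llama-3.1-8B shape (from the model config; adjust per model)
n_layers, n_kv_heads, head_dim = 32, 8, 128
n_ctx = 8192
bytes_per_elem = 2  # f16 KV cache

# K and V each store n_ctx x n_kv_heads x head_dim values per layer
kv_bytes = 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem
print(f"KV cache at {n_ctx} ctx: {kv_bytes / 2**30:.2f} GiB")  # → 1.00 GiB
```

Doubling `-c` doubles this figure, which is why setting the context to the model's full limit can cost several GiB you may not need.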

Troubleshooting

Model loads but generation is very slow: Check that GPU offloading is enabled (-ngl > 0). On Apple Silicon, build with GGML_METAL=1. Verify thread count matches physical cores, not logical cores.

Output is garbled or repetitive: Wrong chat template format. Set chat_format to match the model family. For Llama 3, use "llama-3". For Mistral, use "mistral-instruct". Check the model card on HuggingFace.

CUDA out of memory when offloading layers: Reduce -ngl to offload fewer layers. The remaining layers run on CPU. Find the sweet spot by binary search: try -ngl 20, then adjust up or down.
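The sweet-spot search above can be automated as a plain binary search. In this sketch, `fits` is a hypothetical probe you would supply, e.g. a function that launches the model with `-ngl n` and reports whether it loaded without a CUDA OOM; the mock below just pretends 20 of 33 layers fit.

```python
def max_offload_layers(fits, n_layers):
    """Largest n in [0, n_layers] for which fits(n) is True.

    Assumes fits is monotone: if n layers fit in VRAM, so does any smaller n.
    """
    lo, hi = 0, n_layers
    while lo < hi:
        mid = (lo + hi + 1) // 2  # bias up so the loop always progresses
        if fits(mid):
            lo = mid              # mid fits: search higher
        else:
            hi = mid - 1          # mid OOMs: search lower
    return lo

# Mock probe: pretend only the first 20 of 33 layers fit in VRAM
best = max_offload_layers(lambda n: n <= 20, 33)
print(best)  # → 20
```

Each probe loads the model once, so the search needs only ~log2(n_layers) launches instead of trying every value.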

convert_hf_to_gguf.py fails with unsupported model: Not all architectures are supported. Check llama.cpp's supported model list. For newer architectures, update to the latest llama.cpp version.

llama-server returns empty responses: Check that --n-predict is not set to 0. Verify the prompt format matches the model's expected template. Test with llama-cli first to confirm the model works.
