Pro Optimization Gguf

Production-ready skill covering the GGUF format, llama.cpp, and quantization. Includes structured workflows, validation checks, and reusable patterns for AI research.


GGUF -- Quantized Model Format for llama.cpp

Overview

A comprehensive skill for working with the GGUF (GPT-Generated Unified Format) file format, the standard for running quantized LLMs with llama.cpp. GGUF enables efficient inference on consumer hardware -- CPUs, Apple Silicon with Metal acceleration, NVIDIA GPUs with CUDA, and AMD GPUs with ROCm -- without requiring a Python runtime or deep learning framework. The format supports flexible quantization from 2-bit to 8-bit using K-quant methods, with optional importance matrix (imatrix) calibration for better low-bit quality. GGUF is the backbone of local AI applications including Ollama, LM Studio, text-generation-webui, and koboldcpp. This skill covers model conversion, quantization, inference with llama.cpp and llama-cpp-python, server deployment, and hardware optimization.

When to Use

  • Running LLMs on consumer laptops, desktops, or Mac hardware without a dedicated GPU
  • Deploying on Apple Silicon (M1/M2/M3/M4) with Metal GPU acceleration
  • Running CPU-only inference without Python or PyTorch dependencies
  • Trading quality against size with flexible quantization levels (Q2_K through Q8_0)
  • Using local AI tools (Ollama, LM Studio, text-generation-webui)
  • Deploying lightweight, self-contained inference servers with OpenAI-compatible APIs
  • Running offline or air-gapped inference without cloud dependencies

Quick Start

# Clone and build llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build for CPU
make

# Build for Apple Silicon (Metal)
make GGML_METAL=1

# Build for NVIDIA GPU (CUDA)
make GGML_CUDA=1

# Download a pre-quantized GGUF model
huggingface-cli download TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_K_M.gguf \
    --local-dir ./models

# Run inference
./llama-cli -m ./models/llama-2-7b.Q4_K_M.gguf \
    -p "Explain recursion in simple terms:" \
    -n 200

Core Concepts

Quantization Types (K-Quants)

K-quant methods use per-block scaling and mixed precision for optimal quality at each bit level:
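The per-block scaling idea can be illustrated with a toy symmetric quantizer. This is a simplification for intuition only: real K-quants use super-blocks with separate scale and min values per sub-block, packed into a custom layout.

```python
import numpy as np

def quantize_block(block, bits=4):
    """Symmetric per-block quantization: one floating-point scale per block."""
    qmax = 2 ** (bits - 1) - 1               # 7 for 4-bit signed integers
    scale = float(np.abs(block).max()) / qmax
    if scale == 0.0:
        scale = 1.0                          # all-zero block: any scale works
    q = np.clip(np.round(block / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_block(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=256).astype(np.float32)
blocks = w.reshape(-1, 32)                   # GGUF groups weights in blocks of 32
restored = np.concatenate(
    [dequantize_block(*quantize_block(b)) for b in blocks]
)
max_err = float(np.abs(restored - w).max())  # bounded by half a scale step per block
```

Because each block gets its own scale, a single outlier weight only degrades precision within its own block of 32 values, not across the whole tensor.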

| Type   | Avg Bits | Size (7B) | Size (70B) | Quality      | Recommended Use            |
|--------|----------|-----------|------------|--------------|----------------------------|
| Q2_K   | 2.5      | ~2.8 GB   | ~28 GB     | Low          | Extreme memory constraints |
| Q3_K_S | 3.0      | ~3.0 GB   | ~30 GB     | Low-Medium   | Tight memory budget        |
| Q3_K_M | 3.3      | ~3.3 GB   | ~33 GB     | Medium       | Memory-constrained balance |
| Q4_K_S | 4.0      | ~3.8 GB   | ~38 GB     | Medium-High  | Good balance, smaller      |
| Q4_K_M | 4.5      | ~4.1 GB   | ~41 GB     | High         | Recommended default        |
| Q5_K_S | 5.0      | ~4.6 GB   | ~46 GB     | High         | Quality-focused            |
| Q5_K_M | 5.5      | ~4.8 GB   | ~48 GB     | Very High    | High quality, more memory  |
| Q6_K   | 6.0      | ~5.5 GB   | ~55 GB     | Excellent    | Near-original quality      |
| Q8_0   | 8.0      | ~7.2 GB   | ~72 GB     | Near-Perfect | Maximum quality quant      |

Rule of thumb: Q4_K_M offers the best quality-to-size ratio for most use cases. Use Q5_K_M or Q6_K if you have the memory budget and need higher fidelity.
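The sizes in the table follow almost directly from bits-per-weight; a back-of-the-envelope sketch is below. Real files come out somewhat larger because embedding and output tensors are often kept at higher precision and the file carries metadata, which is why the table lists ~4.1 GB rather than ~3.9 GB for a 7B Q4_K_M.

```python
def quant_size_gb(n_params, bits_per_weight):
    """Rough GGUF file-size estimate: parameters x bits / 8, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9

# 7B model at Q4_K_M's ~4.5 bits/weight
size = quant_size_gb(7e9, 4.5)
print(f"{size:.1f} GB")  # → 3.9 GB
```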

Conversion Pipeline

HuggingFace Model (FP16/BF16 safetensors)
        │
        ▼
   convert_hf_to_gguf.py ──► GGUF (F16 or F32)
        │
        ▼
   llama-quantize ──► GGUF (Q4_K_M, Q5_K_M, etc.)
        │
        ├── Optional: llama-imatrix ──► importance matrix
        │   └── llama-quantize --imatrix ──► Better low-bit quality
        │
        ▼
   Deploy: llama-cli / llama-server / Ollama / LM Studio

Model Conversion

HuggingFace to GGUF

# Step 1: Download model
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
    --local-dir ./llama-3.1-8b

# Step 2: Convert to GGUF (FP16 baseline)
cd llama.cpp
pip install -r requirements.txt
python convert_hf_to_gguf.py ../llama-3.1-8b \
    --outfile llama-3.1-8b-f16.gguf \
    --outtype f16

# Step 3: Quantize to Q4_K_M
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M

# Step 4: Verify
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50

Importance Matrix Quantization (Better Quality)

The importance matrix measures which weights matter most, allowing the quantizer to allocate precision where it counts:

# Step 1: Create calibration dataset (diverse text samples)
cat > calibration.txt << 'EOF'
Machine learning is a subset of artificial intelligence that enables systems to learn from data.
The quick brown fox jumps over the lazy dog.
In quantum computing, qubits can exist in superposition of states.
Python is a popular programming language known for its readable syntax.
The mitochondria is the powerhouse of the cell.
EOF

# Step 2: Generate importance matrix
./llama-imatrix -m llama-3.1-8b-f16.gguf \
    -f calibration.txt \
    --chunk 512 \
    -o llama-3.1-8b.imatrix \
    -ngl 35  # GPU layers for faster processing

# Step 3: Quantize with importance matrix
./llama-quantize --imatrix llama-3.1-8b.imatrix \
    llama-3.1-8b-f16.gguf \
    llama-3.1-8b-q4_k_m-imat.gguf \
    Q4_K_M

Batch Quantization Script

#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"
PREFIX="llama-3.1-8b"

# Generate imatrix once
./llama-imatrix -m "$MODEL" -f wiki.txt -o "$IMATRIX" -ngl 35

# Create multiple quantization levels
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
    OUTPUT="${PREFIX}-${QUANT,,}.gguf"
    echo "Creating $OUTPUT..."
    ./llama-quantize --imatrix "$IMATRIX" "$MODEL" "$OUTPUT" "$QUANT"
    echo "  Size: $(du -h "$OUTPUT" | cut -f1)"
done

Python Integration (llama-cpp-python)

# Install with CUDA support
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

# Install with Metal support (macOS)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python

# CPU only
pip install llama-cpp-python

Basic Inference

from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.1-8b-q4_k_m.gguf",
    n_ctx=8192,        # Context window size
    n_gpu_layers=35,   # Layers to offload to GPU (-1 = all)
    n_threads=8,       # CPU threads (for non-offloaded layers)
    verbose=False,
)

# Text completion
output = llm(
    "Explain the difference between TCP and UDP:",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
    stop=["</s>", "\n\n\n"],
)
print(output["choices"][0]["text"])

Chat Completion

from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.1-8b-q4_k_m.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,        # Offload all layers
    chat_format="llama-3",  # Match model's chat template
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "What is a hash table?"},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])

Streaming Output

for chunk in llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a haiku about programming."}],
    max_tokens=100,
    stream=True,
):
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)

Server Deployment

llama.cpp Server (OpenAI-Compatible)

# Start server
./llama-server \
    -m models/llama-3.1-8b-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 8192 \
    --n-predict 512 \
    -t 8

# Or with Python bindings
python -m llama_cpp.server \
    --model models/llama-3.1-8b-q4_k_m.gguf \
    --n_gpu_layers 35 \
    --host 0.0.0.0 \
    --port 8080

Using with OpenAI Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Docker containers briefly."},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)

Ollama Integration

# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./llama-3.1-8b-q4_k_m.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
SYSTEM "You are a helpful technical assistant."
EOF

# Create and run model
ollama create my-llama -f Modelfile
ollama run my-llama "What is Kubernetes?"

Hardware Optimization

Apple Silicon (Metal)

# Build with Metal
make clean && make GGML_METAL=1

# Run with full GPU offload
./llama-cli -m model.gguf -ngl 99 -p "Hello!"
# Python with Metal
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,   # Offload all layers to Metal
    n_threads=1,       # Metal handles parallelism
    use_mmap=True,     # Memory-mapped file loading
)

NVIDIA GPU (CUDA)

# Build with CUDA
make clean && make GGML_CUDA=1

# Partial offload (when model exceeds VRAM)
./llama-cli -m model.gguf -ngl 20 -p "Hello!"  # First 20 layers on GPU, rest on CPU

CPU Optimization

# Detect optimal thread count
nproc              # Linux
sysctl -n hw.ncpu  # macOS

# Run with optimal threading
./llama-cli -m model.gguf -ngl 0 -t 8 -p "Hello!"
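The same detection can be done from Python before constructing a Llama instance. Note the caveat: os.cpu_count() reports logical cores, and the halving below is only a heuristic guess at the physical count on SMT machines.

```python
import os

logical = os.cpu_count() or 1
# SMT/Hyper-Threading usually exposes 2 logical cores per physical core;
# llama.cpp tends to run best with roughly one thread per physical core.
threads = max(1, logical // 2)
print(f"logical={logical}, using n_threads={threads}")
```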

Configuration Reference

llama-cli Parameters

| Parameter        | Default  | Description                         |
|------------------|----------|-------------------------------------|
| -m               | Required | Path to GGUF model file             |
| -p               | None     | Prompt text                         |
| -n               | 128      | Maximum tokens to generate          |
| -ngl             | 0        | Number of layers to offload to GPU  |
| -c               | 2048     | Context window size                 |
| -t               | Auto     | Number of CPU threads               |
| -b               | 512      | Batch size for prompt processing    |
| --temp           | 0.8      | Sampling temperature                |
| --top-k          | 40       | Top-K sampling                      |
| --top-p          | 0.9      | Top-P (nucleus) sampling            |
| --repeat-penalty | 1.1      | Repetition penalty                  |
| --mlock          | Off      | Lock model in RAM (prevent swap)    |
| --mmap           | On       | Memory-map model file               |

llama-server Parameters

| Parameter        | Default   | Description                          |
|------------------|-----------|--------------------------------------|
| --host           | 127.0.0.1 | Listen address                       |
| --port           | 8080      | Listen port                          |
| -ngl             | 0         | GPU layers to offload                |
| -c               | 2048      | Context window                       |
| --n-predict      | -1        | Default max tokens (-1 = unlimited)  |
| -t               | Auto      | CPU threads                          |
| --parallel       | 1         | Number of parallel request slots     |
| --embedding      | Off       | Enable embedding endpoint            |
| --cont-batching  | Off       | Enable continuous batching           |

Best Practices

  1. Default to Q4_K_M quantization -- It offers the best quality-to-size ratio for general use. Only go lower (Q3_K_M) if memory is truly constrained.
  2. Use importance matrix for low-bit quantization -- Below Q4, the quality difference between standard and imatrix quantization is significant. Always use imatrix for Q2_K and Q3_K variants.
  3. Offload as many layers to GPU as possible -- Each GPU-offloaded layer provides substantial speedup. Use -ngl -1 to offload everything, or find the maximum that fits in VRAM.
  4. Match chat format to the model -- Set chat_format correctly in llama-cpp-python (e.g., "llama-3", "chatml", "mistral"). Wrong format produces garbled or repetitive output.
  5. Use --mmap for fast model loading -- Memory-mapped loading is the default and dramatically reduces startup time compared to reading the entire file into memory.
  6. Set context window appropriately -- Larger context windows consume proportionally more memory for KV cache. Set -c to the maximum you actually need, not the model's limit.
  7. Pre-quantize to multiple levels -- Create Q4_K_M, Q5_K_M, and Q8_0 variants from one F16 base. Users can then select based on their hardware constraints.
  8. Use Ollama for deployment simplicity -- For production serving, Ollama handles model management, API serving, and hardware detection automatically.
  9. Monitor generation speed -- llama.cpp reports tokens/second. On Apple Silicon M2, expect 15-30 tok/s for Q4_K_M 7B. On CPU-only, expect 3-10 tok/s depending on cores.
  10. Keep the F16 GGUF as your source of truth -- Always archive the FP16 GGUF file. You can re-quantize to any level from it, but cannot recover precision from a quantized file.
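The KV-cache cost called out in practice 6 can be estimated directly. A sketch below uses assumed Llama-3.1-8B shape parameters (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and an f16 cache; check your model's config for the real values.

```python
# Assumed Llama-3.1-8B shape (from the model config; adjust per model)
n_layers, n_kv_heads, head_dim = 32, 8, 128
n_ctx = 8192
bytes_per_elem = 2  # f16 KV cache

# K and V each store n_ctx x n_kv_heads x head_dim values per layer
kv_bytes = 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem
print(f"KV cache at {n_ctx} ctx: {kv_bytes / 2**30:.2f} GiB")  # → 1.00 GiB
```

Doubling `-c` doubles this figure, which is why setting the context to the model's full limit can cost several GiB you may not need.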

Troubleshooting

Model loads but generation is very slow: Check that GPU offloading is enabled (-ngl > 0). On Apple Silicon, build with GGML_METAL=1. Verify thread count matches physical cores, not logical cores.

Output is garbled or repetitive: Wrong chat template format. Set chat_format to match the model family. For Llama 3, use "llama-3". For Mistral, use "mistral-instruct". Check the model card on HuggingFace.

CUDA out of memory when offloading layers: Reduce -ngl to offload fewer layers. The remaining layers run on CPU. Find the sweet spot by binary search: try -ngl 20, then adjust up or down.
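The sweet-spot search above can be automated as a plain binary search. In this sketch, `fits` is a hypothetical probe you would supply, e.g. a function that launches the model with `-ngl n` and reports whether it loaded without a CUDA OOM; the mock below just pretends 20 of 33 layers fit.

```python
def max_offload_layers(fits, n_layers):
    """Largest n in [0, n_layers] for which fits(n) is True.

    Assumes fits is monotone: if n layers fit in VRAM, so does any smaller n.
    """
    lo, hi = 0, n_layers
    while lo < hi:
        mid = (lo + hi + 1) // 2  # bias up so the loop always progresses
        if fits(mid):
            lo = mid              # mid fits: search higher
        else:
            hi = mid - 1          # mid OOMs: search lower
    return lo

# Mock probe: pretend only the first 20 of 33 layers fit in VRAM
best = max_offload_layers(lambda n: n <= 20, 33)
print(best)  # → 20
```

Each probe loads the model once, so the search needs only ~log2(n_layers) launches instead of trying every value.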

convert_hf_to_gguf.py fails with unsupported model: Not all architectures are supported. Check llama.cpp's supported model list. For newer architectures, update to the latest llama.cpp version.

llama-server returns empty responses: Check that --n-predict is not set to 0. Verify the prompt format matches the model's expected template. Test with llama-cli first to confirm the model works.
