Inference Serving Llama Engine

A comprehensive skill for running LLM inference on CPUs, GPUs, and Apple silicon. Includes structured workflows, validation checks, and reusable patterns for AI research.

Skill · Cliptics · ai research · v1.0.0 · MIT

LLM Inference with llama.cpp

Overview

A comprehensive skill for running LLM inference using llama.cpp — the C/C++ runtime that enables fast CPU and GPU inference for quantized models. llama.cpp powers local AI applications through Ollama, LM Studio, and other tools, supporting GGUF-format models with various quantization levels from Q2 to FP16 on consumer hardware.

When to Use

  • Running LLMs locally on consumer hardware
  • Need CPU-only or hybrid CPU+GPU inference
  • Serving quantized models (GGUF format)
  • Building local AI applications without cloud dependencies
  • Running models on Mac with Metal acceleration
  • Edge deployment with minimal dependencies

Quick Start

```bash
# Build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON   # or -DGGML_METAL=ON for Mac
cmake --build build --config Release -j

# Download a GGUF model
# From HuggingFace: search for "GGUF" versions of popular models

# Run inference
./build/bin/llama-cli -m model.gguf \
  -p "Explain quantum computing" \
  -n 256 --temp 0.7
```

Python Bindings

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b-instruct-q4_k_m.gguf",
    n_ctx=4096,        # Context window
    n_gpu_layers=-1,   # Offload all layers to GPU (-1 = all)
    n_threads=8,       # CPU threads for CPU layers
    verbose=False,
)

# Chat completion (OpenAI-compatible)
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response["choices"][0]["message"]["content"])
```

OpenAI-Compatible Server

```bash
# Start server
./build/bin/llama-server -m model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 -c 4096

# Use with any OpenAI client
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
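Because the server speaks the OpenAI chat-completions protocol, any HTTP client works. Below is a minimal Python sketch using only the standard library; the base URL and payload shape follow the curl example above, while `build_chat_request` and `chat` are illustrative helper names, not part of llama.cpp.

```python
import json
import urllib.request

# Endpoint assumed from the server flags above (--host 0.0.0.0 --port 8080)
BASE_URL = "http://localhost:8080/v1/chat/completions"

def build_chat_request(messages, temperature=0.7, max_tokens=256):
    """Build the JSON payload the OpenAI-compatible endpoint expects."""
    return {
        "model": "model",          # llama-server serves one model; the name is nominal
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def chat(messages):
    """POST a chat completion and return the assistant's reply text."""
    payload = json.dumps(build_chat_request(messages)).encode()
    req = urllib.request.Request(
        BASE_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The same payload works with the official `openai` Python client by pointing its `base_url` at the server.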

Quantization Levels

| Quantization | Bits/Weight | Quality | Size (7B) | Speed |
|---|---|---|---|---|
| Q2_K | 2.5 | Low | 2.8 GB | Fastest |
| Q3_K_M | 3.4 | Fair | 3.3 GB | Fast |
| Q4_K_M | 4.5 | Good | 4.1 GB | Fast |
| Q5_K_M | 5.5 | Very Good | 4.8 GB | Medium |
| Q6_K | 6.5 | Excellent | 5.5 GB | Medium |
| Q8_0 | 8.0 | Near Perfect | 7.2 GB | Slower |
| F16 | 16.0 | Perfect | 14.0 GB | Slowest |
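The size column tracks a simple back-of-envelope rule: file size is roughly parameters times bits per weight divided by 8, plus a few percent of overhead for embeddings and metadata. A hedged sketch of that estimate (the 5% overhead factor is an assumption chosen to approximate the table above, not a llama.cpp constant):

```python
def estimate_gguf_size_gb(n_params_billion: float, bits_per_weight: float,
                          overhead: float = 1.05) -> float:
    """Rough GGUF file-size estimate: params * bits/weight / 8 bytes,
    scaled by a small overhead factor for embeddings, metadata, and
    mixed-precision tensors. Real files vary by architecture."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return round(bytes_total * overhead / 1e9, 1)
```

For a 7B model at Q4_K_M (4.5 bits/weight) this gives about 4.1 GB, in line with the table; exact sizes depend on the model's tensor layout.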

Configuration Reference

| Parameter | Default | Description |
|---|---|---|
| `-m` | Required | Model file path (GGUF) |
| `-c` / `n_ctx` | 512 | Context window size |
| `-ngl` / `n_gpu_layers` | 0 | Layers to offload to GPU |
| `-t` / `n_threads` | Auto | CPU threads |
| `-b` / `n_batch` | 512 | Batch size for prompt processing |
| `--temp` | 0.8 | Sampling temperature |
| `--top-k` | 40 | Top-k sampling |
| `--top-p` | 0.95 | Nucleus sampling |
| `--repeat-penalty` | 1.1 | Repetition penalty |
| `-n` | -1 | Max tokens to generate (-1 = infinite) |
| `--mlock` | false | Lock model in RAM (no swap) |
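For the Python bindings, these CLI flags split into two groups: load-time options passed to the `Llama(...)` constructor and sampling options passed per generation call. A sketch of that mapping (the model path is a placeholder; values mirror the defaults above except where noted):

```python
# CLI flag on the comment, llama-cpp-python keyword on the left.

LOAD_KWARGS = {          # passed to Llama(...) at load time
    "model_path": "./models/llama-3-8b-instruct-q4_k_m.gguf",  # -m
    "n_ctx": 4096,       # -c (raised from the 512 default)
    "n_gpu_layers": -1,  # -ngl (-1 = offload all layers)
    "n_threads": 8,      # -t
    "n_batch": 512,      # -b
    "use_mlock": True,   # --mlock
}

SAMPLING_KWARGS = {         # passed per generation call
    "temperature": 0.8,     # --temp
    "top_k": 40,            # --top-k
    "top_p": 0.95,          # --top-p
    "repeat_penalty": 1.1,  # --repeat-penalty
    "max_tokens": 256,      # -n
}

# Usage (requires a model file on disk):
# llm = Llama(**LOAD_KWARGS)
# out = llm.create_chat_completion(messages=[...], **SAMPLING_KWARGS)
```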

Hardware Requirements

| Model Size | Q4_K_M Size | Min RAM | Recommended GPU |
|---|---|---|---|
| 1-3B | 1-2 GB | 4 GB | Any / CPU only |
| 7-8B | 4-5 GB | 8 GB | 6GB VRAM |
| 13B | 7-8 GB | 16 GB | 12GB VRAM |
| 30-34B | 18-20 GB | 32 GB | 24GB VRAM |
| 70B | 38-42 GB | 64 GB | 48GB VRAM |

Best Practices

  1. Use Q4_K_M for most use cases — Best balance of quality and performance
  2. Offload all layers to GPU — `-ngl -1` for maximum speed when VRAM permits
  3. Use mmap for large models — Faster loading, shared memory across processes
  4. Set context to what you need — Larger context uses more memory
  5. Use Metal on Mac — Native Apple Silicon acceleration, no CUDA needed
  6. Use the OpenAI-compatible server — Easy integration with existing OpenAI client code
  7. Monitor VRAM usage — Partially offloaded models split between CPU and GPU
  8. Use flash attention — `--flash-attn` for better memory efficiency on supported hardware
  9. Benchmark with your workload — Speed varies significantly by model, quant, and hardware
  10. Keep models on SSD — Loading from HDD is much slower due to GGUF's mmap usage
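Practices 2 and 7 can be combined into a rough heuristic for choosing `-ngl`: offload everything when the model fits in free VRAM, otherwise offload proportionally. A sketch under the simplifying assumption that all layers are roughly the same size (`suggest_gpu_layers` is an illustrative helper, not a llama.cpp API):

```python
def suggest_gpu_layers(model_size_gb: float, n_layers: int,
                       free_vram_gb: float, reserve_gb: float = 1.0) -> int:
    """Heuristic -ngl picker: -1 (all layers) if the whole model fits in
    free VRAM minus a reserve for the KV cache and compute buffers,
    otherwise the number of equal-size layers that fit."""
    usable = max(free_vram_gb - reserve_gb, 0.0)
    if usable >= model_size_gb:
        return -1  # everything fits: offload all layers
    per_layer = model_size_gb / n_layers
    return int(usable / per_layer)
```

The 1 GB reserve is a guess; actual KV-cache size grows with context length, so benchmark (practice 9) rather than trusting the estimate.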

Troubleshooting

Slow generation speed

```bash
# Check GPU offloading: with -ngl -1, the load logs should show "CUDA" or "Metal"

# If CPU only, increase thread count
-t $(nproc)

# Use flash attention
--flash-attn
```

Out of memory

```bash
# Reduce context window
-c 2048

# Use smaller quantization
# Q4_K_M → Q3_K_M → Q2_K

# Partially offload to GPU
-ngl 20   # Only some layers
```

Model not loading

```bash
# Check GGUF format version
# llama.cpp updates may require re-quantized models

# Verify model integrity
md5sum model.gguf
```