LLM Inference with llama.cpp
Overview
A comprehensive skill for running LLM inference using llama.cpp — the C/C++ runtime that enables fast CPU and GPU inference for quantized models. llama.cpp powers local AI applications through Ollama, LM Studio, and other tools, supporting GGUF-format models with various quantization levels from Q2 to FP16 on consumer hardware.
When to Use
- Running LLMs locally on consumer hardware
- Need CPU-only or hybrid CPU+GPU inference
- Serving quantized models (GGUF format)
- Building local AI applications without cloud dependencies
- Running models on Mac with Metal acceleration
- Edge deployment with minimal dependencies
Quick Start
```bash
# Build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON   # or -DGGML_METAL=ON for Mac
cmake --build build --config Release -j

# Download a GGUF model
# From HuggingFace: search for "GGUF" versions of popular models

# Run inference
./build/bin/llama-cli -m model.gguf \
  -p "Explain quantum computing" \
  -n 256 --temp 0.7
```
Python Bindings
```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b-instruct-q4_k_m.gguf",
    n_ctx=4096,        # Context window
    n_gpu_layers=-1,   # Offload all layers to GPU (-1 = all)
    n_threads=8,       # CPU threads for CPU layers
    verbose=False,
)

# Chat completion (OpenAI-compatible)
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response["choices"][0]["message"]["content"])
```
OpenAI-Compatible Server
```bash
# Start server
./build/bin/llama-server -m model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 -c 4096

# Use with any OpenAI client
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
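Because the server speaks the OpenAI chat-completions protocol, any HTTP client works. A minimal stdlib-only sketch (no `openai` package needed), assuming llama-server is running on localhost:8080 as started above; `build_payload` and `chat` are illustrative names, not part of llama.cpp:

```python
import json
import urllib.request

def build_payload(messages, temperature=0.7, max_tokens=256):
    # llama-server serves whichever model it loaded; the "model" field is just a label
    return {
        "model": "model",
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def chat(messages, base_url="http://localhost:8080", **kwargs):
    """POST an OpenAI-style chat request to a running llama-server."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_payload(messages, **kwargs)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Requires a running server:
# print(chat([{"role": "user", "content": "Hello"}]))
```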
Quantization Levels
| Quantization | Bits/Weight | Quality | Size (7B) | Speed |
|---|---|---|---|---|
| Q2_K | 2.5 | Low | 2.8 GB | Fastest |
| Q3_K_M | 3.4 | Fair | 3.3 GB | Fast |
| Q4_K_M | 4.5 | Good | 4.1 GB | Fast |
| Q5_K_M | 5.5 | Very Good | 4.8 GB | Medium |
| Q6_K | 6.5 | Excellent | 5.5 GB | Medium |
| Q8_0 | 8.0 | Near Perfect | 7.2 GB | Slower |
| F16 | 16.0 | Perfect | 14.0 GB | Slowest |
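The bits-per-weight column maps almost directly to file size: parameters × bits ÷ 8, with a few percent of overhead for metadata and tensors kept at higher precision. A rough back-of-the-envelope sketch:

```python
def gguf_size_gb(n_params, bits_per_weight):
    """Rough on-disk size estimate for a quantized GGUF file.

    Real files run a few percent larger because metadata and some
    tensors (e.g. embeddings) are stored at higher precision.
    """
    return n_params * bits_per_weight / 8 / 1e9

print(round(gguf_size_gb(7e9, 4.5), 1))   # 3.9 -- close to the 4.1 GB listed for Q4_K_M
print(round(gguf_size_gb(7e9, 16.0), 1))  # 14.0 -- matches the F16 row
```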
Configuration Reference
| Parameter | Default | Description |
|---|---|---|
| -m | Required | Model file path (GGUF) |
| -c / n_ctx | 512 | Context window size |
| -ngl / n_gpu_layers | 0 | Layers to offload to GPU |
| -t / n_threads | Auto | CPU threads |
| -b / n_batch | 512 | Batch size for prompt processing |
| --temp | 0.8 | Sampling temperature |
| --top-k | 40 | Top-k sampling |
| --top-p | 0.95 | Nucleus sampling |
| --repeat-penalty | 1.1 | Repetition penalty |
| -n | -1 | Max tokens to generate (-1 = infinite) |
| --mlock | false | Lock model in RAM (no swap) |
Hardware Requirements
| Model Size | Q4_K_M Size | Min RAM | Recommended GPU |
|---|---|---|---|
| 1-3B | 1-2 GB | 4 GB | Any / CPU only |
| 7-8B | 4-5 GB | 8 GB | 6GB VRAM |
| 13B | 7-8 GB | 16 GB | 12GB VRAM |
| 30-34B | 18-20 GB | 32 GB | 24GB VRAM |
| 70B | 38-42 GB | 64 GB | 48GB VRAM |
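Minimum RAM is roughly the model file plus the KV cache, which grows linearly with context length. A sketch of the standard KV-cache size calculation, using an assumed Llama-3-8B-style configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128) — these numbers are illustrative, not read from any particular GGUF:

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    # K and V each hold n_ctx * n_kv_heads * head_dim elements per layer;
    # bytes_per_elem=2 assumes an f16 cache
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Assumed Llama-3-8B-style config at 4096 context
gib = kv_cache_bytes(n_layers=32, n_ctx=4096, n_kv_heads=8, head_dim=128) / 2**30
print(round(gib, 2))  # 0.5
```

This is why doubling the context window can cost far more memory than expected on larger models: the cache scales with layers and context, independent of quantization level.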
Best Practices
- Use Q4_K_M for most use cases — Best balance of quality and performance
- Offload all layers to GPU — -ngl -1 for maximum speed when VRAM permits
- Use mmap for large models — Faster loading, shared memory across processes
- Set context to what you need — Larger context uses more memory
- Use Metal on Mac — Native Apple Silicon acceleration, no CUDA needed
- Use the OpenAI-compatible server — Easy integration with existing OpenAI client code
- Monitor VRAM usage — Partially offloaded models split between CPU and GPU
- Use flash attention — --flash-attn for better memory efficiency on supported hardware
- Benchmark with your workload — Speed varies significantly by model, quant, and hardware
- Keep models on SSD — Loading from HDD is much slower due to GGUF's mmap usage
Troubleshooting
Slow generation speed
```bash
# Check GPU offloading
# -ngl -1 should show "CUDA" or "Metal" in load logs

# If CPU only, increase thread count
-t $(nproc)

# Use flash attention
--flash-attn
```
Out of memory
```bash
# Reduce context window
-c 2048

# Use smaller quantization
# Q4_K_M → Q3_K_M → Q2_K

# Partially offload to GPU
-ngl 20  # Only some layers
```
Model not loading
```bash
# Check GGUF format version
# llama.cpp updates may require re-quantized models

# Verify model integrity
md5sum model.gguf
```
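Beyond checksums, the file header itself is a quick sanity check: valid GGUF files begin with the 4-byte magic `GGUF` followed by a little-endian uint32 format version. A minimal sketch that reads just those first eight bytes:

```python
import struct

def gguf_header(path):
    """Return (magic, version) from a GGUF file's first 8 bytes."""
    with open(path, "rb") as f:
        magic = f.read(4)
        version = struct.unpack("<I", f.read(4))[0]
    return magic, version

# magic, version = gguf_header("model.gguf")
# magic != b"GGUF" means the file is truncated, corrupt, or not GGUF at all;
# an unexpectedly old version may need re-quantizing for current llama.cpp
```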