Inference Serving Llama Engine

A comprehensive skill for running LLM inference on CPUs, GPUs, and Apple silicon. Includes structured workflows, validation checks, and reusable patterns for AI research.

Skill · Cliptics · ai research · v1.0.0 · MIT

LLM Inference with llama.cpp

Overview

A comprehensive skill for running LLM inference using llama.cpp — the C/C++ runtime that enables fast CPU and GPU inference for quantized models. llama.cpp powers local AI applications through Ollama, LM Studio, and other tools, supporting GGUF-format models with various quantization levels from Q2 to FP16 on consumer hardware.

When to Use

  • Running LLMs locally on consumer hardware
  • Need CPU-only or hybrid CPU+GPU inference
  • Serving quantized models (GGUF format)
  • Building local AI applications without cloud dependencies
  • Running models on Mac with Metal acceleration
  • Edge deployment with minimal dependencies

Quick Start

```bash
# Build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON   # or -DGGML_METAL=ON for Mac
cmake --build build --config Release -j

# Download a GGUF model
# From HuggingFace: search for "GGUF" versions of popular models

# Run inference
./build/bin/llama-cli -m model.gguf \
  -p "Explain quantum computing" \
  -n 256 --temp 0.7
```

Python Bindings

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b-instruct-q4_k_m.gguf",
    n_ctx=4096,        # Context window
    n_gpu_layers=-1,   # Offload all layers to GPU (-1 = all)
    n_threads=8,       # CPU threads for CPU layers
    verbose=False,
)

# Chat completion (OpenAI-compatible)
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response["choices"][0]["message"]["content"])
```

OpenAI-Compatible Server

```bash
# Start server
./build/bin/llama-server -m model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 -c 4096

# Use with any OpenAI client
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
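Because the server speaks the OpenAI chat-completions protocol, any HTTP client works. Below is a minimal Python sketch using only the standard library; the base URL and payload shape follow the curl example above, while `build_chat_request` and `chat` are illustrative helper names, not part of llama.cpp.

```python
import json
import urllib.request

# Endpoint assumed from the server flags above (--host 0.0.0.0 --port 8080)
BASE_URL = "http://localhost:8080/v1/chat/completions"

def build_chat_request(messages, temperature=0.7, max_tokens=256):
    """Build the JSON payload the OpenAI-compatible endpoint expects."""
    return {
        "model": "model",          # llama-server serves one model; the name is nominal
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def chat(messages):
    """POST a chat completion and return the assistant's reply text."""
    payload = json.dumps(build_chat_request(messages)).encode()
    req = urllib.request.Request(
        BASE_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The same payload works with the official `openai` Python client by pointing its `base_url` at the server.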

Quantization Levels

| Quantization | Bits/Weight | Quality | Size (7B) | Speed |
|---|---|---|---|---|
| Q2_K | 2.5 | Low | 2.8 GB | Fastest |
| Q3_K_M | 3.4 | Fair | 3.3 GB | Fast |
| Q4_K_M | 4.5 | Good | 4.1 GB | Fast |
| Q5_K_M | 5.5 | Very Good | 4.8 GB | Medium |
| Q6_K | 6.5 | Excellent | 5.5 GB | Medium |
| Q8_0 | 8.0 | Near Perfect | 7.2 GB | Slower |
| F16 | 16.0 | Perfect | 14.0 GB | Slowest |
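The size column tracks a simple back-of-envelope rule: file size is roughly parameters times bits per weight divided by 8, plus a few percent of overhead for embeddings and metadata. A hedged sketch of that estimate (the 5% overhead factor is an assumption chosen to approximate the table above, not a llama.cpp constant):

```python
def estimate_gguf_size_gb(n_params_billion: float, bits_per_weight: float,
                          overhead: float = 1.05) -> float:
    """Rough GGUF file-size estimate: params * bits/weight / 8 bytes,
    scaled by a small overhead factor for embeddings, metadata, and
    mixed-precision tensors. Real files vary by architecture."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return round(bytes_total * overhead / 1e9, 1)
```

For a 7B model at Q4_K_M (4.5 bits/weight) this gives about 4.1 GB, in line with the table; exact sizes depend on the model's tensor layout.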

Configuration Reference

| Parameter | Default | Description |
|---|---|---|
| `-m` | Required | Model file path (GGUF) |
| `-c` / `n_ctx` | 512 | Context window size |
| `-ngl` / `n_gpu_layers` | 0 | Layers to offload to GPU |
| `-t` / `n_threads` | Auto | CPU threads |
| `-b` / `n_batch` | 512 | Batch size for prompt processing |
| `--temp` | 0.8 | Sampling temperature |
| `--top-k` | 40 | Top-k sampling |
| `--top-p` | 0.95 | Nucleus sampling |
| `--repeat-penalty` | 1.1 | Repetition penalty |
| `-n` | -1 | Max tokens to generate (-1 = infinite) |
| `--mlock` | false | Lock model in RAM (no swap) |
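For the Python bindings, these CLI flags split into two groups: load-time options passed to the `Llama(...)` constructor and sampling options passed per generation call. A sketch of that mapping (the model path is a placeholder; values mirror the defaults above except where noted):

```python
# CLI flag on the comment, llama-cpp-python keyword on the left.

LOAD_KWARGS = {          # passed to Llama(...) at load time
    "model_path": "./models/llama-3-8b-instruct-q4_k_m.gguf",  # -m
    "n_ctx": 4096,       # -c (raised from the 512 default)
    "n_gpu_layers": -1,  # -ngl (-1 = offload all layers)
    "n_threads": 8,      # -t
    "n_batch": 512,      # -b
    "use_mlock": True,   # --mlock
}

SAMPLING_KWARGS = {         # passed per generation call
    "temperature": 0.8,     # --temp
    "top_k": 40,            # --top-k
    "top_p": 0.95,          # --top-p
    "repeat_penalty": 1.1,  # --repeat-penalty
    "max_tokens": 256,      # -n
}

# Usage (requires a model file on disk):
# llm = Llama(**LOAD_KWARGS)
# out = llm.create_chat_completion(messages=[...], **SAMPLING_KWARGS)
```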

Hardware Requirements

| Model Size | Q4_K_M Size | Min RAM | Recommended GPU |
|---|---|---|---|
| 1-3B | 1-2 GB | 4 GB | Any / CPU only |
| 7-8B | 4-5 GB | 8 GB | 6GB VRAM |
| 13B | 7-8 GB | 16 GB | 12GB VRAM |
| 30-34B | 18-20 GB | 32 GB | 24GB VRAM |
| 70B | 38-42 GB | 64 GB | 48GB VRAM |

Best Practices

  1. Use Q4_K_M for most use cases — Best balance of quality and performance
  2. Offload all layers to GPU — `-ngl -1` for maximum speed when VRAM permits
  3. Use mmap for large models — Faster loading, shared memory across processes
  4. Set context to what you need — Larger context uses more memory
  5. Use Metal on Mac — Native Apple Silicon acceleration, no CUDA needed
  6. Use the OpenAI-compatible server — Easy integration with existing OpenAI client code
  7. Monitor VRAM usage — Partially offloaded models split between CPU and GPU
  8. Use flash attention — `--flash-attn` for better memory efficiency on supported hardware
  9. Benchmark with your workload — Speed varies significantly by model, quant, and hardware
  10. Keep models on SSD — Loading from HDD is much slower due to GGUF's mmap usage
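Practices 2 and 7 can be combined into a rough heuristic for choosing `-ngl`: offload everything when the model fits in free VRAM, otherwise offload proportionally. A sketch under the simplifying assumption that all layers are roughly the same size (`suggest_gpu_layers` is an illustrative helper, not a llama.cpp API):

```python
def suggest_gpu_layers(model_size_gb: float, n_layers: int,
                       free_vram_gb: float, reserve_gb: float = 1.0) -> int:
    """Heuristic -ngl picker: -1 (all layers) if the whole model fits in
    free VRAM minus a reserve for the KV cache and compute buffers,
    otherwise the number of equal-size layers that fit."""
    usable = max(free_vram_gb - reserve_gb, 0.0)
    if usable >= model_size_gb:
        return -1  # everything fits: offload all layers
    per_layer = model_size_gb / n_layers
    return int(usable / per_layer)
```

The 1 GB reserve is a guess; actual KV-cache size grows with context length, so benchmark (practice 9) rather than trusting the estimate.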

Troubleshooting

Slow generation speed

```bash
# Check GPU offloading: with -ngl -1, the load logs should show "CUDA" or "Metal"

# If CPU only, increase thread count
-t $(nproc)

# Use flash attention
--flash-attn
```

Out of memory

```bash
# Reduce context window
-c 2048

# Use smaller quantization
# Q4_K_M → Q3_K_M → Q2_K

# Partially offload to GPU
-ngl 20   # Only some layers
```

Model not loading

```bash
# Check GGUF format version
# llama.cpp updates may require re-quantized models

# Verify model integrity
md5sum model.gguf
```