
Prompt Caching Dynamic

Strategic prompt caching techniques for reducing LLM API costs by up to 90% through multi-level caching: prefix caching, full response caching, and semantic similarity matching.

When to Use

Implement prompt caching when:

  • High-volume LLM API calls with repetitive system prompts or contexts
  • Need to reduce API costs without sacrificing response quality
  • Applications with predictable query patterns (customer support, code review)
  • RAG systems where the same context chunks are frequently retrieved

Skip caching when:

  • Every query is unique with no repeated context
  • Freshness is critical (real-time data analysis)
  • Low query volume where caching overhead exceeds savings
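As a rough sanity check before implementing, you can estimate whether caching clears the break-even point. A minimal sketch (the function name and parameters are illustrative, not part of this skill):

```python
def expected_savings(calls: int, cost_per_call: float,
                     hit_rate: float, cache_cost: float = 0.0) -> float:
    """Rough break-even estimate: caching pays off when the cost of the
    avoided API calls exceeds the cache's own running cost."""
    return calls * hit_rate * cost_per_call - cache_cost
```

If the result is near zero or negative at your realistic hit rate, the caching overhead likely exceeds the savings.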

Quick Start

Provider-Level Prefix Caching (Anthropic)

```python
import anthropic

client = anthropic.Anthropic()

# The system prompt is cached after the first call.
# Subsequent calls with the same prefix get ~90% cost reduction.
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a senior code reviewer...",  # Long system prompt
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Review this function..."}],
)

# Check cache usage
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Cache write tokens: {response.usage.cache_creation_input_tokens}")
```

Application-Level Response Caching

```python
import hashlib
import json


class LLMCache:
    def __init__(self, cache_backend="redis", ttl=3600):
        if cache_backend == "redis":
            import redis
            self.cache = redis.Redis()
        else:
            raise NotImplementedError(f"Unsupported backend: {cache_backend}")
        self.ttl = ttl  # 1 hour default

    def _cache_key(self, model, messages, temperature):
        # Deterministic key: sort JSON keys so equivalent requests hash identically
        content = json.dumps(
            {"model": model, "messages": messages, "temperature": temperature},
            sort_keys=True,
        )
        return f"llm:{hashlib.sha256(content.encode()).hexdigest()}"

    def get_or_call(self, client, **kwargs):
        # Only cache deterministic requests (temperature=0)
        if kwargs.get("temperature", 1) > 0:
            return client.messages.create(**kwargs)

        key = self._cache_key(kwargs["model"], kwargs["messages"], 0)
        cached = self.cache.get(key)
        if cached:
            return json.loads(cached)  # Note: cache hits return a plain dict

        response = client.messages.create(**kwargs)
        self.cache.setex(key, self.ttl, json.dumps(response.to_dict()))
        return response
```

Semantic Similarity Caching

```python
import numpy as np
from sentence_transformers import SentenceTransformer


class SemanticCache:
    def __init__(self, similarity_threshold=0.95):
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")
        self.threshold = similarity_threshold
        self.cache = []  # (embedding, query, response) tuples

    def get_or_call(self, query, llm_fn):
        query_embedding = self.encoder.encode(query)
        # Linear scan; swap in a vector index (e.g. FAISS) at scale
        for cached_emb, cached_query, cached_response in self.cache:
            similarity = np.dot(query_embedding, cached_emb) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(cached_emb)
            )
            if similarity >= self.threshold:
                return cached_response

        response = llm_fn(query)
        self.cache.append((query_embedding, query, response))
        return response
```

Core Concepts

Caching Levels

| Level | What's Cached | Cost Reduction | Latency Reduction | Implementation |
|---|---|---|---|---|
| Prefix | System prompt tokens | Up to 90% on input | Minimal | Provider API flag |
| Response | Full API responses | 100% on cache hit | Near-instant | Application layer |
| Semantic | Similar query responses | 100% on cache hit | Near-instant | Embedding + similarity |

Cache Invalidation Strategies

| Strategy | Use When | TTL |
|---|---|---|
| Time-based | General purpose | 1-24 hours |
| Version-based | Prompt template changes | On deployment |
| Content-based | RAG context updates | On index rebuild |
| Manual | Emergency corrections | Immediate |
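Version-based invalidation can be implemented by folding a template version into the cache key, so a deployment that changes the prompt template automatically misses all old entries. A minimal sketch (the version constant and function name are illustrative):

```python
import hashlib
import json

# Hypothetical version marker; bump it whenever the prompt template changes
PROMPT_TEMPLATE_VERSION = "2024-06-01"


def versioned_cache_key(model: str, messages: list,
                        template_version: str = PROMPT_TEMPLATE_VERSION) -> str:
    """Embed the template version in the key: a version bump changes
    every key, which invalidates the old entries without deleting them."""
    payload = json.dumps(
        {"v": template_version, "model": model, "messages": messages},
        sort_keys=True,
    )
    return f"llm:{hashlib.sha256(payload.encode()).hexdigest()}"
```

Old entries are never read again and simply expire via their TTL, so no explicit cache flush is needed on deployment.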

Configuration

| Parameter | Default | Description |
|---|---|---|
| cache_backend | "redis" | Storage backend (redis, memcached, in-memory) |
| ttl | 3600 | Cache entry time-to-live in seconds |
| similarity_threshold | 0.95 | Cosine similarity threshold for semantic cache |
| max_cache_size | 10000 | Maximum entries before eviction |
| cache_deterministic_only | True | Only cache temperature=0 requests |
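The parameters above can be bundled into a single config object. A sketch assuming the defaults from the table (the class name is illustrative):

```python
from dataclasses import dataclass


@dataclass
class CacheConfig:
    cache_backend: str = "redis"           # redis, memcached, or in-memory
    ttl: int = 3600                        # entry time-to-live in seconds
    similarity_threshold: float = 0.95     # cosine threshold for semantic cache
    max_cache_size: int = 10_000           # entries before eviction
    cache_deterministic_only: bool = True  # only cache temperature=0 requests
```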

Best Practices

  1. Only cache deterministic responses (temperature=0) — sampled output varies between calls, so serving a cached copy misrepresents the model's behavior
  2. Use prefix caching for system prompts — move static context to the beginning for maximum cache hits
  3. Set appropriate TTLs — too long risks stale responses, too short reduces cache effectiveness
  4. Monitor cache hit rates — aim for 60%+ hit rate; below 30% suggests the workload isn't cache-friendly
  5. Hash cache keys deterministically — sort JSON keys and normalize whitespace before hashing
  6. Layer your caches — prefix caching + response caching + semantic caching for maximum savings
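Monitoring hit rates (practice 4) can be as simple as a pair of counters. A minimal sketch using the 60%/30% thresholds above (the class name is illustrative):

```python
class CacheStats:
    """Track cache hits/misses and flag workloads that aren't cache-friendly."""

    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def healthy(self) -> bool:
        # 60%+ is the target; below 30% suggests caching isn't paying off
        return self.hit_rate >= 0.6
```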

Common Issues

Low cache hit rate: Ensure system prompts and shared context appear at the beginning of messages (prefix position). Normalize query formatting before cache key generation.
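Normalizing query formatting before key generation might look like this (a sketch; whether to also lowercase depends on whether case carries meaning in your queries):

```python
import re


def normalize_query(text: str) -> str:
    # Collapse internal whitespace and trim so that formatting-only
    # differences ("foo  bar\n" vs "foo bar") hit the same cache entry.
    return re.sub(r"\s+", " ", text).strip()
```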

Stale cached responses: Implement version-based invalidation when prompt templates change. Add content hashes for RAG contexts that update frequently.

Memory pressure from large cache: Use Redis or Memcached instead of in-memory caching. Set max_cache_size with LRU eviction. Compress cached responses before storage.
