Prompt Caching Dynamic
Strategic prompt caching techniques for reducing LLM API costs by up to 90% through multi-level caching: prefix caching, full response caching, and semantic similarity matching.
When to Use
Implement prompt caching when:
- You make high-volume LLM API calls with repetitive system prompts or contexts
- You need to reduce API costs without sacrificing response quality
- Your application has predictable query patterns (customer support, code review)
- You run a RAG system where the same context chunks are frequently retrieved
Skip caching when:
- Every query is unique with no repeated context
- Freshness is critical (real-time data analysis)
- Query volume is low enough that caching overhead exceeds the savings
Quick Start
Provider-Level Prefix Caching (Anthropic)
```python
import anthropic

client = anthropic.Anthropic()

# The system prompt is cached after the first call.
# Subsequent calls with the same prefix get ~90% cost reduction on those tokens.
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a senior code reviewer...",  # long system prompt
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Review this function..."}],
)

# Check cache usage
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Cache write tokens: {response.usage.cache_creation_input_tokens}")
```
Application-Level Response Caching
```python
import hashlib
import json


class LLMCache:
    def __init__(self, cache_backend="redis", ttl=3600):
        if cache_backend == "redis":
            import redis

            self.cache = redis.Redis()
        else:
            raise ValueError(f"Unsupported cache backend: {cache_backend}")
        self.ttl = ttl  # 1 hour default

    def _cache_key(self, model, messages, temperature):
        # Only cache deterministic requests (temperature=0)
        content = json.dumps(
            {"model": model, "messages": messages, "temperature": temperature},
            sort_keys=True,
        )
        return f"llm:{hashlib.sha256(content.encode()).hexdigest()}"

    def get_or_call(self, client, **kwargs):
        if kwargs.get("temperature", 1) > 0:
            # Don't cache non-deterministic requests
            return client.messages.create(**kwargs)
        key = self._cache_key(kwargs["model"], kwargs["messages"], 0)
        cached = self.cache.get(key)
        if cached:
            return json.loads(cached)  # note: a plain dict, not a Message object
        response = client.messages.create(**kwargs)
        self.cache.setex(key, self.ttl, json.dumps(response.to_dict()))
        return response
```
Semantic Similarity Caching
```python
import numpy as np
from sentence_transformers import SentenceTransformer


class SemanticCache:
    def __init__(self, similarity_threshold=0.95):
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")
        self.threshold = similarity_threshold
        self.cache = []  # (embedding, query, response) tuples

    def get_or_call(self, query, llm_fn):
        query_embedding = self.encoder.encode(query)
        for cached_emb, cached_query, cached_response in self.cache:
            similarity = np.dot(query_embedding, cached_emb) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(cached_emb)
            )
            if similarity >= self.threshold:
                return cached_response
        response = llm_fn(query)
        self.cache.append((query_embedding, query, response))
        return response
```
Core Concepts
Caching Levels
| Level | What's Cached | Cost Reduction | Latency Reduction | Implementation |
|---|---|---|---|---|
| Prefix | System prompt tokens | Up to 90% on input | Minimal | Provider API flag |
| Response | Full API responses | 100% on cache hit | Near-instant | Application layer |
| Semantic | Similar query responses | 100% on cache hit | Near-instant | Embedding + similarity |
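To see where the "up to 90%" figure for prefix caching comes from: providers typically charge a premium to write a prefix into the cache and a steep discount to read it back (Anthropic documents roughly 1.25x base input price for cache writes and 0.1x for cache reads). A back-of-envelope sketch, treating those multipliers as assumptions:

```python
def prefix_cache_savings(prefix_tokens, calls, write_mult=1.25, read_mult=0.10):
    """Fraction of input-token cost saved on a cached prefix vs. resending it.

    write_mult/read_mult are assumptions modeled on typical provider pricing:
    the first call pays a write premium, every later call pays the read discount.
    """
    uncached = prefix_tokens * calls
    cached = prefix_tokens * (write_mult + read_mult * (calls - 1))
    return 1 - cached / uncached


# With 100 calls sharing a 2,000-token system prompt:
print(f"{prefix_cache_savings(2000, 100):.1%}")
```

Note that a single call is actually *more* expensive with caching (the write premium is never amortized), which is why low-volume workloads should skip it.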
Cache Invalidation Strategies
| Strategy | Use When | TTL |
|---|---|---|
| Time-based | General purpose | 1-24 hours |
| Version-based | Prompt template changes | On deployment |
| Content-based | RAG context updates | On index rebuild |
| Manual | Emergency corrections | Immediate |
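The version-based and content-based strategies above can share one mechanism: fold a template version and a context hash into the cache key, so changed prompts or rebuilt indexes miss old entries automatically instead of needing an explicit purge. A minimal sketch; `PROMPT_VERSION` is a hypothetical constant you would bump on deployment:

```python
import hashlib
import json

PROMPT_VERSION = "2025-01-15"  # hypothetical: bump whenever the template changes


def versioned_cache_key(model: str, messages: list, context_text: str = "") -> str:
    """Cache key that self-invalidates on template or RAG-context changes."""
    payload = json.dumps(
        {
            "v": PROMPT_VERSION,  # version-based invalidation
            "ctx": hashlib.sha256(context_text.encode()).hexdigest(),  # content-based
            "model": model,
            "messages": messages,
        },
        sort_keys=True,
    )
    return "llm:" + hashlib.sha256(payload.encode()).hexdigest()


# Same request, different retrieved context -> different key, so stale entries are never read.
k1 = versioned_cache_key("claude-sonnet-4-20250514", [{"role": "user", "content": "hi"}], "chunk A")
k2 = versioned_cache_key("claude-sonnet-4-20250514", [{"role": "user", "content": "hi"}], "chunk B")
print(k1 != k2)  # → True
```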
Configuration
| Parameter | Default | Description |
|---|---|---|
| cache_backend | "redis" | Storage backend (redis, memcached, in-memory) |
| ttl | 3600 | Cache entry time-to-live in seconds |
| similarity_threshold | 0.95 | Cosine similarity threshold for semantic cache |
| max_cache_size | 10000 | Maximum entries before eviction |
| cache_deterministic_only | True | Only cache temperature=0 requests |
Best Practices
- Only cache deterministic responses (temperature=0) — a sampled response is one of many possible outputs, so replaying it from cache silently pins arbitrary behavior
- Use prefix caching for system prompts — move static context to the beginning for maximum cache hits
- Set appropriate TTLs — too long risks stale responses, too short reduces cache effectiveness
- Monitor cache hit rates — aim for 60%+ hit rate; below 30% suggests the workload isn't cache-friendly
- Hash cache keys deterministically — sort JSON keys and normalize whitespace before hashing
- Layer your caches — prefix caching + response caching + semantic caching for maximum savings
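The layering advice can be sketched as a single lookup path: exact-match response cache first, then the semantic cache, and only then the API call (where provider prefix caching still discounts the input tokens). The names here are illustrative, and stdlib `difflib` stands in for real embedding similarity:

```python
import difflib


class TinySemanticCache:
    """Stand-in for an embedding cache: difflib ratio instead of cosine similarity."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # (query, response) pairs

    def lookup(self, query):
        for cached_query, response in self.entries:
            if difflib.SequenceMatcher(None, query, cached_query).ratio() >= self.threshold:
                return response
        return None

    def store(self, query, response):
        self.entries.append((query, response))


def layered_get(query, response_cache, semantic_cache, call_llm):
    """Try exact match, then similarity, then fall through to the API."""
    if (hit := response_cache.get(query)) is not None:  # level 2: exact match
        return hit
    if (hit := semantic_cache.lookup(query)) is not None:  # level 3: similar query
        return hit
    response = call_llm(query)  # level 1: prefix caching applies inside the API call
    response_cache[query] = response
    semantic_cache.store(query, response)
    return response
```

In production the dict would be Redis, `TinySemanticCache` an embedding index, and `call_llm` an API client using `cache_control` on its system prompt.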
Common Issues
Low cache hit rate: Ensure system prompts and shared context appear at the beginning of messages (prefix position). Normalize query formatting before cache key generation.
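Normalizing before hashing can be as simple as collapsing whitespace and case; this is an illustrative sketch, and lowercasing may be too aggressive for case-sensitive workloads:

```python
import hashlib
import re


def normalized_key(query: str) -> str:
    """Collapse whitespace and case so trivially different queries share a key."""
    canonical = re.sub(r"\s+", " ", query).strip().lower()
    return hashlib.sha256(canonical.encode()).hexdigest()


print(normalized_key("Review this  function") == normalized_key(" review this function\n"))  # → True
```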
Stale cached responses: Implement version-based invalidation when prompt templates change. Add content hashes for RAG contexts that update frequently.
Memory pressure from large cache: Use Redis or Memcached instead of in-memory caching. Set max_cache_size with LRU eviction. Compress cached responses before storage.
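The eviction and compression advice can be sketched with an in-memory LRU built on `OrderedDict` plus `zlib` — an illustrative stand-in for a Redis instance configured with an `allkeys-lru` maxmemory policy:

```python
import json
import zlib
from collections import OrderedDict


class CompressedLRUCache:
    """In-memory LRU with zlib-compressed values (stand-in for a real cache server)."""

    def __init__(self, max_size=10_000):
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as recently used
        return json.loads(zlib.decompress(self._data[key]))

    def set(self, key, value):
        self._data[key] = zlib.compress(json.dumps(value).encode())
        self._data.move_to_end(key)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used
```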