
Prompt Caching Dynamic

Strategic prompt caching techniques for reducing LLM API costs by up to 90% through multi-level caching: prefix caching, full response caching, and semantic similarity matching.

When to Use

Implement prompt caching when:

  • High-volume LLM API calls with repetitive system prompts or contexts
  • Need to reduce API costs without sacrificing response quality
  • Applications with predictable query patterns (customer support, code review)
  • RAG systems where the same context chunks are frequently retrieved

Skip caching when:

  • Every query is unique with no repeated context
  • Freshness is critical (real-time data analysis)
  • Low query volume where caching overhead exceeds savings
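As a rough sanity check before implementing, you can estimate whether caching clears the break-even point. A minimal sketch (the function name and parameters are illustrative, not part of this skill):

```python
def expected_savings(calls: int, cost_per_call: float,
                     hit_rate: float, cache_cost: float = 0.0) -> float:
    """Rough break-even estimate: caching pays off when the cost of the
    avoided API calls exceeds the cache's own running cost."""
    return calls * hit_rate * cost_per_call - cache_cost
```

If the result is near zero or negative at your realistic hit rate, the caching overhead likely exceeds the savings.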

Quick Start

Provider-Level Prefix Caching (Anthropic)

```python
import anthropic

client = anthropic.Anthropic()

# The system prompt is cached after the first call.
# Subsequent calls with the same prefix get ~90% cost reduction.
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a senior code reviewer...",  # Long system prompt
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Review this function..."}],
)

# Check cache usage
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Cache write tokens: {response.usage.cache_creation_input_tokens}")
```

Application-Level Response Caching

```python
import hashlib
import json


class LLMCache:
    def __init__(self, cache_backend="redis", ttl=3600):
        if cache_backend == "redis":
            import redis
            self.cache = redis.Redis()
        else:
            raise NotImplementedError(f"Unsupported backend: {cache_backend}")
        self.ttl = ttl  # 1 hour default

    def _cache_key(self, model, messages, temperature):
        # Deterministic key: sort JSON keys so equivalent requests hash identically
        content = json.dumps(
            {"model": model, "messages": messages, "temperature": temperature},
            sort_keys=True,
        )
        return f"llm:{hashlib.sha256(content.encode()).hexdigest()}"

    def get_or_call(self, client, **kwargs):
        # Only cache deterministic requests (temperature=0)
        if kwargs.get("temperature", 1) > 0:
            return client.messages.create(**kwargs)

        key = self._cache_key(kwargs["model"], kwargs["messages"], 0)
        cached = self.cache.get(key)
        if cached:
            return json.loads(cached)  # Note: cache hits return a plain dict

        response = client.messages.create(**kwargs)
        self.cache.setex(key, self.ttl, json.dumps(response.to_dict()))
        return response
```

Semantic Similarity Caching

```python
import numpy as np
from sentence_transformers import SentenceTransformer


class SemanticCache:
    def __init__(self, similarity_threshold=0.95):
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")
        self.threshold = similarity_threshold
        self.cache = []  # (embedding, query, response) tuples

    def get_or_call(self, query, llm_fn):
        query_embedding = self.encoder.encode(query)
        # Linear scan; swap in a vector index (e.g. FAISS) at scale
        for cached_emb, cached_query, cached_response in self.cache:
            similarity = np.dot(query_embedding, cached_emb) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(cached_emb)
            )
            if similarity >= self.threshold:
                return cached_response

        response = llm_fn(query)
        self.cache.append((query_embedding, query, response))
        return response
```

Core Concepts

Caching Levels

| Level | What's Cached | Cost Reduction | Latency Reduction | Implementation |
|---|---|---|---|---|
| Prefix | System prompt tokens | Up to 90% on input | Minimal | Provider API flag |
| Response | Full API responses | 100% on cache hit | Near-instant | Application layer |
| Semantic | Similar query responses | 100% on cache hit | Near-instant | Embedding + similarity |

Cache Invalidation Strategies

| Strategy | Use When | TTL |
|---|---|---|
| Time-based | General purpose | 1-24 hours |
| Version-based | Prompt template changes | On deployment |
| Content-based | RAG context updates | On index rebuild |
| Manual | Emergency corrections | Immediate |
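Version-based invalidation can be implemented by folding a template version into the cache key, so a deployment that changes the prompt template automatically misses all old entries. A minimal sketch (the version constant and function name are illustrative):

```python
import hashlib
import json

# Hypothetical version marker; bump it whenever the prompt template changes
PROMPT_TEMPLATE_VERSION = "2024-06-01"


def versioned_cache_key(model: str, messages: list,
                        template_version: str = PROMPT_TEMPLATE_VERSION) -> str:
    """Embed the template version in the key: a version bump changes
    every key, which invalidates the old entries without deleting them."""
    payload = json.dumps(
        {"v": template_version, "model": model, "messages": messages},
        sort_keys=True,
    )
    return f"llm:{hashlib.sha256(payload.encode()).hexdigest()}"
```

Old entries are never read again and simply expire via their TTL, so no explicit cache flush is needed on deployment.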

Configuration

| Parameter | Default | Description |
|---|---|---|
| cache_backend | "redis" | Storage backend (redis, memcached, in-memory) |
| ttl | 3600 | Cache entry time-to-live in seconds |
| similarity_threshold | 0.95 | Cosine similarity threshold for semantic cache |
| max_cache_size | 10000 | Maximum entries before eviction |
| cache_deterministic_only | True | Only cache temperature=0 requests |
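The parameters above can be bundled into a single config object. A sketch assuming the defaults from the table (the class name is illustrative):

```python
from dataclasses import dataclass


@dataclass
class CacheConfig:
    cache_backend: str = "redis"           # redis, memcached, or in-memory
    ttl: int = 3600                        # entry time-to-live in seconds
    similarity_threshold: float = 0.95     # cosine threshold for semantic cache
    max_cache_size: int = 10_000           # entries before eviction
    cache_deterministic_only: bool = True  # only cache temperature=0 requests
```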

Best Practices

  1. Only cache deterministic responses (temperature=0) — sampled output varies between calls, so serving a cached copy misrepresents the model's behavior
  2. Use prefix caching for system prompts — move static context to the beginning for maximum cache hits
  3. Set appropriate TTLs — too long risks stale responses, too short reduces cache effectiveness
  4. Monitor cache hit rates — aim for 60%+ hit rate; below 30% suggests the workload isn't cache-friendly
  5. Hash cache keys deterministically — sort JSON keys and normalize whitespace before hashing
  6. Layer your caches — prefix caching + response caching + semantic caching for maximum savings
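Monitoring hit rates (practice 4) can be as simple as a pair of counters. A minimal sketch using the 60%/30% thresholds above (the class name is illustrative):

```python
class CacheStats:
    """Track cache hits/misses and flag workloads that aren't cache-friendly."""

    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def healthy(self) -> bool:
        # 60%+ is the target; below 30% suggests caching isn't paying off
        return self.hit_rate >= 0.6
```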

Common Issues

Low cache hit rate: Ensure system prompts and shared context appear at the beginning of messages (prefix position). Normalize query formatting before cache key generation.
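Normalizing query formatting before key generation might look like this (a sketch; whether to also lowercase depends on whether case carries meaning in your queries):

```python
import re


def normalize_query(text: str) -> str:
    # Collapse internal whitespace and trim so that formatting-only
    # differences ("foo  bar\n" vs "foo bar") hit the same cache entry.
    return re.sub(r"\s+", " ", text).strip()
```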

Stale cached responses: Implement version-based invalidation when prompt templates change. Add content hashes for RAG contexts that update frequently.

Memory pressure from large cache: Use Redis or Memcached instead of in-memory caching. Set max_cache_size with LRU eviction. Compress cached responses before storage.
