{"id":97,"date":"2025-07-14T10:32:45","date_gmt":"2025-07-14T10:32:45","guid":{"rendered":"https:\/\/angirash.in\/blog\/2025\/07\/14\/caching-strategies-for-llm-systems-that-actually-work\/"},"modified":"2025-07-14T10:32:45","modified_gmt":"2025-07-14T10:32:45","slug":"caching-strategies-for-llm-systems-that-actually-work","status":"publish","type":"post","link":"https:\/\/angirash.in\/blog\/2025\/07\/14\/caching-strategies-for-llm-systems-that-actually-work\/","title":{"rendered":"Caching strategies for LLM systems that actually work"},"content":{"rendered":"<h2>The silent reason your LLM bill is 2x higher than it should be<\/h2>\n<p>If your latency is spiky, your OpenAI or self-hosted bill is creeping up, and your team keeps telling you they &#8220;turned on caching&#8221; already, I have a guess: you have a string-equality cache on the prompt and nothing else. That buys you a small win and hides the real opportunities.<\/p>\n<p>I keep seeing the same pattern. Teams push RAG or agent features, usage grows, then cost and tail latency bite. The fix is rarely a new model. It is proper caching across the LLM stack.<\/p>\n<h2>Where this breaks and why<\/h2>\n<ul>\n<li>Chat and copilots: repetitive intents with slightly different phrasing, high overlap in retrieval, same tool calls, zero reuse.<\/li>\n<li>RAG search: every query rebuilds the same embeddings, runs near-identical ANN searches, re-ranks the same candidates, regenerates similar summaries.<\/li>\n<li>Agents: deterministic tool chains with no memoization. The language model ends up recomputing what your software could have remembered.<\/li>\n<\/ul>\n<p>Why it happens:<br \/>\n&#8211; Caches are bolted on after launch, not designed as part of the architecture.<br \/>\n&#8211; Keys are naive. They ignore system prompt versions, tool schemas, corpus versions, locale, or sampling params.<br \/>\n&#8211; Fear of staleness blocks useful caches. 
So teams pick the only &#8220;safe&#8221; cache &#8211; exact prompt match &#8211; which has low hit rates.<\/p>\n<p>What most teams misunderstand:<br \/>\n&#8211; You do not need a single cache. You need layers that match the cost profile: embeddings, retrieval, tools, final generation.<br \/>\n&#8211; Staleness is not binary. Use soft TTL with background refresh and versioned keys instead of global disables.<br \/>\n&#8211; Semantic caches can work if you gate them properly. But you must treat them like a product feature, not a toggle.<\/p>\n<h2>The architecture that actually works<\/h2>\n<p>Think in layers. Each layer has a different key, TTL, and failure mode.<\/p>\n<h3>L0. In-flight request de-duplication<\/h3>\n<ul>\n<li>Coalesce identical requests already in progress. Prevent thundering herds on cache miss.<\/li>\n<li>Key on the same normalized key you would use for L1.<\/li>\n<li>Pattern: single-flight per key with a short window.<\/li>\n<\/ul>\n<p>Failure mode: if you forget this, a trending query can spawn hundreds of identical LLM calls at once.<\/p>\n<h3>L1. Exact-match prompt cache with correct keys<\/h3>\n<ul>\n<li>Value: final model output + minimal metadata.<\/li>\n<li>Store: per-instance LRU for microseconds, Redis or Memcached for cross-instance.<\/li>\n<li>TTL: minutes to hours depending on content freshness.<\/li>\n<\/ul>\n<p>Key must include at least:<br \/>\n&#8211; tenant_id and user segment (or locale) to avoid cross-tenant leaks<br \/>\n&#8211; model_id@version and sampling params digest (temperature, top_p, seed)<br \/>\n&#8211; system_prompt_digest and tool_schema_digest<br \/>\n&#8211; safety_policy_version<br \/>\n&#8211; normalized user input<\/p>\n<p>Normalization matters. Trim whitespace, lowercase where safe, collapse numeric formatting, sort JSON keys, freeze tool schema serialization. 
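<\/p>\n<p>As a sketch of that key discipline (standard library only; build_cache_key, the sha256 truncation, and the &#8220;llm:v1&#8221; prefix are illustrative choices, not a fixed scheme):<\/p>

```python
# Sketch: deterministic exact-match (L1) cache key. The component list
# mirrors the one above; hashing and prefix choices are illustrative.
import hashlib
import json

def digest(obj) -> str:
    """Stable short digest of any JSON-serializable value (sorted keys)."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:16]

def normalize(text: str) -> str:
    """Collapse whitespace; lowercase where that is safe for the domain."""
    return " ".join(text.split()).lower()

def build_cache_key(tenant_id: str, model: str, system_prompt: str,
                    tool_schema: dict, sampling_params: dict,
                    policy_version: str, user_input: str) -> str:
    parts = [
        tenant_id,                   # tenant scoping: no cross-tenant reuse
        model,                       # model_id@version
        digest(system_prompt),
        digest(tool_schema),         # frozen serialization via sort_keys
        digest(sampling_params),     # temperature, top_p, seed
        policy_version,
        digest(normalize(user_input)),
    ]
    return "llm:v1:" + ":".join(parts)
```

<p>Two requests that differ only in whitespace or casing now map to the same key, while a change in sampling params or model version misses cleanly.<\/p>\n<p>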
I have seen 10 to 20 percent hit-rate gains from just cleaning inputs.<\/p>\n<p>Failure modes:<br \/>\n&#8211; Model upgrade without versioned keys returns inconsistent outputs from old cache entries.<br \/>\n&#8211; Cross-locale reuse leads to wrong tone or units.<\/p>\n<h3>L2. Semantic prompt cache &#8211; used carefully<\/h3>\n<ul>\n<li>Goal: reuse answers for semantically equivalent queries.<\/li>\n<li>Store: vector index keyed by query embedding, value points to an answer blob in object store.<\/li>\n<li>Gating: cosine similarity threshold tuned offline, plus a cheap verifier step.<\/li>\n<\/ul>\n<p>Verifier options:<br \/>\n&#8211; Fast model check: ask a small model to validate reuse with a yes\/no verdict.<br \/>\n&#8211; Retrieval overlap check: run retrieval for the new query, compare top-k doc IDs with cached answer sources. Require overlap >= X%.<\/p>\n<p>TTL: short. Think minutes to a few hours, unless your domain is static.<\/p>\n<p>Failure modes:<br \/>\n&#8211; Domain drift and temporal queries. Do not reuse &#8220;What is today\u2019s rate&#8221; style questions.<br \/>\n&#8211; Hidden context. If answers depend on user profile, bake a profile_version into the key or skip semantic caching.<\/p>\n<h3>L3. Tool and deterministic step cache<\/h3>\n<ul>\n<li>Many agent tools are pure functions over structured inputs. Cache them like normal software.<\/li>\n<li>Good candidates: currency conversion, calendar availability, pricing lookups, HTML to markdown conversions, doc parsers, summarizers over immutable docs.<\/li>\n<\/ul>\n<p>Keys:<br \/>\n&#8211; tool_name + stable-serialized input + upstream dataset version<br \/>\n&#8211; TTL: long for slow-changing data, short for live data with its own SLA<\/p>\n<p>Failure modes:<br \/>\n&#8211; Assuming idempotence where it does not exist. If the tool queries a live API, treat it as non-deterministic.<\/p>\n<h3>L4. RAG retrieval cache<\/h3>\n<ul>\n<li>Do not recompute embeddings for the same documents. 
Do not re-run the same ANN queries if your corpus did not change.<\/li>\n<\/ul>\n<p>Sub-layers:<br \/>\n&#8211; Embedding cache: key on content_hash + embed_model_version. TTL effectively infinite until doc changes.<br \/>\n&#8211; ANN query cache: key on quantized query embedding bucket + corpus_version + retrieval_params. TTL short.<br \/>\n&#8211; Re-ranker cache: key on candidate_ids + re_ranker_model@version.<\/p>\n<p>Corpus versioning is the unlock. Every ingest or update increments the corpus_version. Include it in all keys. No need for expensive invalidation jobs.<\/p>\n<p>Failure modes:<br \/>\n&#8211; Ignoring re-ranker changes. If you switch cross-encoder versions, stale rankings hurt quality.<\/p>\n<h3>L5. Synthesis and summary caches<\/h3>\n<ul>\n<li>For FAQ-like prompts or dashboard captions over fixed data, precompute.<\/li>\n<li>Key on template_id + slot_values_digest + data_snapshot_version.<\/li>\n<li>Store large values in object storage, index metadata in Redis.<\/li>\n<\/ul>\n<p>Failure modes:<br \/>\n&#8211; Over-personalization. Keep stable templated answers and apply light post-processing for name or locale to maintain reuse.<\/p>\n<h3>L6. Model-side optimizations when you self-host<\/h3>\n<ul>\n<li>KV-cache reuse across conversation turns can cut generation latency. Works if you pin the model and session state.<\/li>\n<li>Speculative decoding is not a cache, but it lowers perceived latency. Pair it with L1 to smooth p95.<\/li>\n<\/ul>\n<p>Failure modes:<br \/>\n&#8211; KV state blow-up under high concurrency. Enforce per-session eviction policies and a memory ceiling per GPU.<\/p>\n<h2>Trade-offs and failure modes you will actually hit<\/h2>\n<ul>\n<li>Staleness vs cost: you will ship the wrong answer at some point. Use soft TTL with background refresh to cap damage.<\/li>\n<li>Key explosions: too many dimensions will tank hit rates. 
Start with the essentials, add dimensions only when you see collisions.<\/li>\n<li>Semantic cache regressions: a single bad reuse can crater trust. Gate aggressively and audit.<\/li>\n<li>Stampedes: miss storms can wipe out your p99. Single-flight, jittered TTLs, and warmers are not optional.<\/li>\n<li>Memory pressure: L1 caches fight with your app memory. Cap by bytes, not just item count.<\/li>\n<\/ul>\n<h2>Practical designs that work at scale<\/h2>\n<h3>Build a strict cache key once<\/h3>\n<p>Create a library that every service uses. Do not let teams hand-roll keys.<\/p>\n<p>Key components I include in most systems:<br \/>\n&#8211; tenant_id<br \/>\n&#8211; model_id@version<br \/>\n&#8211; system_prompt_digest<br \/>\n&#8211; tool_schema_digest<br \/>\n&#8211; sampling_params_digest<br \/>\n&#8211; query_normalized or query_embedding_bucket<br \/>\n&#8211; corpus_or_dataset_version<br \/>\n&#8211; locale and policy_version<br \/>\n&#8211; persona_or_profile_version when personalization affects content<\/p>\n<h3>Tiered storage<\/h3>\n<ul>\n<li>L1: in-process LRU for micro hits. 100 to 500 ms TTL can already flatten spikes.<\/li>\n<li>L2: Redis or Memcached for cross-instance. Keep payloads under 1 MB.<\/li>\n<li>L3: Object store for large answers and traces. Index pointers in Redis.<\/li>\n<li>For semantic caches: dedicated vector DB or an ANN library with HNSW\/IVF. Store only embeddings and small metadata; keep answers in object store.<\/li>\n<\/ul>\n<h3>Soft TTL with background refresh<\/h3>\n<ul>\n<li>Return cached value immediately if age &lt; soft_ttl.<\/li>\n<li>If age >= soft_ttl but &lt; hard_ttl, return cached value and trigger a non-blocking refresh.<\/li>\n<li>Past hard_ttl, block and recompute, protected by single-flight.<\/li>\n<\/ul>\n<h3>Guardrails and observability<\/h3>\n<ul>\n<li>Per-layer hit rate, p50 and p95 latency, token savings, and error rates.<\/li>\n<li>Separate dashboards for exact vs semantic caches. 
If semantic reuses have lower user satisfaction, roll them back quickly.<\/li>\n<li>Include cache metadata in traces. When you get a bad answer, you need to know which layer served it.<\/li>\n<\/ul>\n<h3>Security and compliance<\/h3>\n<ul>\n<li>Do not store PII in cache values unless encrypted with a managed KMS. Better: separate IDs from content.<\/li>\n<li>Strict tenant scoping in keys and storage. I have seen cross-tenant cache reads happen with sloppy prefixes.<\/li>\n<li>Expunge controls: if a document is subject to deletion, increment corpus_version and wipe relevant keys by prefix.<\/li>\n<\/ul>\n<h2>What this saves in practice<\/h2>\n<ul>\n<li>Exact-match prompt cache with proper normalization: 15 to 35 percent hit rate on many chat workloads.<\/li>\n<li>Embedding cache: 70 to 95 percent reduction in embedding calls on steady-state corpora.<\/li>\n<li>Retrieval and re-ranker caches: 30 to 60 percent fewer ANN and re-rank ops on common queries.<\/li>\n<li>Semantic cache with strict gating: 5 to 20 percent extra reuse, highly domain dependent.<\/li>\n<\/ul>\n<p>Combined, I regularly see 30 to 60 percent lower token spend and 25 to 50 percent lower p95 latency. The remaining tail usually comes from long generations or cold tool paths.<\/p>\n<p>Scaling risk if you skip this: at 100 QPS with 1k-token outputs, you will either melt your quota or spend a lot for nothing. Caching is the only way to bend that curve without strangling features.<\/p>\n<h2>Quick recipes you can ship this sprint<\/h2>\n<ul>\n<li>Normalize everything. Centralize prompt rendering and JSON serialization so keys are deterministic.<\/li>\n<li>Add single-flight for in-flight dedupe. It pays for itself in one incident.<\/li>\n<li>Version your corpora and prompts. No more global invalidation jobs.<\/li>\n<li>Turn embeddings into an ingest-time concern. Hash the content, cache the vector, stop recomputing on read.<\/li>\n<li>Start a conservative semantic cache behind a feature flag. 
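<\/li>\n<\/ul>\n<p>A conservative gate might look like this (cosine and the 0.95 threshold are illustrative; verify_reuse stands in for the small-model or retrieval-overlap check):<\/p>

```python
# Sketch of a gated semantic-cache lookup: nearest cached query must clear
# a high similarity threshold AND a cheap verifier before reuse.
# verify_reuse is a hypothetical stand-in for the verifier step.
import math

SIM_THRESHOLD = 0.95   # start high; tune offline against labeled pairs

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_lookup(query_vec, entries, verify_reuse):
    """entries: list of (cached_vec, cached_answer). Returns answer or None."""
    best_sim, best_answer = 0.0, None
    for vec, answer in entries:
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_sim, best_answer = sim, answer
    if best_sim >= SIM_THRESHOLD and verify_reuse(query_vec, best_answer):
        return best_answer          # gated hit: reuse the cached answer
    return None                     # miss: fall through to the model
```

<p>Returning None on any doubt is the point: a semantic cache should fail toward recomputation, never toward a bad reuse.<\/p>\n<ul>\n<li>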
Similarity threshold high, verifier on, short TTL. Measure.<\/li>\n<li>Instrument per-layer hit rates and token savings. If you cannot see it, it did not happen.<\/li>\n<\/ul>\n<h2>Key takeaways<\/h2>\n<ul>\n<li>Caching is a system design problem, not a Redis toggle.<\/li>\n<li>Layer your caches: in-flight, exact match, semantic, tools, retrieval, synthesis.<\/li>\n<li>Keys must include model, prompt, params, corpus version, and tenant to be safe and useful.<\/li>\n<li>Use soft TTL with background refresh to contain staleness without giving up speed.<\/li>\n<li>Semantic caches work if you gate and verify. Otherwise they will burn trust.<\/li>\n<li>Make embeddings a write-time cost. Stop paying for them at read-time.<\/li>\n<\/ul>\n<h2>If you need help<\/h2>\n<p>If this sounds familiar and you are juggling cost, latency, and correctness, I have designed and fixed these caching layers for teams running real traffic. If you want an external set of hands to make this work without six months of experiments, this is exactly the kind of thing I help teams ship when systems start breaking at scale.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The silent reason your LLM bill is 2x higher than it should be If your latency is spiky, your OpenAI or self-hosted bill is creeping up, and your team keeps&#8230; 
<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[3],"tags":[17,24,15],"class_list":["post-97","post","type-post","status-publish","format-standard","hentry","category-ai-architecture","tag-ai-cost","tag-ai-latency","tag-ai-system-design"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/97","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/comments?post=97"}],"version-history":[{"count":0,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/97\/revisions"}],"wp:attachment":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/media?parent=97"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/categories?post=97"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/tags?post=97"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}