{"id":62,"date":"2025-04-14T10:35:22","date_gmt":"2025-04-14T10:35:22","guid":{"rendered":"https:\/\/angirash.in\/blog\/2025\/07\/14\/stateless-vs-stateful-ai-systems-what-works-at-scale\/"},"modified":"2026-04-09T23:26:18","modified_gmt":"2026-04-09T23:26:18","slug":"stateless-vs-stateful-ai-systems-what-works-at-scale","status":"publish","type":"post","link":"https:\/\/angirash.in\/blog\/2025\/04\/14\/stateless-vs-stateful-ai-systems-what-works-at-scale\/","title":{"rendered":"Stateless vs stateful AI systems: what actually works at scale"},"content":{"rendered":"<h2>The fastest way to blow your LLM budget<\/h2>\n<p>The fastest way to blow your LLM budget is to keep shoving yesterday&#8217;s conversation back into the prompt on every turn. I keep seeing teams call that \u201cstateful,\u201d then wonder why latency spikes, costs drift, and answers degrade as context gets noisy. The other extreme is teams forcing \u201cstateless\u201d everywhere and then building a Rube Goldberg device of caches and retries to fake memory, which collapses the first time you have concurrency or a partial outage.<\/p>\n<p>This post is not theory. It is the pattern that has held up across customer support copilots, analytics agents, and workflow automation in production. The short version: keep compute stateless, keep data stateful, and make the boundary between them explicit and metered.<\/p>\n<h2>Where the problem shows up, and why<\/h2>\n<ul>\n<li>Chat and agents: every turn drags a growing transcript. Token tax grows linearly with session length, then performance degrades due to noisy context.<\/li>\n<li>RAG systems: per-user or per-tenant knowledge sprinkled across prompts, KV stores, and vector DB metadata. No single source of truth. Stale or duplicated facts sneak in.<\/li>\n<li>Tool-using agents: retries re-execute tools because there is no idempotency key. 
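A minimal sketch of the guard, using the key scheme described later in this post (session_id + step_index + tool_name); the in-memory dict stands in for Redis or Dynamo, and the helper names are hypothetical:

```python
# Hypothetical sketch: dedupe tool executions by idempotency key.
# A retry with the same (session_id, step_index, tool_name) replays
# the recorded result instead of calling the tool again.
_results = {}  # stand-in for a durable KV store like Redis or Dynamo

def run_tool(session_id, step_index, tool_name, tool_fn, *args):
    key = f"{session_id}:{step_index}:{tool_name}"
    if key in _results:          # retry path: replay the recorded result
        return _results[key]
    result = tool_fn(*args)      # first attempt: execute exactly once
    _results[key] = result       # record before acknowledging upstream
    return result
```

Without a guard like this: 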
State gets double-written or tasks are duplicated.<\/li>\n<li>Personalization: teams store raw history as \u201cmemory,\u201d but it is unstructured and unsummarized. Useful facts drown in logs.<\/li>\n<\/ul>\n<p>Why it happens in real systems:<br \/>\n&#8211; People conflate product memory with model context. Not the same thing.<br \/>\n&#8211; LLM latency and cost are visible, storage and state sprawl are invisible until it is too late.<br \/>\n&#8211; Framework defaults encourage stuffing everything into the prompt because it is the fastest way to get a demo working.<\/p>\n<p>What most teams misunderstand:<br \/>\n&#8211; Stateless is a property of workers, not the product. You can be operationally stateless while being product-stateful by reconstructing state on demand.<br \/>\n&#8211; Memory is not a blob. There are types of state that need different stores, different lifetimes, and different consistency rules.<\/p>\n<h2>Technical deep dive: the architecture that scales<\/h2>\n<p>Think in four buckets of state:<\/p>\n<p>1) Request state<br \/>\n&#8211; Correlation ID, tenant, auth, flags, deterministic routing input. TTL: minutes.<br \/>\n&#8211; Store: in headers and a lightweight KV like Redis.<\/p>\n<p>2) Session state<br \/>\n&#8211; Conversation turns and tool call results for a live session. TTL: hours to days.<br \/>\n&#8211; Store: Redis or Dynamo with versioned upserts. Summarize aggressively.<\/p>\n<p>3) Domain state<br \/>\n&#8211; Facts, profiles, documents, knowledge. TTL: weeks to forever, but versioned.<br \/>\n&#8211; Store: primary DB plus vector DB, both namespaced and access-controlled.<\/p>\n<p>4) Model policy state<br \/>\n&#8211; System prompts, routing rules, tools registry, safety config. TTL: deployed versions.<br \/>\n&#8211; Store: config service or feature flags with immutable versions.<\/p>\n<p>Core pattern:<br \/>\n&#8211; Workers are stateless. 
They build context on each request using a Context Builder service that pulls from the stores above and enforces token budgets.<br \/>\n&#8211; All writes go through an Event Log or an append-only audit trail. Readers materialize views into session and domain stores. You can replay or debug.<br \/>\n&#8211; Use idempotency keys for tool execution. Key = session_id + step_index + tool_name.<br \/>\n&#8211; Apply optimistic concurrency for session memory. Version field with compare-and-swap so two parallel turns do not clobber each other.<\/p>\n<p>Trade-offs and failure modes:<br \/>\n&#8211; Latency tax from state hydration. If your Context Builder hits 4 stores per request, your p95 suffers. Fix with batching and prefetch.<br \/>\n&#8211; Stale memory. Eventual consistency can surface outdated facts. For critical operations use a read-your-writes path or lock a short-lived session sequencer.<br \/>\n&#8211; Memory bloat. Unbounded transcripts and embedding every message will quietly 10x your bill. Cap, summarize, dedupe.<br \/>\n&#8211; Cross-tenant leakage. Misused vector DB filters or shared namespaces can leak content. Namespaces should be enforced at the SDK and schema level.<br \/>\n&#8211; Retry storms. Without idempotency, retries call tools again, charge external APIs twice, and double-insert state.<\/p>\n<h3>What a good request looks like<\/h3>\n<ul>\n<li>Request arrives with correlation_id, tenant_id, session_id.<\/li>\n<li>Context Builder fetches:\n<ul>\n<li>Short-term session summary under 1k tokens<\/li>\n<li>Top K retrieved domain facts using a deterministic query seeded with the user intent<\/li>\n<li>User profile slice (structured attributes only)<\/li>\n<li>Policy prompt for this route, pinned by version<\/li>\n<\/ul>\n<\/li>\n<li>Context budgeter enforces quotas per source. If over budget, it drops least-relevant chunks or switches to a coarser summary.<\/li>\n<li>LLM call executes. 
Tool calls carry idempotency keys and write outputs as events.<\/li>\n<li>Session store is updated via CAS with the new turn and a refreshed summary.<\/li>\n<li>Observability records: tokens_in, tokens_from_state, retrieval_hits, latency per store, and tool retries.<\/li>\n<\/ul>\n<h2>Practical solutions I keep reusing<\/h2>\n<p>1) Make state explicit in the API<br \/>\n&#8211; Define input contract: what state is allowed and how much of it. Anything not in the contract does not reach the prompt.<br \/>\n&#8211; Version prompts and tools. Never hot-edit a system prompt without a new version.<\/p>\n<p>2) Two-tier memory<br \/>\n&#8211; Tier A: short-term in Redis, TTL hours, keep last N turns and a rolling 1k token summary.<br \/>\n&#8211; Tier B: long-term in DB + vector DB. Store structured facts, not raw chat. Summarization jobs convert transcripts into facts with citations.<\/p>\n<p>3) Context budgeter as a gatekeeper<br \/>\n&#8211; Hard caps: max tokens per source, max chunks per retrieval, max profile size.<br \/>\n&#8211; Soft caps: decay older facts unless pinned. If you think you need 20k-token context every turn, you probably need better summarization.<\/p>\n<p>4) Idempotency and concurrency<br \/>\n&#8211; Every tool call has an idempotency key. On retry, you look up results and return them.<br \/>\n&#8211; Session writes use compare-and-swap with a version. If conflict, rehydrate and reapply.<\/p>\n<p>5) Deterministic retrieval<br \/>\n&#8211; Retrieval should be stable for the same input. Use query templates and tie-breakers. Randomness in retrieval destroys debuggability.<\/p>\n<p>6) Observability on state, not just model<br \/>\n&#8211; Emit state metrics: tokens_from_history, tokens_from_docs, retrieval_hit_rate, state_load_ms, state_write_ms.<br \/>\n&#8211; Track token tax ratio: tokens_from_state divided by total tokens. Keep it under 30 percent for most apps.<\/p>\n<p>7) Data safety by default<br \/>\n&#8211; Per-tenant namespaces everywhere. 
Encrypt at rest. Do not store raw PII in vector DB; store IDs and fetch the PII from the primary DB on demand.<\/p>\n<p>8) Output caching with a state fingerprint<br \/>\n&#8211; Cache key = normalized question + fingerprint(state_version, retrieval_ids, policy_version).<br \/>\n&#8211; Gains are huge for repeated Q&amp;A and analytics workloads.<\/p>\n<p>9) Summarize with structure, not prose<br \/>\n&#8211; Summaries should be key facts plus citations and a recency score. Prose summaries drift and are hard to diff.<\/p>\n<p>10) Cold starts and tail latency<br \/>\n&#8211; Warm the Context Builder with connection pools and batched lookups. For chat, prefetch likely next-turn state after sending the previous answer.<\/p>\n<h2>When to pick stateless, stateful, or hybrid<\/h2>\n<ul>\n<li>Fully stateless workers, no product memory\n<ul>\n<li>Good for: one-shot classification, simple search answers.<\/li>\n<li>Pros: easy to scale, low risk.<\/li>\n<li>Cons: zero personalization or continuity.<\/li>\n<\/ul>\n<\/li>\n<li>In-memory sticky sessions\n<ul>\n<li>Good for: small internal tools, low concurrency.<\/li>\n<li>Pros: low latency.<\/li>\n<li>Cons: no HA, hard to autoscale, failover loses memory. I avoid this in production.<\/li>\n<\/ul>\n<\/li>\n<li>Session store with TTL plus domain stores (hybrid)\n<ul>\n<li>Good for: most real apps. This is the default.<\/li>\n<li>Pros: operationally stateless workers with rich product memory.<\/li>\n<li>Cons: requires discipline on budgets, versioning, and observability.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>Cost and performance math that changes decisions<\/h2>\n<p>You do not need perfect numbers. 
You need orders of magnitude.<\/p>\n<p>Assumptions for illustration only:<br \/>\n&#8211; Input tokens cost 5 dollars per 1M tokens.<br \/>\n&#8211; Output tokens cost 15 dollars per 1M tokens.<\/p>\n<p>Scenario A: push full 12k-token history every turn<br \/>\n&#8211; 50k sessions per day, 6 turns per session.<br \/>\n&#8211; Input tokens from history alone: 12k x 6 x 50k = 3.6B tokens.<br \/>\n&#8211; Cost: 3.6B x 5 \/ 1M = 18k dollars per day. About 540k per month, just for repeated history.<\/p>\n<p>Scenario B: 1k summary + 2k fresh context per turn<br \/>\n&#8211; 50k sessions per day, 6 turns per session.<br \/>\n&#8211; Input tokens from history: 1k x 6 x 50k = 300M.<br \/>\n&#8211; Cost: 300M x 5 \/ 1M = 1.5k dollars per day. About 45k per month.<\/p>\n<p>That gap funds your retrieval infra and observability and still leaves money on the table. Also, shorter prompts are faster and usually more accurate because noise is lower.<\/p>\n<p>Latency impacts:<br \/>\n&#8211; Hydration using 4 stores at 40 ms each in p50 is not 160 ms in p95. Long tails compound. Batch, parallelize, and prefer one round trip per store.<\/p>\n<p>Scaling risks:<br \/>\n&#8211; Token tax grows with engagement. If you do not cap it, growth turns into margin compression. Make the budgeter a hard gate.<\/p>\n<h2>Key takeaways<\/h2>\n<ul>\n<li>Keep workers stateless. Keep data stateful. Glue them with a strict, metered Context Builder.<\/li>\n<li>Treat memory as types of state with lifetimes, not a transcript blob.<\/li>\n<li>Use idempotency keys and versioned writes to stop retries from corrupting state.<\/li>\n<li>Enforce token budgets per source. Summarize to structure with citations.<\/li>\n<li>Observe state metrics. 
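The ratio itself is trivial to compute and worth wiring into a hard alert; a sketch, with metric names following this post's conventions and the cap as an assumption you should tune:

```python
# Token tax ratio: the share of each prompt spent re-sending stored state.
def token_tax_ratio(tokens_from_state, total_tokens):
    if total_tokens == 0:
        return 0.0
    return tokens_from_state / total_tokens

def over_budget(tokens_from_state, total_tokens, cap=0.30):
    # cap=0.30 reflects the rule of thumb in this post; tune per app.
    return token_tax_ratio(tokens_from_state, total_tokens) > cap
```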
Token tax ratio under 30 percent is a good rule of thumb.<\/li>\n<li>Namespaces and access control must be enforced in every store and at the SDK.<\/li>\n<li>Output caching needs a state fingerprint, not just the question string.<\/li>\n<\/ul>\n<h2>If you are wrestling with this in production<\/h2>\n<p>If you are seeing token creep, flaky memory, or tail-latency spikes from state hydration, you are not alone. I help teams redesign the state boundary, put guardrails on context, and get cost and latency under control without gutting product quality. If you want a second set of eyes on your architecture or a quick cost model, reach out.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The fastest way to blow your LLM budget The fastest way to blow your LLM budget is to keep shoving yesterday&#8217;s conversation back into the prompt on every turn. I&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[3],"tags":[17,16,15],"class_list":["post-62","post","type-post","status-publish","format-standard","hentry","category-ai-architecture","tag-ai-cost","tag-ai-scalability","tag-ai-system-design"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/62","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/comments?post=62"}],"version-history":[{"count":1,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/62\/revisions"}],"predecessor-version":[{"id":82,"href":"https:
\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/62\/revisions\/82"}],"wp:attachment":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/media?parent=62"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/categories?post=62"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/tags?post=62"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}