{"id":76,"date":"2025-03-19T14:27:08","date_gmt":"2025-03-19T14:27:08","guid":{"rendered":"https:\/\/angirash.in\/blog\/2025\/03\/19\/token-costs-what-actually-moves-the-needle-in-production\/"},"modified":"2025-03-19T14:27:08","modified_gmt":"2025-03-19T14:27:08","slug":"token-costs-what-actually-moves-the-needle-in-production","status":"publish","type":"post","link":"https:\/\/angirash.in\/blog\/2025\/03\/19\/token-costs-what-actually-moves-the-needle-in-production\/","title":{"rendered":"Token costs: what actually moves the needle in production"},"content":{"rendered":"<h2>The real problem<\/h2>\n<p>If your LLM bill surprised you last month, it probably was not the flashy features. It was the quiet stuff you never show the user: bloated system prompts, oversized retrieval chunks, tool outputs pasted back into the model, and history that keeps getting re-injected on every turn.<\/p>\n<p>On one client system, 72 percent of tokens per request were invisible to the user. They were paying for embeddings to over-retrieve, a fat system message, 10 pages of doc snippets, previous 8 turns, and a verbose JSON tool response the model never needed. Everyone was trying to shorten the final answer. Wrong target.<\/p>\n<h2>Where token waste shows up and why<\/h2>\n<ul>\n<li>Chat assistants with memory: history grows linearly with each turn if you do not summarize or anchor facts<\/li>\n<li>RAG: embeddings over-retrieve, chunking is off, no re-rankers, and you paste full pages<\/li>\n<li>Tools: you call a data API and dump 5k tokens of JSON back into the LLM<\/li>\n<li>Multi-agent and evaluators: extra passes without a token budget or exit criteria<\/li>\n<li>Safety and formatting: legal disclaimers and markdown templates added to every turn<\/li>\n<\/ul>\n<p>What most teams misunderstand<\/p>\n<ul>\n<li>Output is not the main driver. Input often dominates. 
Output tokens are pricey, but your system prompt plus context often outweighs the answer length by 2 to 5x<\/li>\n<li>Long-context models are not free. The convenience tax is real: long-context tiers often charge more per token, and filling a 200k window when 10k of well-chosen context would do can silently triple cost<\/li>\n<li>Chunk size and overlap are not cosmetic. They decide if you pay for 2 pages or 20<\/li>\n<li>Caching saves real money only if you keep prompts stable. Random IDs and timestamps kill cache hits<\/li>\n<\/ul>\n<h2>Technical deep dive: what actually drives cost<\/h2>\n<p>Think in layers. A single call usually looks like this:<\/p>\n<p>1) Static system content<br \/>\n&#8211; Policy text, tone, tooling instructions<br \/>\n&#8211; Typical waste: 800 to 3k tokens every turn because no one separates static and dynamic parts or caches them<\/p>\n<p>2) Conversation state<br \/>\n&#8211; Full transcript vs compact state<br \/>\n&#8211; Typical waste: linear growth per turn when you reattach every message<\/p>\n<p>3) Retrieval context<br \/>\n&#8211; Top K chunks, often from page-level retrieval with large overlap<br \/>\n&#8211; Typical waste: 3k to 10k tokens because of aggressive K, no re-rank, and pasting full documents<\/p>\n<p>4) Tool results<br \/>\n&#8211; JSON blobs, CSV dumps, screenshots turned into captions<br \/>\n&#8211; Typical waste: 1k to 6k tokens because you returned full rows and all fields<\/p>\n<p>5) Model output<br \/>\n&#8211; Usually shorter than everything above, yet gets all the blame<\/p>\n<p>Trade-offs you should acknowledge<\/p>\n<ul>\n<li>Smaller model plus a planner pass can beat a single big model with a huge context, but only if you cap retries and control what goes to each pass<\/li>\n<li>Aggressive summarization reduces spend but risks drift. Extractive compression and quoting spans are safer than abstractive paraphrasing for compliance<\/li>\n<li>Long-context models simplify engineering but hide retrieval mistakes and cost more per token. 
They also invite lazy prompting<\/li>\n<\/ul>\n<p>Failure modes I keep seeing<\/p>\n<ul>\n<li>Summaries drift facts after 4 or 5 turns, then your agent confidently acts on fiction<\/li>\n<li>Caching configured but invalidated by non-deterministic whitespace, timestamps, or shuffled tool lists<\/li>\n<li>Function calling loops where the model keeps asking the same tool with slight parameter variations<\/li>\n<li>Over-truncation that cuts citations, triggering hallucinations and higher retries later<\/li>\n<\/ul>\n<h2>Practical fixes that move the bill<\/h2>\n<p>1) Put a hard token budget per request<br \/>\n&#8211; Example: 2.5k input, 400 output. Split input as 600 system, 600 history summary, 1.0k retrieval, 300 tool results<br \/>\n&#8211; If any layer breaches its budget, trim that layer first instead of truncating blindly at the end<\/p>\n<p>2) Stop paying for fat system prompts every turn<br \/>\n&#8211; Split system content into static core and per-request deltas<br \/>\n&#8211; Use provider prompt caching where available and keep the static segment byte-identical<br \/>\n&#8211; Remove policies that can live in code. Keep only what the model must read<\/p>\n<p>3) Fix retrieval before you tune prompts<br \/>\n&#8211; Use small chunks with low overlap. 600 to 1,200 tokens per chunk is a good starting band<br \/>\n&#8211; Add a cross-encoder re-ranker to cut K down to 3 to 5 chunks, not 10 to 20<br \/>\n&#8211; Extractive compression: include only quoted spans + a short header per chunk<br \/>\n&#8211; Prefer section-level retrieval over page-level. Pages inflate tokens and dilute relevance<br \/>\n&#8211; If you need summaries, compress with a cheaper model and keep links back to original spans<\/p>\n<p>4) Stop dumping tool results into the LLM<br \/>\n&#8211; Ask tools to return only the fields you need<br \/>\n&#8211; Cap rows. 
If you need to reason about aggregates, compute them in code<br \/>\n&#8211; For structured tasks, use function calling to request precise fields, then render user-facing text on the server<\/p>\n<p>5) Conversation state that does not grow unbounded<br \/>\n&#8211; Maintain a rolling summary of facts + decisions. Keep references to full messages off-model<br \/>\n&#8211; Store machine-readable state separately. Only pass diffs or keys, never the entire object each turn<br \/>\n&#8211; TTL older turns aggressively unless the user explicitly asks to revisit<\/p>\n<p>6) Output control that the model actually follows<br \/>\n&#8211; Prefer JSON schema or function calling with enums and short field names over free text<br \/>\n&#8211; Set stop sequences to cut standard boilerplate. For example, stop at \\n\\n if you only want a single paragraph<br \/>\n&#8211; Max tokens should be set. Do not let the model free run because you are scared of truncation. If the task needs more, plan multiple short calls<\/p>\n<p>7) Router patterns that earn their keep<br \/>\n&#8211; Use a small model for classification and task routing<br \/>\n&#8211; Only escalate to a large model when uncertainty or task type requires it<br \/>\n&#8211; Make the router cheap and deterministic. Thresholds should be tuned offline with evals, not in prod by vibes<\/p>\n<p>8) Make caching real, not theoretical<br \/>\n&#8211; Keep cacheable segments stable: same tool order, no timestamps, no random IDs<br \/>\n&#8211; Providers now support partial message caching. Segment prompts so that static chunks are cacheable<br \/>\n&#8211; Log cache hit rates by path. If you are not seeing 30 to 60 percent hits on static-heavy prompts, you are probably invalidating accidentally<\/p>\n<p>9) Tokenization discipline<br \/>\n&#8211; Measure real token counts per layer with the exact tokenizer used by the provider<br \/>\n&#8211; Strip boilerplate. 
Collapse whitespace and markdown noise<br \/>\n&#8211; Replace verbose labels with short keys. Replace repeated legal text with a single reference line if policy allows<br \/>\n&#8211; Avoid base64 and other large inline binary-as-text payloads. Pass references and fetch out of band<\/p>\n<p>10) Multimodal is a silent tax<br \/>\n&#8211; Images often explode into long captions. Downsample, crop to regions, or OCR to text if you only need fields<br \/>\n&#8211; Do not send 10 screenshots when 2 will do. One client cut 70 percent of vision tokens by cropping to the table region they actually needed<\/p>\n<p>11) Retry strategy with budgets<br \/>\n&#8211; Cap total tokens per user action across retries<br \/>\n&#8211; Retry with a smaller context, not the same bloated one<br \/>\n&#8211; Add loop guards for function calling. Detect repetitive tool requests and stop<\/p>\n<h2>A quick architecture sketch that works<\/h2>\n<ul>\n<li>Pre-step: Router on small model determines task type<\/li>\n<li>Retrieval: BM25 + embedding recall to top 50, cross-encoder re-rank to top 5, extractive compress to 900 tokens total<\/li>\n<li>Prompt: Static system core cached, compact state, minimal tools, compressed context<\/li>\n<li>Call 1: Big model for reasoning, function call for any API I\/O<\/li>\n<li>Call 2: Optional verifier on small model to check constraints. If it fails, fix with delta-only context<\/li>\n<li>Output: Structured JSON to server, server renders user text<\/li>\n<\/ul>\n<p>This pattern consistently drops input tokens by 40 to 70 percent without quality loss. In some cases, quality improves because the model is not drowning in irrelevant context.<\/p>\n<h2>What this looks like in numbers<\/h2>\n<p>A real case from Q4:<\/p>\n<ul>\n<li>Before: 6.8k input, 900 output per turn, $0.010 per 1k input, $0.030 per 1k output. Cost per turn about $0.095<\/li>\n<li>After: 2.1k input, 450 output per turn. Cost per turn about $0.035<\/li>\n<li>At 80k turns per day, that saved roughly $4,800 daily and improved P95 latency by 28 percent, since streaming fewer tokens is faster<\/li>\n<\/ul>\n<p>Your numbers will differ by provider, but the pattern holds. The big win is almost always input reduction, not squeezing the final answer.<\/p>\n<h2>Business impact and scaling risks<\/h2>\n<ul>\n<li>Predictability: Without per-layer budgets, costs scale superlinearly with usage spikes. FinOps will hunt you<\/li>\n<li>Latency: Every 1k fewer tokens shaves real time. That compounds across multi-pass pipelines<\/li>\n<li>Quality: Long context is not free quality. It raises cost and hides retrieval bugs. A tight context with good re-ranking usually beats a massive paste<\/li>\n<li>Risk: Summarization drift can contaminate decisions. If you must summarize, keep references and use extractive compression for critical data<\/li>\n<\/ul>\n<h2>Key takeaways<\/h2>\n<ul>\n<li>Set a token budget per request and enforce it per layer<\/li>\n<li>Cache static system content and keep it byte-identical<\/li>\n<li>Fix retrieval first. Re-rank and compress instead of pasting pages<\/li>\n<li>Keep tool outputs tiny. Compute in code, not in the LLM<\/li>\n<li>Maintain compact conversation state with TTL and diffs<\/li>\n<li>Use structured outputs, stop sequences, and strict max_tokens<\/li>\n<li>Route to small models by default and escalate only when needed<\/li>\n<li>Track cache hit rate, token mix by layer, and retries per request<\/li>\n<\/ul>\n<h2>If this resonates<\/h2>\n<p>If your LLM bill is driven by invisible tokens and the fixes above feel doable but not trivial in your stack, that is normal. I help teams design token-aware architectures, set budgets that hold in production, and bring costs down without gutting quality. 
If you are running into similar issues at scale, get in touch.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The real problem If your LLM bill surprised you last month, it probably was not the flashy features. It was the quiet stuff you never show the user: bloated system&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[7],"tags":[17,15,14],"class_list":["post-76","post","type-post","status-publish","format-standard","hentry","category-ai-cost-optimization","tag-ai-cost","tag-ai-system-design","tag-llm"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/76","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/comments?post=76"}],"version-history":[{"count":0,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/76\/revisions"}],"wp:attachment":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/media?parent=76"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/categories?post=76"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/tags?post=76"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}