{"id":75,"date":"2025-03-03T14:12:09","date_gmt":"2025-03-03T14:12:09","guid":{"rendered":"https:\/\/angirash.in\/blog\/2025\/08\/03\/where-your-ai-budget-quietly-leaks\/"},"modified":"2026-04-10T19:29:14","modified_gmt":"2026-04-10T19:29:14","slug":"where-your-ai-budget-quietly-leaks","status":"publish","type":"post","link":"https:\/\/angirash.in\/blog\/2025\/03\/03\/where-your-ai-budget-quietly-leaks\/","title":{"rendered":"Where Your AI Budget Quietly Leaks (and How to Plug It)"},"content":{"rendered":"<h2>The quiet bleed<\/h2>\n<p>Most AI invoices don\u2019t explode. They bleed. A few extra tokens here, a lazy top_k there, a GPU pool idling at 6 percent because someone hard-coded min replicas. You won\u2019t notice in a PoC. You will at 10x traffic, when finance asks why the margin on your \u201cAI feature\u201d is upside down.<\/p>\n<p>If your unit economics feel fuzzy, this is for you.<\/p>\n<h2>Where the waste shows up (and why)<\/h2>\n<ul>\n<li><a href=\"https:\/\/angirash.in\/blog\/2025\/03\/19\/token-costs-what-actually-moves-the-needle-in-production\/\">Token bloat in prompts and retrieval<\/a>: 3\u20135x more context than needed, default max_tokens left high, top_k=20 with chunky overlaps. Looks harmless, scales brutally.<\/li>\n<li>Over-modeling: using a flagship LLM for classification, routing, or formatting. 
Latency and cost both suffer.<\/li>\n<li><a href=\"https:\/\/angirash.in\/blog\/2025\/06\/01\/hidden-bottlenecks-multi-agent-ai-systems\/\">Agent loops and orchestration retries<\/a>: helpful on paper, but a bad tool schema or no step cap turns into runaway bills.<\/li>\n<li>Vector DB churn: re-embedding the world, wide-dimensional embeddings for simple tasks, storing full documents in the index instead of pointers.<\/li>\n<li>Idle or misfit infra: <a href=\"https:\/\/angirash.in\/blog\/2025\/08\/14\/true-cost-self-hosting-llms-vs-apis\/\">GPUs online 24\/7 for daytime traffic, no batching<\/a>, running small models on A100s, provisioned concurrency never scaling down.<\/li>\n<li>Blind spots: <a href=\"https:\/\/angirash.in\/blog\/2025\/02\/18\/why-scaling-ai-systems-increases-cost-nonlinearly\/\">no per-request cost tracing<\/a>, no dashboards for cost by feature, no idea which prompts or tenants are burning money.<\/li>\n<\/ul>\n<p>This happens because production defaults reward convenience. Frameworks optimize for \u201cit works,\u201d not \u201cit\u2019s sustainable.\u201d Most teams also misunderstand where cost actually accrues. Context is usually more expensive than you think. Tool calls are often the hidden multiplier. And retrieval quality, not model size, is what cuts cost the most.<\/p>\n<h2>The technical deep dive: where money leaks in real systems<\/h2>\n<h3>1) Prompt and token footprint<\/h3>\n<ul>\n<li>Fat system prompts that repeat policy paragraphs on every call.<\/li>\n<li>Overlong contexts from RAG due to large chunks (1k+ tokens) with 20\u201330% overlap.<\/li>\n<li>Top_k too high with no MMR or dedupe, so the LLM reads near-duplicates.<\/li>\n<li>Default max_tokens set to 1024 when the response needs ~120. 
No stop sequences, so the model rambles.<\/li>\n<li>Verbose tool schemas: returning full JSON blobs where a small ID would do.<\/li>\n<\/ul>\n<p>Failure modes: latency spikes, per-request cost variance, blown-out tails under load when batching meets oversized prompts.<\/p>\n<p>Trade-off: aggressive truncation can degrade quality if you don\u2019t evaluate. But most orgs have 30\u201350% pure fluff.<\/p>\n<h3>2) Orchestration and agents<\/h3>\n<ul>\n<li>Multi-step chains that call the LLM for simple glue logic (routing, validation) instead of lightweight heuristics.<\/li>\n<li>No guardrails on tool use. Agents run plan-replan loops with no step limit or cost ceiling.<\/li>\n<li>Naive retry logic that replays full prompts on transient errors without jitter or circuit breakers.<\/li>\n<\/ul>\n<p>Failure modes: n\u00d7 cost multipliers, hard-to-reproduce bills. Feels like \u201cwe got more accuracy,\u201d but real gains are unclear. One of the <a href=\"https:\/\/angirash.in\/blog\/2025\/06\/15\/common-mistakes-in-ai-architecture-design\/\">common AI architecture pitfalls<\/a> is not properly accounting for data quality, which can lead to misleading results. Neglecting scalability can likewise produce systems that crumble under increased load. As projects grow, initial design flaws become more pronounced, leading to significant rework and frustration.<\/p>\n<h3>3) Retrieval and embeddings<\/h3>\n<ul>\n<li>Using high-dim embeddings (e.g., 1.5k dims) for problems where 384 works fine.<\/li>\n<li>Storing raw documents in the vector DB. You pay for what you store and what you pull back.<\/li>\n<li>Frequent re-embeddings for minor content edits; no TTL, no diff-based updates.<\/li>\n<li><a href=\"https:\/\/angirash.in\/blog\/2025\/02\/24\/chunking-that-actually-improves-retrieval\/\">Overlap-heavy chunking that inflates index size<\/a> and retrieval redundancy.<\/li>\n<\/ul>\n<p>Failure modes: vector write\/read bills grow faster than traffic. 
More retrieved text also raises token cost downstream.<\/p>\n<p>Trade-off: smaller embeddings reduce index cost but may lower recall if your domain needs nuance. Measure it; don\u2019t guess.<\/p>\n<h3>4) Model selection<\/h3>\n<ul>\n<li>Always picking a flagship model because \u201cquality.\u201d Many tasks are routing, formatting, light summarization. A small or mid-tier model is enough.<\/li>\n<li>No routing layer. Every feature uses the same model regardless of input complexity.<\/li>\n<li>Ignoring early-exit strategies: ask a small model first and escalate only when confidence is low.<\/li>\n<\/ul>\n<p>Failure modes: cost scales with traffic linearly when it could be sub-linear with a cascade.<\/p>\n<h3>5) Infrastructure and runtime<\/h3>\n<ul>\n<li><a href=\"https:\/\/angirash.in\/blog\/2025\/11\/21\/real-cost-breakdown-llm-apps-on-aws\/\">Provisioned GPU fleets with fixed min nodes<\/a>. Day-night traffic patterns alone waste budget, because the fleet sits idle overnight.<\/li>\n<li>No batching on server-side inference; QPS increases model costs superlinearly due to poor utilization.<\/li>\n<li>Running open models unquantized on expensive GPUs where CPU or low-tier GPU would be fine.<\/li>\n<li>Logging every token and payload at debug level into a hot storage tier. You\u2019re paying for observability, twice.<\/li>\n<\/ul>\n<p>Failure modes: high idle burn, unpredictable p99, infra cost dwarfing model bills at steady state.<\/p>\n<h3>6) Data egress and vendor edges<\/h3>\n<ul>\n<li>Cross-region LLM calls with results piped back to your VPC. Egress can be non-trivial at scale.<\/li>\n<li>Vector DB backups and replicas left at defaults. You rarely need triple replication for a knowledge base you can rebuild.<\/li>\n<li>Object store GET storms during RAG because chunks store full text rather than references.<\/li>\n<\/ul>\n<h2>Practical fixes that actually work<\/h2>\n<h3>Put a budget on every request path<\/h3>\n<ul>\n<li>Define a per-request cost SLO by feature. 
Example: support answer &lt;= $0.005, contract summary &lt;= $0.02.<\/li>\n<li>Break down cost in traces: model input\/output tokens, vector reads\/writes, egress, tool API costs. Tag by tenant and feature.<\/li>\n<\/ul>\n<h3>Slim the prompt, control the hose<\/h3>\n<ul>\n<li>Compress system prompts. Move policy to a short instruction; keep long policy server-side for auditing, not in every call.<\/li>\n<li>Set realistic max_tokens, add stop sequences. Enforce output length contracts when possible.<\/li>\n<li>Reduce top_k. Start at 4\u20136 with MMR and dedupe. If quality drops, fix retrieval first.<\/li>\n<li>Return IDs from tools, not full payloads. Fetch details only if needed downstream.<\/li>\n<\/ul>\n<h3>Tune retrieval like it\u2019s a search system (because it is)<\/h3>\n<ul>\n<li>Chunk by semantic boundaries, keep overlap minimal (&lt;= 10\u201315%) unless your domain truly needs continuity.<\/li>\n<li>Hybrid search with filters beats cranking top_k. Use metadata filters to cut noise.<\/li>\n<li>Pre-rank candidates via lightweight heuristics and feed fewer passages to the LLM.<\/li>\n<li>Store references in the vector DB; keep raw text in object storage. Pull only what you need.<\/li>\n<\/ul>\n<h3>Right-size embeddings<\/h3>\n<ul>\n<li>Use smaller-dim models for general semantic search; measure recall and MRR against a labeled set.<\/li>\n<li>Batch embeddings and enable diff\/TTL for re-embeds. Don\u2019t re-embed a 100k corpus for a typo.<\/li>\n<\/ul>\n<h3>Route models, don\u2019t worship them<\/h3>\n<ul>\n<li>Introduce a router: small model first, escalate to larger only when confidence is low or the task class demands it.<\/li>\n<li>Classify tasks cheaply: formatting, extraction, and routing rarely need premium models.<\/li>\n<li>Shadow test cheaper models on production traffic. Track acceptance rate and objection-worthy errors.<\/li>\n<\/ul>\n<h3>Stop agent runaways<\/h3>\n<ul>\n<li>Cap steps and set a per-interaction cost ceiling. 
Hard stops beat surprise invoices.<\/li>\n<li>Provide tight tool schemas with clear preconditions. Disallow free-form tool arguments.<\/li>\n<li><a href=\"https:\/\/angirash.in\/blog\/2025\/07\/14\/caching-strategies-for-llm-systems-that-actually-work\/\">Cache tool results by normalized inputs<\/a>. Many external lookups are repeatable.<\/li>\n<\/ul>\n<h3>Make infra earn its keep<\/h3>\n<ul>\n<li>Use server runtimes that support batching and paged attention for open models. vLLM-class servers change unit economics.<\/li>\n<li>Quantize where possible (AWQ\/GPTQ\/8-bit KV). Validate quality on your eval set.<\/li>\n<li>Scale to zero for low-duty features. Separate background jobs from interactive paths to keep concurrency sane.<\/li>\n<li><a href=\"https:\/\/angirash.in\/blog\/2025\/07\/18\/gpu-vs-cpu-ai-cost-performance-tradeoffs\/\">Right-size nodes<\/a>. Don\u2019t serve 7B models on the same hardware as 70B unless you have a reason.<\/li>\n<\/ul>\n<h3>Instrument cost like a first-class SLO<\/h3>\n<ul>\n<li><a href=\"https:\/\/angirash.in\/blog\/2025\/11\/12\/ai-observability-stop-guessing-start-instrumenting\/\">Add cost to tracing spans<\/a>. Emit model provider cost headers and your own estimates if headers are missing.<\/li>\n<li>Dashboards: cost by feature, tenant, route decision, and request size buckets. Alert on drift.<\/li>\n<li>Keep an eval set aligned to your use cases. Re-run after any change that affects tokens, routing, or retrieval.<\/li>\n<\/ul>\n<h3>Procurement and vendor hygiene<\/h3>\n<ul>\n<li>Negotiate committed use for high-volume endpoints. Check region alignment to minimize egress.<\/li>\n<li>Review vector DB replica and backup policies quarterly. Aim for what you actually need, not what the default wants.<\/li>\n<li>Watch logging\/storage tiers. 
Move verbose logs to cold storage quickly.<\/li>\n<\/ul>\n<h2>Business impact you can bank<\/h2>\n<p>I\u2019ve seen these changes deliver, repeatedly:<\/p>\n<ul>\n<li>Prompt and retrieval diet: 25\u201350% cost reduction, often with better accuracy because you removed noise.<\/li>\n<li>Model routing cascades: 30\u201370% lower model spend with negligible quality loss when tuned on real evals.<\/li>\n<li>Infra right-sizing and batching: 2\u20135x throughput per node, which means fewer nodes or better headroom.<\/li>\n<li>Embedding and index fixes: 40\u201360% lower vector DB costs and faster queries.<\/li>\n<\/ul>\n<p>Latency usually improves because you\u2019re moving less data and making fewer hops. The main scaling risk you remove is linear cost growth with traffic. With cascades, batching, and slimmer prompts, cost growth bends.<\/p>\n<h2>Key takeaways<\/h2>\n<ul>\n<li>Your biggest line item is almost always unnecessary tokens, not the model list price.<\/li>\n<li>Retrieval quality controls cost. Fix RAG before you blame the LLM.<\/li>\n<li>Small-first model routing beats one-size-fits-all. Measure, then escalate.<\/li>\n<li>Agent loops need hard limits and tool discipline.<\/li>\n<li>Batch, quantize, and scale down. Idle infra is silent budget burn.<\/li>\n<li>Add cost to your traces. If you can\u2019t see cost per request, you can\u2019t manage it.<\/li>\n<\/ul>\n<h2>If this sounds familiar<\/h2>\n<p>If you\u2019re staring at a bill that doesn\u2019t match your roadmap, or quality dips whenever you cut costs, I\u2019ve been there with other teams. This is exactly the kind of thing I help fix when systems start breaking at scale. Happy to take a look at your stack and find the easy wins before you refactor the world.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The quiet bleed Most AI invoices don\u2019t explode. They bleed. 
A few extra tokens here, a lazy top_k there, a GPU pool idling at 6 percent because someone hard-coded min&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[7],"tags":[17,22,21],"class_list":["post-75","post","type-post","status-publish","format-standard","hentry","category-ai-cost-optimization","tag-ai-cost","tag-ai-observability","tag-llmops"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/75","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/comments?post=75"}],"version-history":[{"count":3,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/75\/revisions"}],"predecessor-version":[{"id":193,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/75\/revisions\/193"}],"wp:attachment":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/media?parent=75"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/categories?post=75"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/tags?post=75"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}