The uncomfortable truth about scaling AI
Your POC looks cheap. A few cents per request. Then you ship to 100k users, layer in retrieval, add tool use, tighten SLOs, and the monthly bill looks nothing like linear growth. I have watched teams double traffic and see 3 to 5x COGS. Not because of vendor bait and switch, but because the system architecture amplifies cost as it scales.
This post breaks down why that happens and what to change before it shows up on your invoice.
Where the nonlinearity shows up
- Fan out grows faster than you think. Tool calls, RAG, moderation, function-router chains. Small multipliers stack.
- Tail latency SLOs force headroom and kill batching. You pay for idle and waste throughput.
- Token growth sneaks in. Prompts get longer, context windows increase, generation limits creep.
- Retrieval infrastructure replicates. More tenants and segments mean more replicas, more cache misses, more egress.
- Safety and evaluation layers compound. One more checker per request seems harmless until you run 1k RPS.
What most teams miss: the expensive part is not the single model call. It is the orchestration and the tail effects that break your efficiency curves.
Why it happens in real systems
1) Fan out amplification
A simple mental model:
- Base request calls the LLM once.
- Add routing to pick a model: +1 call on 30% of traffic.
- Add RAG with top-k 8: each request adds 1 vector query, 1 re-ranker call, and the LLM input inflates.
- Add tool use with planner + executor: on average 1.4 extra model calls per request.
- Add moderation pre and post: +2 classifier calls.
Your effective model calls per request are not 1. They climb from 1.0 through 1.2, 1.6, and 2.4 depending on the path. Even if each component is cheap, concurrency and token growth make the sum expensive at scale.
Failure mode: retry storms. At p99 timeouts you fan out again, multiplying cost and blowing SLOs.
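The fan-out arithmetic above reduces to a weighted sum. A minimal sketch, where every per-layer probability and call count is an illustrative assumption, not a measurement from a real system:

```python
# Back-of-envelope fan-out model. The per-layer probabilities and extra-call
# counts below are illustrative assumptions, not measured values.

def expected_calls(layers: list[tuple[float, float]]) -> float:
    """Average model calls per request: 1 base call, plus each layer's
    (probability a request hits it) x (extra calls when it does)."""
    return 1.0 + sum(p * extra for p, extra in layers)

layers = [
    (0.30, 1.0),  # model-based router on 30% of traffic
    (0.50, 1.0),  # RAG re-ranker on half of requests
    (0.25, 1.4),  # planner + executor on complex queries only
    (1.00, 2.0),  # moderation pre and post, every request
]
print(expected_calls(layers))  # ~4.15 model calls per "single" request
```

Counting calls, not cost: the classifiers are cheap per call, but at 1k RPS every fractional multiplier in that list becomes a fleet you pay for.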
2) Tail latency and utilization collapse
When you commit to p95 or p99 latency, you need slack capacity. Slack breaks batching. With smaller batches, your token throughput per GPU drops. You end up with 30 to 50% lower throughput compared to lab numbers. That gap is your nonlinearity.
I have seen teams push p95 from 1.4s to 900ms and watch GPU cost jump 40% for the same RPS. Not because the model got slower, but because batching and context padding fell apart.
3) Context growth and padding waste
Inference cost is roughly linear in total tokens, but the KV-cache memory footprint and the variance in sequence lengths wreck packing efficiency. Longer, more variable inputs mean more padding and fragmented batches. Turn on larger context windows and you also inflate memory, shrink batch size, and pay with more instances.
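The padding effect is easy to see with a static batch, where every sequence is padded to the longest in the batch. A toy calculation (continuous batching mitigates this, but length variance still hurts packing):

```python
# Fraction of batch tokens that are padding when every sequence is padded
# to the longest in the batch. Lengths below are made-up examples.

def padding_waste(seq_lens: list[int]) -> float:
    """Share of the batch's token slots wasted on padding."""
    padded = max(seq_lens) * len(seq_lens)  # slots actually allocated
    real = sum(seq_lens)                    # slots carrying real tokens
    return 1.0 - real / padded

print(padding_waste([512, 512, 512, 512]))   # 0.0: uniform batch, no waste
print(padding_waste([128, 256, 512, 4096]))  # ~0.70: one outlier dominates
```

One long-context request in a batch of short ones can waste most of the compute, which is why mixed workloads with wildly different max tokens belong in separate pools.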
4) Retrieval cost stacks in the background
- Vector DB replicas increase with tenants and segmentation for compliance.
- Re-ranking models add tokens and latency on hot paths.
- Egress and compression overheads for moving chunks into the model are not free.
- Index refresh and embedding backfills compete with online queries and push you to overprovision.
5) Safety, evaluation, and governance
Moderation, PII redaction, content filters, and offline evals are cheap per call but constant per request. Once you hit high throughput, they show up as a real percentage of COGS.
6) Autoscaling and cold starts
Autoscalers react to CPU or requests per second, not tokens per second. Your workloads are token bound. You scale late, miss batching windows, and pay with headroom. Cold starts on GPUs are painful. Warm pools cost money. Both directions are nonlinear with traffic shifts.
What most teams misunderstand
- The model is not the system. Orchestration and data paths dominate at scale.
- Average latency is not the constraint. Tail latency sets capacity and batching.
- Token pricing hides utilization losses. Two systems with the same token count can have 2x different COGS.
- Caching is not a silver bullet if personalization and freshness requirements are high.
Technical deep dive: the real cost function
A rough per-request cost model I use in reviews:
Cost ≈ (tokens_in + tokens_out) × price_per_token × (1/E) + FanOut × C_other + SLO_headroom × Infra_overhead
- E is effective batching efficiency. It drops with tail SLOs, variance of sequence lengths, and mixed tenant QoS.
- FanOut captures moderation, routing, tools, re-ranking, retries.
- SLO_headroom is the reserved capacity to meet p95 or p99. It is invisible on dev boxes and very visible on your invoice.
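The model above, as a function you can drop into a review spreadsheet. Every constant is a placeholder; plug in your own token counts, prices, and measured E:

```python
# Per-request cost model from the formula above. All numeric inputs in the
# example call are placeholders, not real prices or measurements.

def request_cost(
    tokens_in: int,
    tokens_out: int,
    price_per_token: float,
    E: float,               # effective batching efficiency, 0 < E <= 1
    fan_out_calls: float,   # avg auxiliary model calls per request
    c_other: float,         # avg cost per auxiliary call
    slo_headroom: float,    # reserved capacity fraction for p95/p99
    infra_overhead: float,  # infra cost attributed per request
) -> float:
    llm = (tokens_in + tokens_out) * price_per_token / E
    return llm + fan_out_calls * c_other + slo_headroom * infra_overhead

# Same token count, different E: utilization losses hide behind token pricing.
healthy = request_cost(2000, 500, 2e-6, E=0.8, fan_out_calls=1.4,
                       c_other=0.0004, slo_headroom=0.3, infra_overhead=0.002)
thin = request_cost(2000, 500, 2e-6, E=0.4, fan_out_calls=1.4,
                    c_other=0.0004, slo_headroom=0.3, infra_overhead=0.002)
print(healthy, thin)  # identical tokens, nearly 2x the cost at low E
```

This is the "two systems with the same token count, 2x different COGS" effect in miniature: halving E almost doubles the per-request bill without a single extra token.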
Failure modes in the wild:
- Retry amplification, especially with client-side exponential backoff and server-side async queues. Doubles cost at peak.
- Unbounded tool loops. The planner calls itself and burns tokens.
- Top-k set too high by default. Users see no benefit; you pay for chunks and re-ranks.
- Embedding storms during backfills. Online traffic competes with batch jobs and the autoscaler panics.
Practical fixes that actually move the needle
Control fan out
- Put a hard budget on model calls per request. Abort on budget breach with a graceful message.
- Gate tools behind cheap classifiers or heuristics. Only plan when the query is actually complex.
- Disable moderation on trusted internal tenants, keep it for public traffic.
- Collapse routers. Prefer deterministic routing rules to model-based routers when possible.
In one deployment, adding a request budget and a tool-use gate cut average LLM calls from 2.1 to 1.3 and reduced COGS 28% with zero quality drop.
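A hard call budget is a few lines of code. A minimal sketch, where `call_model` is a hypothetical stand-in for your LLM client:

```python
# Per-request model-call budget: every model call goes through the budget,
# and a breach degrades gracefully instead of running away on cost.
# `call_model` below is a hypothetical LLM client, not a real API.

class BudgetExceeded(Exception):
    pass

class CallBudget:
    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.used = 0

    def spend(self) -> None:
        if self.used >= self.max_calls:
            raise BudgetExceeded(f"model-call budget of {self.max_calls} hit")
        self.used += 1

def handle_request(budget: CallBudget, planned_calls: int) -> str:
    try:
        for _ in range(planned_calls):
            budget.spend()      # charge the budget before each model call
            # call_model(...)   # hypothetical LLM call would happen here
        return "ok"
    except BudgetExceeded:
        return "degraded"       # graceful fallback message, not a retry storm

print(handle_request(CallBudget(4), planned_calls=3))  # ok
print(handle_request(CallBudget(4), planned_calls=9))  # degraded
```

The budget also doubles as a circuit breaker for unbounded tool loops: a planner that calls itself hits the ceiling instead of burning tokens indefinitely.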
Constrain context without hurting quality
- Separate instructions from history. Do not resend the whole transcript.
- Summarize history aggressively with a smaller local model.
- Use retrieval filters before top-k. Aim for k=3 to 5, not 8 to 20. Add a confidence threshold to skip RAG entirely when irrelevant.
- Apply prompt compression or template discipline. Stop shipping verbose system prompts.
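The retrieval filter and confidence threshold can be combined in one gate. A sketch, where `search` stands in for a hypothetical vector-store client and the threshold is an assumption you would tune:

```python
# Retrieval gate: cap k at a small number and skip RAG entirely when the
# best match is weak. `search` and `min_score` are illustrative assumptions.

def gated_retrieve(search, query: str, k: int = 4, min_score: float = 0.55):
    """Return at most k chunks, or nothing when retrieval looks irrelevant."""
    hits = search(query, k=k)             # [(score, chunk), ...], best first
    if not hits or hits[0][0] < min_score:
        return []                         # skip RAG: no chunks, no re-rank tax
    return [chunk for score, chunk in hits if score >= min_score]

# Fake search backend for illustration only.
def fake_search(query, k):
    return [(0.9, "relevant"), (0.6, "maybe"), (0.3, "noise")][:k]

print(gated_retrieve(fake_search, "billing question"))  # ['relevant', 'maybe']
```

When the gate returns nothing, the request skips the re-ranker and the context inflation entirely, which is where the savings come from.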
Tiered model routing that is predictable
- Route 60 to 80% of traffic to a smaller, fast model. Escalate only on uncertainty or specific intents.
- Pre-compute intent with a tiny classifier, not another large model call.
- Lock configs per tenant. No global flip that accidentally moves everyone to the expensive tier.
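The routing rules above can stay deterministic and auditable. A sketch, with invented intent names and an assumed confidence threshold:

```python
# Deterministic tiered routing: a cheap intent classifier's output decides
# the tier; escalate only on listed intents or low confidence. The intent
# names and the 0.7 threshold are illustrative assumptions.

ESCALATE_INTENTS = {"legal", "code_review"}  # paths that always need the big model

def pick_model(intent: str, confidence: float) -> str:
    if intent in ESCALATE_INTENTS:
        return "large"
    if confidence < 0.7:        # classifier unsure: escalate, don't guess
        return "large"
    return "small"              # default tier: fast and cheap

print(pick_model("faq", 0.93))    # small
print(pick_model("faq", 0.42))    # large
print(pick_model("legal", 0.99))  # large
```

Because the rules are plain data, you can lock them per tenant and diff them in code review; there is no model-based router to drift or to accidentally flip everyone to the expensive tier.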
Retrieval architecture that scales
- Shard by tenant and hot content. Keep hot shards in memory, cold shards on cheaper storage.
- Tune replication factor per tier. Public multi-tenant needs more replicas, private single-tenant can use fewer.
- Batch embeddings and rate limit backfills. Keep a separate pool for offline jobs.
- Use lightweight re-rankers. Heavy cross-encoders on every request are a tax.
Get real about batching and GPUs
- Use inference servers that support continuous batching and paged attention. vLLM, TensorRT-LLM, or vendor equivalents.
- Quantize where acceptable. INT4 or FP8 can double throughput if quality holds for your tasks.
- Bin-pack models carefully. Mixed workloads with wildly different max tokens destroy packing. Split them.
- Autoscale on tokens per second and queue depth, not just CPU or RPS. Warm pools sized to peak concurrency, not average.
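A token-aware scaling signal is simple to compute. A sketch, where the per-replica capacities are placeholders you would benchmark yourself:

```python
# Scaling signal from tokens/sec and queue depth rather than CPU or RPS.
# replica_tps and queue_per_replica are placeholder capacities; benchmark
# your own serving stack to get real numbers.

import math

def desired_replicas(
    tokens_per_sec: float,
    queue_depth: int,
    replica_tps: float = 20_000,  # tokens/sec one replica sustains at full batch
    queue_per_replica: int = 50,  # queued requests one replica can absorb
    min_warm: int = 2,            # warm-pool floor to dodge GPU cold starts
) -> int:
    by_tokens = tokens_per_sec / replica_tps
    by_queue = queue_depth / queue_per_replica
    return max(min_warm, math.ceil(max(by_tokens, by_queue)))

print(desired_replicas(tokens_per_sec=130_000, queue_depth=40))  # 7
print(desired_replicas(tokens_per_sec=5_000, queue_depth=0))     # 2
```

Feeding this into your autoscaler as a custom metric means you scale when token load rises, not minutes later when CPU finally notices, and the `min_warm` floor is the warm pool priced explicitly instead of discovered via cold-start pain.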
Tighten SLOs without lighting money on fire
- Measure cost vs p95 curve. Many teams pay 30% more for a p95 reduction users cannot perceive.
- Offer two latency tiers. Interactive gets tight SLOs, background gets cheaper lanes.
- Add server-side timeouts and circuit breakers. Idempotency keys to kill duplicate work.
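Idempotency keys are the cheapest of these fixes. A minimal sketch with an in-process dict standing in for what would be Redis or similar in production:

```python
# Idempotency-key dedup: client retries get the stored result instead of
# triggering duplicate model calls. The dict is a stand-in for a shared
# store like Redis in a real deployment.

_results: dict[str, str] = {}

def handle(idempotency_key: str, work) -> str:
    """Run `work` once per key; retries return the cached result."""
    if idempotency_key in _results:
        return _results[idempotency_key]
    result = work()                      # the expensive call happens once
    _results[idempotency_key] = result
    return result

calls = 0
def expensive_model_call():
    global calls
    calls += 1
    return "answer"

first = handle("req-123", expensive_model_call)
retry = handle("req-123", expensive_model_call)
print(first, retry, calls)  # answer answer 1
```

Paired with client-side exponential backoff, this turns a retry storm into cache hits rather than a second full fan-out.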
Caching that actually works
- Cache at the right layer. Retrieval results and prompt expansions are often more cacheable than final generations.
- Use semantic cache only where hit rates justify it. It is not free and can go stale.
- For agents and tool responses, cache tool outputs with content addressing, not LLM outputs.
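Content addressing for tool outputs means hashing the tool name and its arguments into the cache key, so identical tool calls never re-run. A sketch, with an invented `weather_tool` as the example:

```python
# Content-addressed tool cache: the key is a hash of the tool name plus its
# canonicalized arguments. `weather_tool` is a made-up example tool.

import hashlib
import json

_tool_cache: dict[str, object] = {}

def cached_tool(name: str, args: dict, run):
    """Content-address the call; `run` executes the real tool on a miss."""
    key = hashlib.sha256(
        json.dumps([name, args], sort_keys=True).encode()
    ).hexdigest()
    if key not in _tool_cache:
        _tool_cache[key] = run()
    return _tool_cache[key]

runs = 0
def weather_tool():
    global runs
    runs += 1
    return {"temp_c": 21}

a = cached_tool("weather", {"city": "Oslo"}, weather_tool)
b = cached_tool("weather", {"city": "Oslo"}, weather_tool)
print(a == b, runs)  # True 1
```

Because the key depends only on inputs, this works for agents even when the surrounding LLM output varies run to run; `sort_keys=True` keeps argument order from fragmenting the cache.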
Observability and budgets
- Instrument cost per feature flag and per prompt template. Break down tokens by path.
- Add per-request budget tags. Fail fast when a path drifts.
- Track E, your effective batching efficiency. If it drops, find the cause before scaling out.
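Tracking E needs only two measurements: achieved tokens/sec from your serving metrics and a benchmarked per-replica ceiling. A sketch with placeholder numbers:

```python
# Effective batching efficiency E: achieved tokens/sec divided by what the
# fleet could sustain at ideal batching. Both inputs are placeholders here;
# take them from your metrics and your own replica benchmark.

def batching_efficiency(
    achieved_tps: float,       # tokens/sec actually served, fleet-wide
    replicas: int,
    ideal_replica_tps: float,  # benchmarked tokens/sec per replica, full batch
) -> float:
    return achieved_tps / (replicas * ideal_replica_tps)

E = batching_efficiency(achieved_tps=96_000, replicas=8,
                        ideal_replica_tps=20_000)
print(round(E, 2))  # 0.6 -> find the cause before scaling out
```

Alert on a falling E: scaling out at E = 0.6 buys capacity you are already wasting, and the drop usually points at length variance, tail-SLO headroom, or mixed-QoS tenants rather than raw load.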
On a recent engagement, these changes in combination cut token usage by 22% and improved packing efficiency by 35%. Net COGS dropped 38% at the same p95.
Business impact you can model
- Every extra 100ms shaved from p95 can cost 10 to 20% more in GPU spend if batching is already thin. Validate with user metrics before tightening.
- Fan out of 1.3 to 1.6 model calls per request often hides in orchestration. Expect 30 to 60% higher COGS vs a single-call estimate.
- Larger context windows raise memory use and reduce batch size, which can double instance count at peak even if tokens stay constant.
- Retrieval infra grows with tenants and compliance. Replication and egress show up as step functions in cost, not a neat line.
If your margin model assumes linear token cost, you will miss these step changes and tail penalties. That is how unit economics quietly drift.
Key takeaways
- Nonlinearity comes from fan out, tail latency headroom, batching collapse, and retrieval replication.
- Put hard budgets on calls, context, and retries. Most waste is controllable.
- Route to smaller models by default. Escalate on uncertainty, not by habit.
- Scale on tokens per second and queue depth. Warm pools and bin packing matter more than you think.
- Cache upstream of the model when possible. Cache tool outputs and retrieval, not only final text.
- Measure effective batching efficiency and cost per path. Optimize the system, not just the model.
If this sounds familiar
If you are seeing cost jump faster than traffic, or your p95 got better while COGS spiked, you are not alone. This is the kind of problem I fix for teams when prototypes turn into real workloads. Reach out if you want a hard look at your architecture, cost curves, and the few changes that will bend that line back down.

