Token costs: what actually moves the needle in production

The real problem

If your LLM bill surprised you last month, it probably was not the flashy features. It was the quiet stuff you never show the user: bloated system prompts, oversized retrieval chunks, tool outputs pasted back into the model, and history that keeps getting re-injected on every turn.

On one client system, 72 percent of tokens per request were invisible to the user. They were paying for an over-retrieving embedding search, a fat system message, 10 pages of doc snippets, the previous 8 turns, and a verbose JSON tool response the model never needed. Everyone was trying to shorten the final answer. Wrong target.

Where token waste shows up and why

  • Chat assistants with memory: history grows linearly with each turn if you do not summarize or anchor facts
  • RAG: embeddings over-retrieve, chunking is off, no re-rankers, and you paste full pages
  • Tools: you call a data API and dump 5k tokens of JSON back into the LLM
  • Multi-agent and evaluators: extra passes without a token budget or exit criteria
  • Safety and formatting: legal disclaimers and markdown templates added to every turn

What most teams misunderstand

  • Output is not the main driver. Input often dominates. Output tokens are pricey, but your system prompt plus context often beats the answer length by 2 to 5x
  • Long context models are not free. The convenience tax is real: long-context tiers often charge more per token, and a 200k window invites you to fill it even when 10k would do
  • Chunk size and overlap are not cosmetic. They decide if you pay for 2 pages or 20
  • Caching saves real money only if you keep prompts stable. Random IDs and timestamps kill cache hits

Technical deep dive: what actually drives cost

Think in layers. A single call usually looks like this:

1) Static system content
– Policy text, tone, tooling instructions
– Typical waste: 800 to 3k tokens every turn because no one separates static and dynamic parts or caches them

2) Conversation state
– Full transcript vs compact state
– Typical waste: linear growth per turn when you reattach every message

3) Retrieval context
– Top K chunks, often from page-level retrieval with large overlap
– Typical waste: 3k to 10k tokens because of aggressive K, no re-rank, and pasting full documents

4) Tool results
– JSON blobs, CSV dumps, screenshots turned into captions
– Typical waste: 1k to 6k tokens because you returned full rows and all fields

5) Model output
– Usually shorter than everything above, yet gets all the blame

Trade-offs you should acknowledge

  • Smaller model plus a planner pass can beat a single big model with a huge context, but only if you cap retries and control what goes to each pass
  • Aggressive summarization reduces spend but risks drift. Extractive compression and quoting spans is safer than abstractive paraphrasing for compliance
  • Long-context models simplify engineering but hide retrieval mistakes and cost more per token. They also invite lazy prompting

Failure modes I keep seeing

  • Summaries drift facts after 4 or 5 turns, then your agent confidently acts on fiction
  • Caching configured but invalidated by non-deterministic whitespace, timestamps, or shuffled tool lists
  • Function calling loops where the model keeps asking the same tool with slight parameter variations
  • Over-truncation that cuts citations, triggering hallucinations and higher retries later

Practical fixes that move the bill

1) Put a hard token budget per request
– Example: 2.5k input, 400 output. Split input as 600 system, 600 history summary, 1.0k retrieval, 300 tool results
– If any layer breaches its budget, trim that layer first instead of truncating blindly at the end
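A minimal sketch of per-layer enforcement, assuming the example split above. The 4-chars-per-token divisor is a rough heuristic; swap in the provider's real tokenizer before relying on it.

```python
# Hypothetical per-layer budget enforcement. Token counts are approximated
# as len(text) // 4; replace with the provider's actual tokenizer.
BUDGET = {"system": 600, "history": 600, "retrieval": 1000, "tools": 300}

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def trim_to_budget(layers: dict[str, str]) -> dict[str, str]:
    """Trim each layer to its own budget instead of truncating the
    assembled prompt blindly at the end."""
    trimmed = {}
    for name, text in layers.items():
        limit = BUDGET[name]
        if approx_tokens(text) > limit:
            text = text[: limit * 4]  # cut only the breaching layer
        trimmed[name] = text
    return trimmed
```

The point is that the trim lands on the layer that blew its budget, so a bloated retrieval block cannot eat the history summary's allocation.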

2) Stop paying for fat system prompts every turn
– Split system content into static core and per-request deltas
– Use provider prompt caching where available and keep the static segment byte-identical
– Remove policies that can live in code. Keep only what the model must read
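One way to keep the static segment honest, sketched here with illustrative content: hash the static core and log the hash per request, so any accidental byte-level drift (which silently kills cache hits) shows up immediately.

```python
import hashlib

STATIC_CORE = (
    "You are a support assistant. Follow the tool schema exactly.\n"
    "Cite sources by id. Refuse requests outside the product domain.\n"
)  # keep this segment byte-identical so provider prompt caching can hit

def build_system(delta: str) -> tuple[str, str]:
    """Return (system_prompt, cache_key). The key covers only the static
    core; per-request deltas go after it, never inside it."""
    key = hashlib.sha256(STATIC_CORE.encode()).hexdigest()
    return STATIC_CORE + delta, key
```

If the logged key ever changes between deploys, something edited the static core and your cached prefix is gone.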

3) Fix retrieval before you tune prompts
– Use small chunks with low overlap. 600 to 1,200 tokens per chunk is a good starting band
– Add a cross-encoder re-ranker to cut K down to 3 to 5 chunks, not 10 to 20
– Extractive compression: include only quoted spans + a short header per chunk
– Prefer section-level retrieval over page-level. Pages inflate tokens and dilute relevance
– If you need summaries, compress with a cheaper model and keep links back to original spans
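The re-rank plus extractive-compression step can be sketched as below. The `score` function here is a toy lexical-overlap stand-in for a real cross-encoder; everything else is plain Python.

```python
# Sketch: re-rank candidates, then compress extractively under a token budget.
def score(query: str, chunk: str) -> float:
    # Placeholder for a cross-encoder score: toy lexical overlap.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)

def rerank_and_compress(query, chunks, k=5, budget_tokens=900):
    ranked = sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:k]
    out, used = [], 0
    for ch in ranked:
        cost = len(ch) // 4  # rough estimate; use the real tokenizer
        if used + cost > budget_tokens:
            break
        out.append(f'> "{ch}"')  # quoted span, not a paraphrase
        used += cost
    return "\n".join(out)
```

Quoting spans verbatim instead of paraphrasing is what makes this safe for compliance-sensitive answers.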

4) Stop dumping tool results into the LLM
– Ask tools to return only the fields you need
– Cap rows. If you need to reason about aggregates, compute them in code
– For structured tasks, use function calling to request precise fields, then render user-facing text on the server
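A hypothetical shaping function for tool output, applying all three rules at once: whitelist fields, cap rows, and compute the aggregate in code.

```python
# Keep only the fields the model needs, cap rows, compute aggregates here.
def shape_tool_result(rows, fields=("id", "status"), max_rows=5):
    shaped = [{f: r[f] for f in fields if f in r} for r in rows[:max_rows]]
    return {
        "rows": shaped,
        "total_rows": len(rows),   # aggregate computed in code, not by the LLM
        "truncated": len(rows) > max_rows,
    }
```

The `truncated` flag lets the model know there is more data without paying to show it.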

5) Conversation state that does not grow unbounded
– Maintain a rolling summary of facts + decisions. Keep references to full messages off-model
– Store machine-readable state separately. Only pass diffs or keys, never the entire object each turn
– TTL older turns aggressively unless the user explicitly asks to revisit
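A sketch of bounded conversation state along these lines; the class and method names are illustrative, not a library API.

```python
from collections import deque

class ChatState:
    """Compact state: rolling facts plus a TTL'd window of raw turns."""
    def __init__(self, keep_turns=4):
        self.facts = {}                        # rolling machine-readable state
        self.turns = deque(maxlen=keep_turns)  # only recent raw turns survive

    def update(self, turn_text, fact_diff=None):
        self.turns.append(turn_text)           # oldest turn drops automatically
        if fact_diff:
            self.facts.update(fact_diff)       # pass diffs, never the whole object

    def context(self):
        # Only this compact view goes back to the model each turn.
        facts = "; ".join(f"{k}={v}" for k, v in self.facts.items())
        return f"facts: {facts}\nrecent: " + " | ".join(self.turns)
```

Full transcripts stay in your store, addressable by reference, and only `context()` is ever paid for.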

6) Output control that the model actually follows
– Prefer JSON schema or function calling with enums and short field names over free text
– Set stop sequences to cut standard boilerplate. For example, stop at \n\n if you only want a single paragraph
– Max tokens should be set. Do not let the model free run because you are scared of truncation. If the task needs more, plan multiple short calls
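A generic request-parameter sketch; the field names mirror common provider APIs, but check your SDK's exact spelling before copying.

```python
# Output control: hard max_tokens plus a stop sequence for single-paragraph
# answers. Field names are illustrative of common chat-completion APIs.
def build_request(prompt: str, single_paragraph: bool = True) -> dict:
    params = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 400,         # always set; plan multiple short calls if needed
        "temperature": 0,
    }
    if single_paragraph:
        params["stop"] = ["\n\n"]  # cut boilerplate after the first paragraph
    return params
```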

7) Router patterns that earn their keep
– Use a small model for classification and task routing
– Only escalate to a large model when uncertainty or task type requires it
– Make the router cheap and deterministic. Thresholds should be tuned offline with evals, not in prod by vibes
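A deterministic router can be as simple as the sketch below. The keyword markers and the 0.8 threshold are placeholders you would tune offline against evals, exactly as the bullet says.

```python
# Router sketch: small model by default, escalate on task type or uncertainty.
def route(task_text: str, classifier_confidence: float) -> str:
    hard_markers = ("prove", "multi-step", "reconcile", "legal review")
    if any(m in task_text.lower() for m in hard_markers):
        return "large"                      # task type forces escalation
    if classifier_confidence < 0.8:
        return "large"                      # uncertain -> escalate
    return "small"
```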

8) Make caching real, not theoretical
– Keep cacheable segments stable: same tool order, no timestamps, no random IDs
– Providers now support partial message caching. Segment prompts so that static chunks are cacheable
– Log cache hit rates by path. If you are not seeing 30 to 60 percent hits on static-heavy prompts, you are probably invalidating accidentally
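Minimal per-path cache accounting, sketched below; counters like these are how accidental invalidation gets noticed at all.

```python
from collections import defaultdict

# Per-path counters for cache behavior. `cached_tokens` would come from the
# provider's usage metadata on each response.
stats = defaultdict(lambda: {"calls": 0, "hits": 0, "cached": 0, "input": 0})

def record(path, cached_tokens, input_tokens):
    s = stats[path]
    s["calls"] += 1
    s["input"] += input_tokens
    s["cached"] += cached_tokens
    if cached_tokens > 0:
        s["hits"] += 1

def hit_rate(path) -> float:
    s = stats[path]
    return s["hits"] / s["calls"] if s["calls"] else 0.0
```

Tracking cached tokens as well as hit counts tells you not just whether the cache fires, but how much of the prompt it actually covers.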

9) Tokenization discipline
– Measure real token counts per layer with the exact tokenizer used by the provider
– Strip boilerplate. Collapse whitespace and markdown noise
– Replace verbose labels with short keys. Replace repeated legal text with a single reference line if policy allows
– Avoid base64 or large inline binary-in-text. Pass references and fetch out of band
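The stripping step can be sketched with a few regexes. The chars-per-token divisor below is a rough heuristic; measure with the provider's actual tokenizer (e.g. tiktoken for OpenAI models) before trusting any count.

```python
import re

def strip_noise(text: str) -> str:
    """Collapse whitespace runs and drop heavy markdown decoration."""
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # cap blank lines
    text = re.sub(r"[*_#]{2,}", "", text)   # strip heavy markdown markers
    return text.strip()

def rough_token_count(text: str) -> int:
    # Heuristic only; swap in the real tokenizer for billing-grade numbers.
    return max(1, len(text) // 4)
```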

10) Multimodal is a silent tax
– Images often explode into long captions. Downsample, crop to regions, or OCR to text if you only need fields
– Do not send 10 screenshots when 2 will do. One client cut 70 percent of vision tokens by cropping to the table region they actually needed

11) Retry strategy with budgets
– Cap total tokens per user action across retries
– Retry with a smaller context, not the same bloated one
– Add loop guards for function calling. Detect repetitive tool requests and stop
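A loop guard for function calling can be a small closure like this sketch: key each tool request by name plus canonicalized arguments and refuse repeats past a cap.

```python
import json

def make_loop_guard(max_repeats: int = 2):
    """Return a predicate that rejects near-identical repeat tool calls."""
    seen: dict[str, int] = {}
    def allow(tool_name: str, args: dict) -> bool:
        key = tool_name + ":" + json.dumps(args, sort_keys=True)
        seen[key] = seen.get(key, 0) + 1
        return seen[key] <= max_repeats
    return allow
```

When `allow` returns False, break the loop and either answer with what you have or escalate, instead of burning tokens on the same call again.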

A quick architecture sketch that works

  • Pre-step: Router on small model determines task type
  • Retrieval: BM25 + embeddings recall to 50, cross-encoder re-rank to top 5, extractive compress to 900 tokens total
  • Prompt: Static system core cached, compact state, minimal tools, compressed context
  • Call 1: Big model for reasoning, function call for any API I/O
  • Call 2: Optional verifier on small model to check constraints. If fail, fix with delta-only context
  • Output: Structured JSON to server, server renders user text

This pattern consistently drops input tokens by 40 to 70 percent without quality loss. In some cases, quality improves because the model is not drowning in irrelevant context.
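The steps above reduce to a short orchestration skeleton; every callable here is a placeholder for your own router, retriever, and model clients.

```python
# Pipeline skeleton for the sketch above. All dependencies are injected
# placeholders, not real clients.
def handle(query, router, retrieve, big_model, verifier):
    task = router(query)                        # small model, cheap + deterministic
    context = retrieve(query)                   # recall -> re-rank -> compress
    answer = big_model(query, context, task)    # one reasoning call
    ok, delta = verifier(answer)                # small model checks constraints
    if not ok:
        answer = big_model(query, delta, task)  # fix with delta-only context
    return answer
```

Note the retry carries only the verifier's delta, not the original bloated context.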

What this looks like in numbers

A real case from Q4:

  • Before: 6.8k input, 900 output per turn at $0.010 per 1k input and $0.030 per 1k output. Cost per turn about $0.095
  • After: 2.1k input, 450 output per turn. Cost per turn about $0.035
  • At 80k turns per day that is roughly $4,800 saved per day, and P95 latency improved by 28 percent since streaming fewer tokens is faster

Your numbers will differ by provider, but the pattern holds. The big win is almost always input reduction, not squeezing the final answer.
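The per-turn figures follow directly from the listed per-1k prices; a two-line helper keeps the arithmetic honest (real bills can run higher once retries and extra passes are counted).

```python
def cost_per_turn(in_tokens, out_tokens, in_price_per_1k, out_price_per_1k):
    """Raw per-call cost from token counts and per-1k prices."""
    return in_tokens / 1000 * in_price_per_1k + out_tokens / 1000 * out_price_per_1k

before = cost_per_turn(6800, 900, 0.010, 0.030)  # about $0.095
after = cost_per_turn(2100, 450, 0.010, 0.030)   # about $0.035
daily_savings = (before - after) * 80_000        # about $4,800 per day
```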

Business impact and scaling risks

  • Predictability: Without per-layer budgets, costs scale superlinearly with usage spikes. FinOps will hunt you
  • Latency: Every 1k fewer tokens shaves real time. That compounds across multi-pass pipelines
  • Quality: Long context is not free quality. It raises cost and hides retrieval bugs. A tight context with good re-ranking usually beats a massive paste
  • Risk: Summarization drift can contaminate decisions. If you must summarize, keep references and use extractive compression for critical data

Key takeaways

  • Set a token budget per request and enforce it per layer
  • Cache static system content and keep it byte-identical
  • Fix retrieval first. Re-rank and compress instead of pasting pages
  • Keep tool outputs tiny. Compute in code, not in the LLM
  • Maintain compact conversation state with TTL and diffs
  • Use structured outputs, stop sequences, and strict max_tokens
  • Route to small models by default and escalate only when needed
  • Track cache hit rate, token mix by layer, and retries per request

If this resonates

If your LLM bill is driven by invisible tokens and the fixes above feel doable but not trivial in your stack, that is normal. I help teams design token-aware architectures, set budgets that hold in production, and bring costs down without gutting quality, especially when systems start breaking at scale.