The real bill usually arrives at p95
I keep seeing the same pattern: a team proves out a feature on an API, gets a scary bill, then someone says “we can run a 7B model for pennies on our own.” Two sprints later, p99 latency is ugly, admins are babysitting GPUs at night, and the infra line item did not go down. The gap between spreadsheet math and production reality is where most self‑hosted LLM efforts get stuck.
This post is the cost breakdown I use with CTOs before they buy eight H100s or lock into an API contract they will regret.
Where this breaks and why
- Where it shows up
  - Internal copilots and search that cross 50–200M tokens/day
  - Customer‑facing chat where p95 matters and traffic is spiky
  - Regulated environments where data residency is non‑negotiable
- Why it happens in real systems
  - Utilization is the whole game. GPUs are cheap per token only when busy with large, continuous batches. Most apps don’t have that traffic shape.
  - Context length quietly dominates memory and cost. KV cache grows with sequence length and active sessions, not just model size.
  - Model churn and safety layers add overhead you didn’t budget: evals, red‑teaming, jailbreak filters, prompt rewriting, routing.
- What teams misunderstand
  - Single‑stream tokens/sec is not throughput. Continuous batching determines your real cost per 1M tokens.
  - “A100 on sale” does not mean “A100 economics.” NVLink, locality, spot eviction, and orchestration risk matter.
  - The price you compare to is not the API sticker. It is your blended rate after retries, guardrails, eval, logging, and latency SLOs.
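To make “blended rate” concrete, here is a toy calculator. Every overhead percentage below is an illustrative assumption, not a vendor number; you would replace them with rates measured from your own gateway logs.

```python
# Toy blended-rate calculator for API usage. Each factor inflates the
# tokens you actually pay for beyond the sticker price. The percentages
# passed in below are illustrative assumptions, not vendor numbers.

def blended_rate_per_1m(sticker_per_1m: float, retry_rate: float,
                        guardrail_overhead: float, eval_overhead: float) -> float:
    # Overheads compound: retried requests also pass through guardrails and eval.
    return sticker_per_1m * (1 + retry_rate) * (1 + guardrail_overhead) * (1 + eval_overhead)

# $2.00/1M sticker, 5% retries, 10% guardrail tokens, 5% eval/logging tokens:
print(blended_rate_per_1m(2.00, 0.05, 0.10, 0.05))  # ~21% above sticker
```

Even modest overheads compound into a rate well above the sticker, which is the number the self‑host comparison has to beat.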
Technical deep dive: how the numbers actually move
The architecture differences that move cost
- APIs
  - You pay per token. Hidden costs are retries, latency headroom, logging/eval pipelines, network egress. Upgrades and model swaps are free.
- Self‑host
  - You pay for provisioned capacity. The stack most teams end up with: gateway + tokenizer + vLLM/TGI + sharding/tensor parallel + autoscaler + metrics + safety/rules + retries + storage for traces. Cost rises with context length, concurrency, and uptime targets.
Throughput math that decides your fate
A simple but reliable calculator:
- Cost per 1M tokens = GPU_hourly_cost / (effective_tokens_per_second × 3600) × 1,000,000
- Effective TPS is measured after batching, safety passes, and retries.
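The same calculator in a few lines of Python, as a sketch; the hourly rate and effective TPS are inputs you measure, not constants:

```python
def cost_per_1m_tokens(gpu_hourly_usd: float, effective_tps: float) -> float:
    # effective_tps: throughput after batching, safety passes, and retries.
    tokens_per_hour = effective_tps * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# e.g. a $1.50/hour GPU sustaining 1500 effective tok/s:
print(round(cost_per_1m_tokens(1.5, 1500), 2))  # 0.28
```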
Some ballpark ranges I’ve actually seen in production with vLLM and sane quantization:
- 7–8B class on L40S 48 GB at $1.2–$1.8/hour
  - Effective 1.2k–2.0k tok/s when traffic allows continuous batching
  - Cost per 1M tokens: ~$0.20–$0.50 before overhead
- 7–8B class on A100 80 GB at $3–$4/hour
  - Effective 1.8k–3.0k tok/s
  - Cost per 1M tokens: ~$0.25–$0.75 before overhead
- 70B class on 8×H100 80 GB with NVLink at $80–$120/hour for the node
  - Effective 0.9k–1.6k tok/s depending on context and batching
  - Cost per 1M tokens: ~$14–$37 before overhead
Notes
- If your traffic is spiky and single‑stream dominated, cut those TPS numbers in half (or worse). Cost per 1M will double accordingly.
- If you can’t pack batches continuously, you won’t see the “cheap” numbers on Twitter screenshots.
Context length is a silent budget killer
KV cache size grows with layers, heads, and your active sequence length. A back‑of‑envelope estimate per token:
- KV bytes per token ≈ 2 × num_layers × num_kv_heads × head_dim × bytes_per_element (the 2 covers K and V; for classic multi‑head attention num_kv_heads equals num_heads, while GQA models shrink it)
- Example for a Llama‑class 7B (32 layers, 32 heads, head_dim 128, fp16 at 2 bytes):
  - 2 × 32 × 32 × 128 × 2 = 524,288 bytes per token ≈ 0.5 MB/token
  - 8k tokens of active context per session ≈ 4 GB of KV cache per stream
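The same back‑of‑envelope math as a small helper; the dimensions below are the Llama‑7B‑class assumptions from this section, and for a GQA model you would pass the smaller KV head count:

```python
# Back-of-envelope KV cache sizing. The factor of 2 covers the K and V
# tensors. For GQA models, pass the (smaller) number of KV heads.

def kv_bytes_per_token(num_layers: int, num_heads: int,
                       head_dim: int, bytes_per_element: int = 2) -> int:
    return 2 * num_layers * num_heads * head_dim * bytes_per_element

# Llama-class 7B: 32 layers, 32 heads, head_dim 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(32, 32, 128, 2)
print(per_token)                  # 524288 bytes, about 0.5 MB/token
print(per_token * 8192 / 2**30)   # 4.0 GiB for one 8k-token stream
```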
Things to internalize
- Long contexts or many simultaneous sessions will eat an 80 GB card quickly, even for small models.
- Quantization helps weights more than KV cache; even FP8 or FP16 KV stays large at long contexts.
Batching vs latency
- Continuous batching is mandatory for good economics. vLLM’s scheduler is excellent, but you pay for it with added queuing delay.
- Rule of thumb I’ve used: grow the batch until p95 latency hits your SLO, then stop. The cost curve flattens in that region, so pushing further buys little and risks the SLO.
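That rule of thumb can be sketched as a sweep. The latency model here is a made‑up toy (a fixed base plus a per‑request queuing term); in practice you would replay measured p95s from load tests instead:

```python
# Toy sweep: find the largest batch size whose simulated p95 stays under
# the SLO. p95_latency_ms is an illustrative stand-in for measured data.

def p95_latency_ms(batch_size: int) -> float:
    base_ms = 120.0       # fixed per-request cost (assumption)
    per_item_ms = 35.0    # added queuing/compute per batched request (assumption)
    return base_ms + per_item_ms * batch_size

def max_batch_under_slo(slo_ms: float, max_batch: int = 64) -> int:
    best = 1
    for b in range(1, max_batch + 1):
        if p95_latency_ms(b) > slo_ms:
            break
        best = b
    return best

print(max_batch_under_slo(800.0))  # 19: last batch size under an 800 ms SLO
```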
Failure modes that create surprise bills
- Spot preemption and resharding events that leave NVLink peers flapping
- OOMs from KV growth after feature flags turn on “helpful” context expansion
- Retry storms on gateway timeouts that double tokens with nothing to show
- Model upgrade day: brief quality win, week‑long prompt retuning and regressions
- Underutilized night hours because nobody wired burst traffic to fill batches
Practical ways to make the right call
When self‑hosting makes sense
- You can keep GPUs at 60–80 percent utilization with continuous batching
- Your use case tolerates slightly higher p95 to gain lower unit cost
- You can win with a smaller model fine‑tuned on your domain
- Data residency or vendor risk forces you off multi‑tenant APIs
If that’s you, aim for this stack:
- Serving: vLLM with AWQ or FP8 where quality holds, pinned to known‑good builds
- Hardware: L40S 48 GB for 7–13B; A100 80 GB if you need more headroom; only use H100 clusters when 70B quality is non‑negotiable and volume is high
- Topology: prefer single‑node for 7–13B to avoid tensor parallel overhead; NVLink for anything 70B+
- Autoscaling: tokens‑in‑flight as the primary signal, not QPS
- Observability: per‑prompt cost, TPS, batch depth, KV memory, and p95 by route
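For the autoscaling point, a minimal sketch of a tokens‑in‑flight signal; the per‑replica token budget and headroom factor are assumptions you would calibrate against your own serving stack:

```python
import math

# Scale on tokens in flight, not QPS: each replica has a sustainable
# token budget, and headroom protects p95 during bursts.
# All numbers here are illustrative.

def desired_replicas(tokens_in_flight: int, tokens_per_replica: int,
                     headroom: float = 0.2, min_replicas: int = 1) -> int:
    target = tokens_in_flight * (1 + headroom) / tokens_per_replica
    return max(min_replicas, math.ceil(target))

print(desired_replicas(180_000, tokens_per_replica=50_000))  # 5
```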
When APIs are the sane choice
- Spiky or low volume traffic where you can’t keep GPUs busy
- You need frontier quality right now, and model refresh cadence matters
- Multi‑tenant safety and eval pipelines you don’t want to run yourself
- You have hard p95 constraints and global routing
Design it like you own it anyway
- Put a proxy in front with request dedupe, budget caps, and per‑feature cost attribution
- Cache system prompts and static preambles
- Use a simple router: small model for easy cases, escalate to frontier only on uncertainty or eval score
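The router can start embarrassingly simple. The confidence score here is a placeholder (in practice it might come from logprobs, a verifier pass, or an eval score), and the model names are hypothetical:

```python
# Minimal router sketch: easy cases stay local, uncertain ones escalate.
# local_confidence, the threshold, and the model names are placeholders.

def route(local_confidence: float, threshold: float = 0.7) -> str:
    if local_confidence >= threshold:
        return "local-8b"
    return "frontier-api"

print(route(0.85))  # local-8b
print(route(0.40))  # frontier-api
```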
Hybrid that actually works
- Default to a local 8–13B fine‑tuned model for 60–80 percent of traffic
- Escalate to an API model on uncertainty or when context exceeds local capacity
- Log escalations, re‑label, and periodically fine‑tune to reduce the escalation rate
- This pattern often cuts blended costs 30–60 percent without sacrificing quality
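The blended‑cost arithmetic behind that claim, with illustrative rates:

```python
# Blended cost per 1M tokens for the hybrid pattern: the local model
# handles most traffic, escalations go to an API. Rates are illustrative.

def blended_cost_per_1m(local_rate: float, api_rate: float,
                        escalation_rate: float) -> float:
    return local_rate * (1 - escalation_rate) + api_rate * escalation_rate

# $0.36/1M local (with overhead), $8/1M API, 25% escalation:
print(round(blended_cost_per_1m(0.36, 8.0, 0.25), 2))  # 2.27
```

Lowering the escalation rate with periodic fine‑tuning moves the blend toward the local rate, which is where the savings come from.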
What this means in dollars
Here are realistic scenarios I walk through with teams. All are compute only, then I add 20–40 percent for software, storage, ops, and safety.
- 8B on L40S, steady internal copilot, 1.5k tok/s effective, $1.5/hour
  - Cost per 1M tokens ≈ $1.5 / (1500 × 3600 / 1e6) ≈ $0.28
  - Add 30 percent overhead → ~$0.36 per 1M
  - If a comparable API is $0.5–$2 per 1M, self‑host wins as long as you maintain utilization
- 8B on A100 80 GB, spiky chat, 700 tok/s effective, $3.5/hour
  - Cost per 1M ≈ $3.5 / (700 × 3600 / 1e6) ≈ $1.39
  - Add 30 percent → ~$1.80 per 1M
  - You likely lose to an API unless you batch aggressively or cache a lot
- 70B on 8×H100, moderate volume, 1.2k tok/s effective, $96/hour
  - Cost per 1M ≈ $96 / (1200 × 3600 / 1e6) ≈ $22
  - Add 30 percent → ~$29 per 1M
  - Most API pricing for frontier‑class models will be hard to beat here unless your volume is huge and steady or you can negotiate cheaper GPUs
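All three scenarios fall out of the same formula; a quick script to reproduce them, using the numbers above:

```python
# Reproduces the three scenarios: raw compute cost, then +30% overhead.

def cost_per_1m(gpu_hourly_usd: float, effective_tps: float) -> float:
    return gpu_hourly_usd / (effective_tps * 3600) * 1_000_000

scenarios = [
    ("8B on L40S, steady copilot", 1.5, 1500),
    ("8B on A100, spiky chat",     3.5,  700),
    ("70B on 8xH100",             96.0, 1200),
]
for name, hourly, tps in scenarios:
    raw = cost_per_1m(hourly, tps)
    print(f"{name}: ${raw:.2f} raw, ${raw * 1.3:.2f} per 1M with 30% overhead")
```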
Hidden costs most spreadsheets miss
- Engineering time: 1–2 FTE to keep a self‑hosted stack healthy at scale
- Eval runs: tokens you burn to keep quality stable after model updates
- Safety: classification passes add 5–20 percent token overhead depending on flow
- Capacity insurance: idle headroom to protect p95
Trade‑offs you actually feel
- Quality velocity: API vendors ship stronger models faster than you can safely re‑platform
- Control: self‑hosting gives you knobs for determinism, routing, and data control
- Latency: local small models can beat APIs on tail latency, but only if you shape traffic and avoid tensor parallel for everything
- Risk: APIs centralize model failures; self‑hosting creates more ways to shoot yourself in the foot
Key takeaways
- Your true unit cost is utilization × batching × context management. Miss any one and the math breaks.
- 7–13B models can be cheaper than APIs if you keep GPUs hot and accept some p95 trade‑off.
- 70B+ models are rarely cheaper to self‑host unless you have sustained volume and discounted H100s with NVLink.
- Add 20–40 percent on top of raw compute for ops, safety, eval, and storage. That is real money.
- A hybrid router that escalates tough cases to APIs is often the best first step.
If you need a sanity check
If you want a second set of eyes on a cost model, a router design, or whether your throughput targets are even reachable on your hardware, I’m happy to look. This is exactly the kind of thing I help teams fix when systems start breaking at scale.

