The real bill usually arrives at p95
I keep seeing the same pattern: a team proves out a feature on an API, gets a scary bill, then someone says “we can run a 7B model for pennies on our own.” Two sprints later, p99 latency is ugly, admins are babysitting GPUs at night, and the infra line item did not go down. The gap between spreadsheet math and production reality is where most self‑hosted LLM efforts get stuck.
This post is the cost breakdown I use with CTOs before they buy eight H100s or lock into an API contract they will regret.
Where this breaks and why
- Where it shows up
  - Internal copilots and search that cross 50–200M tokens/day
  - Customer‑facing chat where p95 matters and traffic is spiky
  - Regulated environments where data residency is non‑negotiable
- Why it happens in real systems
  - Utilization is the whole game. GPUs are cheap per token only when busy with large, continuous batches. Most apps don’t have that traffic shape.
  - Context length quietly dominates memory and cost. KV cache grows with sequence length and active sessions, not just model size.
  - Model churn and safety layers add overhead you didn’t budget: evals, red‑teaming, jailbreak filters, prompt rewriting, routing.
- What teams misunderstand
  - Single‑stream tokens/sec is not throughput. Continuous batching determines your real cost per 1M tokens.
  - “A100 on sale” does not mean “A100 economics.” NVLink, locality, spot eviction, and orchestration risk matter.
  - The price you compare to is not the API sticker. It is your blended rate after retries, guardrails, eval, logging, and latency SLOs.
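To make “blended rate” concrete, here is a toy calculator. Every overhead percentage below is an illustrative assumption, not a vendor number; you would replace them with rates measured from your own gateway logs.

```python
# Toy blended-rate calculator for API usage. Each factor inflates the
# tokens you actually pay for beyond the sticker price. The percentages
# passed in below are illustrative assumptions, not vendor numbers.

def blended_rate_per_1m(sticker_per_1m: float, retry_rate: float,
                        guardrail_overhead: float, eval_overhead: float) -> float:
    # Overheads compound: retried requests also pass through guardrails and eval.
    return sticker_per_1m * (1 + retry_rate) * (1 + guardrail_overhead) * (1 + eval_overhead)

# $2.00/1M sticker, 5% retries, 10% guardrail tokens, 5% eval/logging tokens:
print(blended_rate_per_1m(2.00, 0.05, 0.10, 0.05))  # ~21% above sticker
```

Even modest overheads compound into a rate well above the sticker, which is the number the self‑host comparison has to beat.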
Technical deep dive: how the numbers actually move
The architecture differences that move cost
- APIs
  - You pay per token. Hidden costs are retries, latency headroom, logging/eval pipelines, network egress. Upgrades and model swaps are free.
- Self‑host
  - You pay for provisioned capacity. The stack most teams end up with: gateway + tokenizer + vLLM/TGI + sharding/tensor parallel + autoscaler + metrics + safety/rules + retries + storage for traces. Cost rises with context length, concurrency, and uptime targets.
Throughput math that decides your fate
A simple but reliable calculator:
- Cost per 1M tokens = GPU_hourly_cost / (effective_tokens_per_second × 3600) × 1,000,000
- Effective TPS is measured after batching, safety passes, and retries.
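The same calculator in a few lines of Python, as a sketch; the hourly rate and effective TPS are inputs you measure, not constants:

```python
def cost_per_1m_tokens(gpu_hourly_usd: float, effective_tps: float) -> float:
    # effective_tps: throughput after batching, safety passes, and retries.
    tokens_per_hour = effective_tps * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# e.g. a $1.50/hour GPU sustaining 1500 effective tok/s:
print(round(cost_per_1m_tokens(1.5, 1500), 2))  # 0.28
```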
Some ballpark ranges I’ve actually seen in production with vLLM and sane quantization:
- 7–8B class on L40S 48 GB at $1.2–$1.8/hour
  - Effective 1.2k–2.0k tok/s when traffic allows continuous batching
  - Cost per 1M tokens: ~$0.20–$0.50 before overhead
- 7–8B class on A100 80 GB at $3–$4/hour
  - Effective 1.8k–3.0k tok/s
  - Cost per 1M tokens: ~$0.25–$0.75 before overhead
- 70B class on 8×H100 80 GB with NVLink at $80–$120/hour for the node
  - Effective 0.9k–1.6k tok/s depending on context and batching
  - Cost per 1M tokens: ~$14–$37 before overhead
Notes
- If your traffic is spiky and single‑stream dominated, cut those TPS numbers in half (or worse). Cost per 1M will double accordingly.
- If you can’t pack batches continuously, you won’t see the “cheap” numbers on Twitter screenshots.
Context length is a silent budget killer
KV cache size grows with layers, heads, and your active sequence length. A back‑of‑envelope estimate per token:
- KV bytes per token ≈ 2 × num_layers × num_kv_heads × head_dim × bytes_per_element (the 2 covers K and V; for classic multi‑head attention num_kv_heads equals num_heads, while GQA models shrink it)
- Example for a Llama‑class 7B (32 layers, 32 heads, head_dim 128, fp16 at 2 bytes):
  - 2 × 32 × 32 × 128 × 2 = 524,288 bytes per token ≈ 0.5 MB/token
  - 8k tokens of active context per session ≈ 4 GB of KV cache per stream
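The same back‑of‑envelope math as a small helper; the dimensions below are the Llama‑7B‑class assumptions from this section, and for a GQA model you would pass the smaller KV head count:

```python
# Back-of-envelope KV cache sizing. The factor of 2 covers the K and V
# tensors. For GQA models, pass the (smaller) number of KV heads.

def kv_bytes_per_token(num_layers: int, num_heads: int,
                       head_dim: int, bytes_per_element: int = 2) -> int:
    return 2 * num_layers * num_heads * head_dim * bytes_per_element

# Llama-class 7B: 32 layers, 32 heads, head_dim 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(32, 32, 128, 2)
print(per_token)                  # 524288 bytes, about 0.5 MB/token
print(per_token * 8192 / 2**30)   # 4.0 GiB for one 8k-token stream
```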
Things to internalize
- Long contexts or many simultaneous sessions will eat an 80 GB card quickly, even for small models.
- Quantization helps weights more than KV cache; even FP8 or FP16 KV stays large at long contexts.
Batching vs latency
- Continuous batching is mandatory for good economics. vLLM’s scheduler is excellent, but you pay for it with added queuing delay.
- Rule of thumb I’ve used: grow the batch until p95 latency hits your SLO, then stop. The cost curve flattens in that region, so pushing further buys little and risks the SLO.
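That rule of thumb can be sketched as a sweep. The latency model here is a made‑up toy (a fixed base plus a per‑request queuing term); in practice you would replay measured p95s from load tests instead:

```python
# Toy sweep: find the largest batch size whose simulated p95 stays under
# the SLO. p95_latency_ms is an illustrative stand-in for measured data.

def p95_latency_ms(batch_size: int) -> float:
    base_ms = 120.0       # fixed per-request cost (assumption)
    per_item_ms = 35.0    # added queuing/compute per batched request (assumption)
    return base_ms + per_item_ms * batch_size

def max_batch_under_slo(slo_ms: float, max_batch: int = 64) -> int:
    best = 1
    for b in range(1, max_batch + 1):
        if p95_latency_ms(b) > slo_ms:
            break
        best = b
    return best

print(max_batch_under_slo(800.0))  # 19: last batch size under an 800 ms SLO
```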
Failure modes that create surprise bills
- Spot preemption and resharding events that leave NVLink peers flapping
- OOMs from KV growth after feature flags turn on “helpful” context expansion
- Retry storms on gateway timeouts that double tokens with nothing to show
- Model upgrade day: brief quality win, week‑long prompt retuning and regressions
- Underutilized night hours because nobody wired burst traffic to fill batches
Practical ways to make the right call
When self‑hosting makes sense
- You can keep GPUs at 60–80 percent utilization with continuous batching
- Your use case tolerates slightly higher p95 to gain lower unit cost
- You can win with a smaller model fine‑tuned on your domain
- Data residency or vendor risk forces you off multi‑tenant APIs
If that’s you, aim for this stack:
- Serving: vLLM with AWQ or FP8 where quality holds, pinned to known‑good builds
- Hardware: L40S 48 GB for 7–13B; A100 80 GB if you need more headroom; only use H100 clusters when 70B quality is non‑negotiable and volume is high
- Topology: prefer single‑node for 7–13B to avoid tensor parallel overhead; NVLink for anything 70B+
- Autoscaling: tokens‑in‑flight as the primary signal, not QPS
- Observability: per‑prompt cost, TPS, batch depth, KV memory, and p95 by route
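For the autoscaling point, a minimal sketch of a tokens‑in‑flight signal; the per‑replica token budget and headroom factor are assumptions you would calibrate against your own serving stack:

```python
import math

# Scale on tokens in flight, not QPS: each replica has a sustainable
# token budget, and headroom protects p95 during bursts.
# All numbers here are illustrative.

def desired_replicas(tokens_in_flight: int, tokens_per_replica: int,
                     headroom: float = 0.2, min_replicas: int = 1) -> int:
    target = tokens_in_flight * (1 + headroom) / tokens_per_replica
    return max(min_replicas, math.ceil(target))

print(desired_replicas(180_000, tokens_per_replica=50_000))  # 5
```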
When APIs are the sane choice
- Spiky or low volume traffic where you can’t keep GPUs busy
- You need frontier quality right now, and model refresh cadence matters
- Multi‑tenant safety and eval pipelines you don’t want to run yourself
- You have hard p95 constraints and global routing
Design it like you own it anyway
- Put a proxy in front with request dedupe, budget caps, and per‑feature cost attribution
- Cache system prompts and static preambles
- Use a simple router: small model for easy cases, escalate to frontier only on uncertainty or eval score
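The router can start embarrassingly simple. The confidence score here is a placeholder (in practice it might come from logprobs, a verifier pass, or an eval score), and the model names are hypothetical:

```python
# Minimal router sketch: easy cases stay local, uncertain ones escalate.
# local_confidence, the threshold, and the model names are placeholders.

def route(local_confidence: float, threshold: float = 0.7) -> str:
    if local_confidence >= threshold:
        return "local-8b"
    return "frontier-api"

print(route(0.85))  # local-8b
print(route(0.40))  # frontier-api
```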
Hybrid that actually works
- Default to a local 8–13B fine‑tuned model for 60–80 percent of traffic
- Escalate to an API model on uncertainty or when context exceeds local capacity
- Log escalations, re‑label, and periodically fine‑tune to reduce the escalation rate
- This pattern often cuts blended costs 30–60 percent without sacrificing quality
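The blended‑cost arithmetic behind that claim, with illustrative rates:

```python
# Blended cost per 1M tokens for the hybrid pattern: the local model
# handles most traffic, escalations go to an API. Rates are illustrative.

def blended_cost_per_1m(local_rate: float, api_rate: float,
                        escalation_rate: float) -> float:
    return local_rate * (1 - escalation_rate) + api_rate * escalation_rate

# $0.36/1M local (with overhead), $8/1M API, 25% escalation:
print(round(blended_cost_per_1m(0.36, 8.0, 0.25), 2))  # 2.27
```

Lowering the escalation rate with periodic fine‑tuning moves the blend toward the local rate, which is where the savings come from.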
What this means in dollars
Here are realistic scenarios I walk through with teams. All are compute only, then I add 20–40 percent for software, storage, ops, and safety.
- 8B on L40S, steady internal copilot, 1.5k tok/s effective, $1.5/hour
  - Cost per 1M tokens ≈ $1.5 / (1500 × 3600 / 1e6) ≈ $0.28
  - Add 30 percent overhead → ~$0.36 per 1M
  - If a comparable API is $0.5–$2 per 1M, self‑host wins as long as you maintain utilization
- 8B on A100 80 GB, spiky chat, 700 tok/s effective, $3.5/hour
  - Cost per 1M ≈ $3.5 / (700 × 3600 / 1e6) ≈ $1.39
  - Add 30 percent → ~$1.80 per 1M
  - You likely lose to an API unless you batch aggressively or cache a lot
- 70B on 8×H100, moderate volume, 1.2k tok/s effective, $96/hour
  - Cost per 1M ≈ $96 / (1200 × 3600 / 1e6) ≈ $22
  - Add 30 percent → ~$29 per 1M
  - Most API pricing for frontier‑class models will be hard to beat here unless your volume is huge and steady or you can negotiate cheaper GPUs
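All three scenarios fall out of the same formula; a quick script to reproduce them, using the numbers above:

```python
# Reproduces the three scenarios: raw compute cost, then +30% overhead.

def cost_per_1m(gpu_hourly_usd: float, effective_tps: float) -> float:
    return gpu_hourly_usd / (effective_tps * 3600) * 1_000_000

scenarios = [
    ("8B on L40S, steady copilot", 1.5, 1500),
    ("8B on A100, spiky chat",     3.5,  700),
    ("70B on 8xH100",             96.0, 1200),
]
for name, hourly, tps in scenarios:
    raw = cost_per_1m(hourly, tps)
    print(f"{name}: ${raw:.2f} raw, ${raw * 1.3:.2f} per 1M with 30% overhead")
```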
Hidden costs most spreadsheets miss
- Engineering time: 1–2 FTE to keep a self‑hosted stack healthy at scale
- Eval runs: tokens you burn to keep quality stable after model updates
- Safety: classification passes add 5–20 percent token overhead depending on flow
- Capacity insurance: idle headroom to protect p95
Trade‑offs you actually feel
- Quality velocity: API vendors ship stronger models faster than you can safely re‑platform
- Control: self‑hosting gives you knobs for determinism, routing, and data control
- Latency: local small models can beat APIs on tail latency, but only if you shape traffic and avoid tensor parallel for everything
- Risk: APIs centralize model failures; self‑hosting creates more ways to shoot yourself in the foot
Key takeaways
- Your true unit cost is utilization × batching × context management. Miss any one and the math breaks.
- 7–13B models can be cheaper than APIs if you keep GPUs hot and accept some p95 trade‑off.
- 70B+ models are rarely cheaper to self‑host unless you have sustained volume and discounted H100s with NVLink.
- Add 20–40 percent on top of raw compute for ops, safety, eval, and storage. That is real money.
- A hybrid router that escalates tough cases to APIs is often the best first step.
If you need a sanity check
If you want a second set of eyes on a cost model, a router design, or whether your throughput targets are even reachable on your hardware, I’m happy to look. This is exactly the kind of thing I help teams fix when systems start breaking at scale.

