{"id":83,"date":"2025-08-14T10:22:57","date_gmt":"2025-08-14T10:22:57","guid":{"rendered":"https:\/\/angirash.in\/blog\/2025\/08\/14\/true-cost-self-hosting-llms-vs-apis\/"},"modified":"2026-04-10T19:29:21","modified_gmt":"2026-04-10T19:29:21","slug":"true-cost-self-hosting-llms-vs-apis","status":"publish","type":"post","link":"https:\/\/angirash.in\/blog\/2025\/08\/14\/true-cost-self-hosting-llms-vs-apis\/","title":{"rendered":"The true cost of self\u2011hosting LLMs vs using APIs"},"content":{"rendered":"<h2>The real bill usually arrives at p95<\/h2>\n<p>I keep seeing the same pattern: a team proves out a feature on an API, gets a scary bill, then someone says \u201cwe can run a 7B model for pennies on our own.\u201d Two sprints later, p99 latency is ugly, admins are babysitting GPUs at night, and the infra line item did not go down. The gap between spreadsheet math and production reality is where most self\u2011hosted LLM efforts get stuck.<\/p>\n<p>This post is the cost breakdown I use with CTOs before they buy eight H100s or lock into an API contract they will regret.<\/p>\n<h2>Where this breaks and why<\/h2>\n<ul>\n<li>Where it shows up\n<ul>\n<li>Internal copilots and search that cross 50\u2013200M tokens\/day<\/li>\n<li>Customer\u2011facing chat where p95 matters and traffic is spiky<\/li>\n<li>Regulated environments where data residency is non\u2011negotiable<\/li>\n<\/ul>\n<\/li>\n<li>Why it happens in real systems\n<ul>\n<li>Utilization is the whole game. GPUs are cheap per token only when busy with large, continuous batches. Most apps don\u2019t have that traffic shape.<\/li>\n<li>Context length quietly dominates memory and cost. 
<a href=\"https:\/\/angirash.in\/blog\/2025\/02\/18\/why-scaling-ai-systems-increases-cost-nonlinearly\/\">KV cache grows with sequence length and active sessions<\/a>, not just model size.<\/li>\n<li><a href=\"https:\/\/angirash.in\/blog\/2025\/03\/03\/where-your-ai-budget-quietly-leaks\/\">Model churn and safety layers add overhead<\/a> you didn\u2019t budget: evals, red\u2011teaming, jailbreak filters, prompt rewriting, routing.<\/li>\n<\/ul>\n<\/li>\n<li>What teams misunderstand\n<ul>\n<li>Single\u2011stream tokens\/sec is not throughput. Continuous batching determines your real cost per 1M tokens.<\/li>\n<li>\u201cA100 on sale\u201d does not mean \u201cA100 economics.\u201d NVLink, locality, spot eviction, and orchestration risk matter.<\/li>\n<li>The price you compare to is not the API sticker. It is your blended rate after retries, guardrails, eval, logging, and latency SLOs.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>Technical deep dive: how the numbers actually move<\/h2>\n<h3>The architecture differences that move cost<\/h3>\n<ul>\n<li>APIs\n<ul>\n<li>You pay per token. Hidden costs are retries, latency headroom, logging\/eval pipelines, network egress. Upgrades and model swaps are free.<\/li>\n<\/ul>\n<\/li>\n<li>Self\u2011host\n<ul>\n<li>You pay for provisioned capacity. The stack most teams end up with: gateway + tokenizer + vLLM\/TGI + sharding\/tensor parallel + autoscaler + metrics + safety\/rules + retries + storage for traces. 
Cost rises with context length, concurrency, and uptime targets.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Throughput math that decides your fate<\/h3>\n<p>A simple but reliable calculator:<br \/>\n&#8211; Cost per 1M tokens = GPU_hourly_cost \/ (effective_tokens_per_second \u00d7 3600) \u00d7 1,000,000<br \/>\n&#8211; Effective TPS is after batching, safety passes, and retries.<\/p>\n<p>Some ballpark ranges I\u2019ve actually seen in production with vLLM and sane quantization:<br \/>\n&#8211; 7\u20138B class on L40S 48 GB at $1.2\u2013$1.8\/hour<br \/>\n  &#8211; Effective 1.2k\u20132.0k tok\/s when traffic allows continuous batching<br \/>\n  &#8211; Cost per 1M tokens: ~$0.20\u2013$0.50 before overhead<br \/>\n&#8211; 7\u20138B class on A100 80 GB at $3\u2013$4\/hour<br \/>\n  &#8211; Effective 1.8k\u20133.0k tok\/s<br \/>\n  &#8211; Cost per 1M tokens: ~$0.25\u2013$0.75 before overhead<br \/>\n&#8211; 70B class on 8\u00d7H100 80 GB with NVLink at $80\u2013$120\/hour for the node<br \/>\n  &#8211; Effective 0.9k\u20131.6k tok\/s depending on context and batching<br \/>\n  &#8211; Cost per 1M tokens: ~$14\u2013$37 before overhead<\/p>\n<p>Notes<br \/>\n&#8211; If your traffic is spiky and single\u2011stream dominated, cut those TPS numbers in half (or worse). Cost per 1M will double accordingly.<br \/>\n&#8211; If you can\u2019t pack batches continuously, you won\u2019t see the \u201ccheap\u201d numbers on Twitter screenshots.<\/p>\n<h3>Context length is a silent budget killer<\/h3>\n<p>KV cache size grows with layers, heads, and your active sequence length. 
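<\/p>\n<p>Both back\u2011of\u2011envelope formulas in this section (cost per 1M tokens above, KV bytes per token below) fit in a few lines. A quick sketch using the illustrative numbers from this post, not benchmarks:<\/p>

```python
# Back-of-envelope LLM serving math, using this post's illustrative numbers.

def cost_per_1m_tokens(gpu_hourly_usd, effective_tps):
    # Cost per 1M tokens = GPU_hourly_cost / (effective_TPS * 3600) * 1e6
    return gpu_hourly_usd / (effective_tps * 3600) * 1_000_000

def kv_bytes_per_token(num_layers, num_heads, head_dim, bytes_per_elem=2):
    # K and V caches per token: 2 * layers * heads * head_dim * bytes (fp16 = 2)
    return 2 * num_layers * num_heads * head_dim * bytes_per_elem

# 8B-class on an L40S at $1.5/hour, 1500 effective tok/s
print(round(cost_per_1m_tokens(1.5, 1500), 2))   # 0.28 dollars per 1M tokens

# Llama-class 7B: 32 layers, 32 heads, head_dim 128, fp16
per_tok = kv_bytes_per_token(32, 32, 128)
print(per_tok)                                   # 524288 bytes, ~0.5 MB/token
print(round(8000 * per_tok / 2**30, 1))          # 3.9 GiB for one 8k-token stream
```

<p>None of this is precise; it exists to catch order\u2011of\u2011magnitude mistakes before anyone orders hardware.<\/p>\n<p>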
A back\u2011of\u2011envelope estimate per token:<br \/>\n&#8211; KV bytes per token \u2248 2 \u00d7 num_layers \u00d7 num_heads \u00d7 head_dim \u00d7 bytes_per_element<br \/>\n&#8211; Example for a Llama\u2011class 7B (32 layers, 32 heads, head_dim 128, fp16 2 bytes):<br \/>\n  &#8211; 2 \u00d7 32 \u00d7 32 \u00d7 128 \u00d7 2 \u2248 524,288 bytes per token \u2248 0.5 MB\/token<br \/>\n  &#8211; 8k tokens of active context per session \u2248 ~4 GB of KV cache per stream<\/p>\n<p>Things to internalize<br \/>\n&#8211; <a href=\"https:\/\/angirash.in\/blog\/2025\/03\/19\/token-costs-what-actually-moves-the-needle-in-production\/\">Long contexts or many simultaneous sessions<\/a> will eat an 80 GB card quickly even for small models.<br \/>\n&#8211; Quantization helps weights, not KV as much. FP8\/FP16 KV is still big.<\/p>\n<h3>Batching vs latency<\/h3>\n<ul>\n<li><a href=\"https:\/\/angirash.in\/blog\/2025\/07\/18\/streaming-vs-batching-llm-systems\/\">Continuous batching is mandatory<\/a> for good economics. vLLM\u2019s scheduler is excellent, but <a href=\"https:\/\/angirash.in\/blog\/2025\/08\/12\/why-your-llm-response-time-is-inconsistent\/\">you pay with added queuing<\/a>.<\/li>\n<li>Rule of thumb I\u2019ve used: <a href=\"https:\/\/angirash.in\/blog\/2025\/04\/21\/llm-latency-in-production-what-actually-works\/\">maximize batch until p95 latency hits your SLO<\/a>, then stop. 
Cost per 1M tokens is usually convex in that region: the first increases in batch depth capture most of the savings, and the last increments buy little.<\/li>\n<\/ul>\n<h3>Failure modes that create surprise bills<\/h3>\n<ul>\n<li>Spot preemption that forces resharding and flaps NVLink peers<\/li>\n<li>OOMs from KV growth after feature flags turn on \u201chelpful\u201d context expansion<\/li>\n<li>Retry storms on gateway timeouts that double tokens with nothing to show<\/li>\n<li>Model upgrade day: brief quality win, week\u2011long prompt retuning and regressions<\/li>\n<li>Underutilized night hours because nobody wired burst traffic to fill batches<\/li>\n<\/ul>\n<h2>Practical ways to make the right call<\/h2>\n<h3>When self\u2011hosting makes sense<\/h3>\n<ul>\n<li>You can keep GPUs at 60\u201380 percent utilization with continuous batching<\/li>\n<li>Your use case tolerates slightly higher p95 to gain lower unit cost<\/li>\n<li>You can win with a smaller model fine\u2011tuned on your domain<\/li>\n<li>Data residency or vendor risk forces you off multi\u2011tenant APIs<\/li>\n<\/ul>\n<p>If that\u2019s you, aim for this stack:<br \/>\n&#8211; Serving: vLLM with AWQ or FP8 where quality holds, pinned to known\u2011good builds<br \/>\n&#8211; Hardware: <a href=\"https:\/\/angirash.in\/blog\/2025\/07\/18\/gpu-vs-cpu-ai-cost-performance-tradeoffs\/\">L40S 48 GB for 7\u201313B; A100 80 GB<\/a> if you need more headroom; only use H100 clusters when 70B quality is non\u2011negotiable and volume is high<br \/>\n&#8211; Topology: prefer single\u2011node for 7\u201313B to avoid tensor parallel overhead; NVLink for anything 70B+<br \/>\n&#8211; Autoscaling: tokens\u2011in\u2011flight as the primary signal, not QPS<br \/>\n&#8211; Observability: per\u2011prompt cost, TPS, batch depth, KV memory, and p95 by route<\/p>\n<h3>When APIs are the sane choice<\/h3>\n<ul>\n<li>Spiky or low\u2011volume traffic where you can\u2019t keep GPUs busy<\/li>\n<li>You need frontier quality right now, and model refresh cadence matters<\/li>\n<li>Multi\u2011tenant safety and eval pipelines you 
don\u2019t want to run yourself<\/li>\n<li>You have hard p95 constraints and global routing<\/li>\n<\/ul>\n<p>Design it like you own it anyway<br \/>\n&#8211; Put a proxy in front with request dedupe, budget caps, and per\u2011feature cost attribution<br \/>\n&#8211; Cache system prompts and static preambles<br \/>\n&#8211; Use a simple router: small model for easy cases, escalate to frontier only on uncertainty or eval score<\/p>\n<h3>Hybrid that actually works<\/h3>\n<ul>\n<li>Default to a local 8\u201313B fine\u2011tuned model for 60\u201380 percent of traffic<\/li>\n<li>Escalate to an API model on uncertainty or when context exceeds local capacity<\/li>\n<li>Log escalations, re\u2011label, and periodically fine\u2011tune to reduce the escalation rate<\/li>\n<li>This pattern often cuts blended costs 30\u201360 percent without sacrificing quality<\/li>\n<\/ul>\n<h2>What this means in dollars<\/h2>\n<p>Here are realistic scenarios I walk through with teams. All are compute only, then I add 20\u201340 percent for software, storage, ops, and safety.<\/p>\n<ul>\n<li>8B on L40S, steady internal copilot, 1.5k tok\/s effective, $1.5\/hour\n<ul>\n<li>Cost per 1M tokens \u2248 $1.5 \/ (1500\u00d73600\/1e6) \u2248 $0.28<\/li>\n<li>Add 30 percent overhead \u2192 ~$0.36 per 1M<\/li>\n<li>If a comparable API is $0.5\u2013$2 per 1M, self\u2011host wins as long as you maintain utilization<\/li>\n<\/ul>\n<\/li>\n<li>8B on A100 80 GB, spiky chat, 700 tok\/s effective, $3.5\/hour\n<ul>\n<li>Cost per 1M \u2248 $3.5 \/ (700\u00d73600\/1e6) \u2248 $1.39<\/li>\n<li>Add 30 percent \u2192 ~$1.80 per 1M<\/li>\n<li>You likely lose to an API unless you batch aggressively or cache a lot<\/li>\n<\/ul>\n<\/li>\n<li>70B on 8\u00d7H100, moderate volume, 1.2k tok\/s effective, $96\/hour\n<ul>\n<li>Cost per 1M \u2248 $96 \/ (1200\u00d73600\/1e6) \u2248 $22<\/li>\n<li>Add 30 percent \u2192 ~$29 per 1M<\/li>\n<li>Most API pricing for frontier\u2011class models will be hard to beat here unless your 
volume is huge and steady or you can negotiate cheaper GPUs<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Hidden costs most spreadsheets miss<br \/>\n&#8211; Engineering time: 1\u20132 FTE to keep a self\u2011hosted stack healthy at scale<br \/>\n&#8211; Eval runs: tokens you burn to keep quality stable after model updates<br \/>\n&#8211; Safety: classification passes add 5\u201320 percent token overhead depending on flow<br \/>\n&#8211; Capacity insurance: idle headroom to protect p95<\/p>\n<h2>Trade\u2011offs you actually feel<\/h2>\n<ul>\n<li>Quality velocity: API vendors ship stronger models faster than you can safely re\u2011platform<\/li>\n<li>Control: self\u2011hosting gives you knobs for determinism, routing, and data control<\/li>\n<li>Latency: local small models can beat APIs on tail latency, but only if you shape traffic and avoid tensor parallel for everything<\/li>\n<li>Risk: APIs centralize model failures; self\u2011hosting creates more ways to shoot yourself in the foot<\/li>\n<\/ul>\n<h2>Key takeaways<\/h2>\n<ul>\n<li>Your true unit cost is utilization \u00d7 batching \u00d7 context management. Miss any one and the math breaks.<\/li>\n<li>7\u201313B models can be cheaper than APIs if you keep GPUs hot and accept some p95 trade\u2011off.<\/li>\n<li>70B+ models are rarely cheaper to self\u2011host unless you have sustained volume and discounted H100s with NVLink.<\/li>\n<li>Add 20\u201340 percent on top of raw compute for ops, safety, eval, and storage. That is real money.<\/li>\n<li>A hybrid router that escalates tough cases to APIs is often the best first step.<\/li>\n<\/ul>\n<h2>If you need a sanity check<\/h2>\n<p>If you want a second set of eyes on a cost model, a router design, or whether your throughput targets are even reachable on your hardware, I\u2019m happy to look. 
This is exactly the kind of thing I help teams fix when systems start breaking at scale.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The real bill usually arrives at p95 I keep seeing the same pattern: a team proves out a feature on an API, gets a scary bill, then someone says \u201cwe&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[7],"tags":[17,20,21],"class_list":["post-83","post","type-post","status-publish","format-standard","hentry","category-ai-cost-optimization","tag-ai-cost","tag-ai-infra","tag-llmops"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/83","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/comments?post=83"}],"version-history":[{"count":3,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/83\/revisions"}],"predecessor-version":[{"id":196,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/83\/revisions\/196"}],"wp:attachment":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/media?parent=83"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/categories?post=83"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/tags?post=83"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}