The real cost breakdown of running LLM apps on AWS

The part of your LLM bill you do not see in the demo

The first time most teams see their real LLM bill is not a happy day. The token spend looks fine on paper, then production traffic hits, latency SLOs force you to turn off batching, CloudWatch logs balloon, and someone remembers the NAT gateway. The bill doubles without any feature work.

I have walked into multiple accounts where the model cost was only half the total. The rest was plumbing, idling GPUs, and cross-AZ transfer. If you plan the economics up front, you can avoid most of it.

Where the cost surprises show up and why

  • Low but spiky traffic. Managed API seems expensive per token, so a team self-hosts a big model. The GPU idles 90 percent of the time between spikes, and cost per request climbs above the API price.
  • Long prompts and generous defaults. An 8k context looks safe, someone bumps it to 32k, KV cache memory explodes, instances scale out, latency worsens so you turn off batching, and cost per token rises.
  • RAG pipelines with chatty networks. ECS tasks call Bedrock over NAT, embed to a managed vector DB in another AZ, and stream tokens back to a client through API Gateway. You pay multiple times for data transfer and NAT.
  • Observability and compliance. Token-level traces and payload logging are great until CloudWatch ingestion is your third-largest line item. KMS on every write makes it worse.
  • Throttling and retries. Bedrock or your inference fleet throttles at peak. Retries stack up, doubling tokens and compute with no added value.

What most teams misunderstand: the cheapest-looking component is not the cheapest system. Throughput, batching, and network locality dominate at scale. For small and bursty workloads, the opposite is true: idle capacity is the tax.

Deep dive: the actual cost surface of LLM on AWS

Think in systems, not services. You are paying for:

  1. Inference path
    — Managed model APIs: Bedrock Anthropic/Meta/Amazon, pay per token in/out. No GPU ops burden. Hard quotas and potential throttling. Provisioned throughput is an option for consistent RPS but you are now renting capacity by the hour, not token.
    — Self-hosted: SageMaker or EKS/ECS with vLLM/TGI/TensorRT-LLM on g5/p4d/p5. You pay for instances, EBS, AMIs, ECR storage and traffic, and the team to keep it up. Your effective price per 1k tokens is instance-hour cost divided by tokens served per hour, plus the penalty you eat when latency SLOs force you to drop batching.
  2.  Retrieval and context wrangling
    — Embeddings: Bedrock Titan/Cohere/others, priced per input token. Or run your own embedding model on CPU/GPU. Dimensionality matters for both cost and latency.
    — Vector store: OpenSearch Serverless, Aurora pgvector, DynamoDB + FAISS, or self-managed. You pay for storage, read/write capacity, and often cross-AZ transfer.
    — Pre/post-processing: Lambda/ECS/EKS CPU, serialization, compression, streaming. These add up under load.
  3.  Network and security
    — NAT gateways for outbound internet or public endpoints. PrivateLink interface endpoints for Bedrock, OpenSearch, and STS (plus the free S3 gateway endpoint) take NAT out of the hot path and reduce egress, though interface endpoints carry their own hourly and per-GB charges.
    — Cross-AZ data transfer between your ALB, ECS, vector DB, and inference. Putting your fleet and store in different AZs is a silent tax.
    — Client egress if you stream to the public internet.
  4.  Observability and control plane
    — CloudWatch logs and metrics, X-Ray traces, Prometheus/AMP. Token-level logs and prompt payloads are big.
    — KMS per-request charges if you encrypt everything, which many regulated teams do.
  5.  Failure modes that multiply cost
    — Long context + small GPUs: OOM, restarts, warmup time. You lose batchability and throughput.
    — Spot interruptions during load: recomputation and retries. Without durable KV cache you pay twice for the same tokens.
    — Overzealous retries on 429/5xx: doubled token spend. Add backoff and budgets.
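
The "backoff and budgets" fix is small enough to sketch. A minimal example; `TransientError`, the per-call token estimate, and the budget numbers are stand-ins for whatever your client actually raises and measures:

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a 429/5xx from Bedrock or your inference fleet."""


def retry_with_budget(call_model, prompt, *, max_attempts=3,
                      token_budget=4000, est_tokens=600, base_delay=0.5):
    """Retry with full-jitter exponential backoff, capped by a cumulative
    token budget so retries cannot silently double your spend."""
    spent = 0
    for attempt in range(max_attempts):
        if spent + est_tokens > token_budget:
            raise RuntimeError(f"token budget exhausted after {spent} tokens")
        spent += est_tokens
        try:
            return call_model(prompt)
        except TransientError:
            # full jitter: sleep anywhere in [0, base_delay * 2^attempt)
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise RuntimeError(f"all {max_attempts} attempts failed")
```

The point is the guard before the call: a retry that would blow the budget gets dropped, not sent.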

The math you should actually run

Run a simple worksheet before you pick a path. Inputs:

  • Per request: in_tokens, out_tokens
  • Traffic: avg_RPS, p95_RPS
  • Latency target: p95_latency_ms
  • Batching efficiency you can realistically hold at p95: batch_size_efficiency
  • Model throughput on chosen hardware: tokens_per_second_per_instance at your target quality settings (precision, KV quantization, attention kernel)

Then:

  1. Tokens per request = in_tokens + out_tokens
  2. Tokens per second at p95 = p95_RPS * Tokens per request
  3. Required instances (self-hosted) ≈ (tokens per second at p95) / (tokens_per_second_per_instance * batch_size_efficiency)
  4. Effective price per 1k tokens (self-hosted) ≈ (instance_hour_cost * instances) / (tokens per second at p95 * 3600 / 1000)

Compare that to the managed per-token price plus network and plumbing.

A note from practice: teams overestimate batch_size_efficiency. If your p95 is tight and prompts are variable length, expect effective batch size of 2 to 4 in production, not 8 to 16.
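
The worksheet drops straight into code. A sketch using the same names; every price and throughput here is a placeholder, not a benchmark:

```python
import math


def self_host_worksheet(in_tokens, out_tokens, p95_rps,
                        tokens_per_sec_per_instance, batch_size_efficiency,
                        instance_hour_cost):
    """Required instances and effective $/1k tokens for a self-hosted
    fleet sized to p95 traffic, per the worksheet above."""
    tokens_per_request = in_tokens + out_tokens
    tps_p95 = p95_rps * tokens_per_request
    instances = math.ceil(
        tps_p95 / (tokens_per_sec_per_instance * batch_size_efficiency))
    k_tokens_per_hour = tps_p95 * 3600 / 1000
    price_per_1k = instance_hour_cost * instances / k_tokens_per_hour
    return instances, price_per_1k
```

With placeholder numbers (2k in, 500 out, 40 RPS at p95, 2,000 tokens per second per instance, effective batch of 4, $20 per instance-hour) this gives 13 instances at roughly $0.00072 per 1k tokens, which is the figure to compare against the managed per-token price plus plumbing.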

Architecture choices, trade-offs, and failure modes

Managed API only
Pros: fastest to ship, zero GPU ops, elastic, strong model options. Good for low to medium traffic with spikes.
Cons: per-token price floor, upstream quotas, hard to guarantee latency at peak without provisioned throughput. Vendor updates can change behavior.

Self-hosted on SageMaker
Pros: mature managed ML infra, autoscaling endpoints, model registry, VPC-native. Good if you live in SageMaker already.
Cons: cost floor per endpoint, slow cold starts for large models, limited inference server customization without extra work.

Self-hosted on EKS with vLLM/TensorRT-LLM
Pros: highest control, best batching, custom kernels, mixed precision, KV cache tricks, multi-model hosting. Great at steady high throughput.
Cons: you own everything. Node autoscale, pre-warming, GPU bin-packing, health probes that do not kill batching.

Hybrid router
Route easy queries to a small model or cached answer. Send hard queries to a large model or external API.
Real failure: routers that cost more than they save because they run big embeddings or a mid-size model on every request.

Common production footguns

  • KV cache eviction that drops throughput at unpredictable times.
  • Context bloat from RAG that adds 2k tokens of low-signal stuff to every request.
  • Cross-AZ calls between your inference nodes and vector DB.
  • Bedrock throttling at peak with naive client retries.
  • CloudWatch log payloads of entire prompts. Nice for debugging, expensive at scale.

Practical ways to cut cost without wrecking latency

Capacity and placement

  • PrivateLink everywhere you can: Bedrock, S3, OpenSearch. Remove NAT from the hot path.
  • Keep inference and vector DB in the same AZ set. Do not scatter by accident with different subnets.
  • For idle or bursty traffic, lean on managed APIs or Bedrock provisioned throughput only during business hours. Turn it off when not needed.

Model-level levers

  • Right-size context. Every extra 1k prompt tokens hurts twice: more model compute and higher KV memory, which kills batchability.
  • Use output caps aggressively. max_new_tokens is a budget, not a suggestion.
  • Quantize where it does not change quality for your task. INT8 KV cache and FP8 weights are sensible defaults for many Llama-class models.
  • Turn on speculative decoding or MTP where supported. You trade a little extra compute for lower latency and higher effective throughput.
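
The "hurts twice" claim is easy to put numbers on: KV cache per token is 2 (K and V) × layers × kv_heads × head_dim × bytes per element. A sketch whose defaults assume a Llama-70B-like GQA configuration; treat all numbers as illustrative:

```python
def kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    """KV cache one token occupies: K and V, across all layers.
    Defaults assume a Llama-70B-like GQA config at fp16."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_concurrent_sequences(context_tokens, kv_budget_gib, **cfg):
    """How many full-context sequences fit in a given KV memory budget."""
    per_sequence = context_tokens * kv_cache_bytes_per_token(**cfg)
    return int(kv_budget_gib * 2**30 // per_sequence)
```

Under these assumptions fp16 KV costs 320 KiB per token, so a 40 GiB KV budget holds 16 concurrent sequences at 8k context but only 4 at 32k; an INT8 KV cache (dtype_bytes=1) brings 32k back to 8. That is the batchability loss long contexts buy you.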

RAG and data

  • Shrink embedding dimensions if you can. 384 dimensions often work fine for business Q&A, and the smaller vectors cut storage, network, and latency.
  • Deduplicate and summarize snippets before stuffing the context. Compact citations beat raw paragraphs.
  • Cache by instruction + canonicalized context. For support and doc Q&A you will see heavy repetition.
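
The instruction + canonicalized context cache can key on a stable hash. A sketch; the normalization rules (lowercase, collapse whitespace, sort and dedupe snippets) are assumptions to tune for your prompts:

```python
import hashlib
import json


def canonical_cache_key(instruction, context_snippets):
    """Stable cache key from a normalized instruction plus sorted,
    deduplicated context snippets, so snippet order and stray
    whitespace don't bust the cache."""
    norm_instruction = " ".join(instruction.lower().split())
    norm_snippets = sorted({" ".join(s.split()) for s in context_snippets})
    payload = json.dumps({"i": norm_instruction, "c": norm_snippets},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```

For support and doc Q&A, keys like this are what surface the heavy repetition the bullet above mentions.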

Middleware and guardrails

  • Enforce token budgets per endpoint and per tenant. Reject or compress before inference.
  • Retry with budgets. If a retry would exceed the original token budget, drop or degrade response.
  • Sample logs. Keep redacted exemplars and traces, not every byte of every prompt.
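
Budgets-before-inference can live in a thin middleware class. A sketch; the caps and the convention of returning `None` for "reject or compress upstream" are assumptions, not an API:

```python
class TokenBudget:
    """Per-tenant budget enforced before a request reaches inference.
    None means reject (or hand off to prompt compression); an int is
    the max_new_tokens to enforce downstream."""

    def __init__(self, tenant_budget=100_000, max_prompt=4096, max_output=512):
        self.remaining = tenant_budget
        self.max_prompt = max_prompt
        self.max_output = max_output

    def admit(self, prompt_tokens, requested_output):
        if prompt_tokens > self.max_prompt:
            return None  # prompt over the endpoint cap: reject or compress
        output = min(requested_output, self.max_output)
        cost = prompt_tokens + output
        if cost > self.remaining:
            return None  # tenant budget exhausted: reject or degrade
        self.remaining -= cost
        return output  # pass downstream as max_new_tokens
```

The key property is that the gate runs before the model is ever invoked, so a bad prompt costs you a rejection, not tokens.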

Autoscaling and batching

  • Pre-warm a small pool sized for p95. Scale out slowly to protect batching. Scale in even slower to avoid thrash.
  • Prefer bigger GPUs at moderate scale if it lets you keep batching consistent. Smaller nodes that lose batching are often more expensive per token.

Quick scenario math to anchor decisions

Below are illustrative only. Use your own prices and benchmarks.

Scenario A: 3 RPS average, spikes to 60 RPS, 1.5k in, 400 out

– Managed API: You pay per token during spikes but nothing while idle. No GPU idling. Likely cheapest and simplest.
– Self-host on g5: GPUs idle 95 percent of the day. You either accept high cost per request or get clever with scheduled scale-to-zero and slow cold starts. Usually not worth it.

Scenario B: 40 steady RPS, 2k in, 500 out, strict 1s p95

– Self-host with vLLM on p4d or p5 if you can hold batch size of 4 to 8. Effective price per 1k tokens beats managed APIs, and you keep latency by using speculative decoding.
– If your prompts vary wildly and batching collapses to 1 or 2, managed API or provisioned throughput might be cheaper and easier.

Scenario C: Internal RAG with 80 percent cache hits

– Add a response cache with strict invalidation, keep the model small for misses. Your model spend drops by half or more. Watch out for the cache stampede at deploy and pre-warm hot keys.

Business impact you can actually forecast

  • Cost
    • The biggest levers: batching consistency, context length, and removing NAT/cross-AZ traffic. These regularly swing total cost by 30 to 60 percent.
    • Self-host only pays off when you have steady load and can hold batch size. Otherwise, the idle GPU tax dominates.
  • Performance
    • Every token of prompt hurts latency. Shrinking prompt size often buys more latency relief than any kernel tweak.
    • Streaming improves perceived latency, which gives you more room to batch.
  • Scaling risk
    • Quotas and throttling are real. Have fallbacks and pre-approval for limit increases long before a launch.
    • Spot is great for batch jobs, risky for real-time inference unless you overprovision and keep warm spares.

Key takeaways

  • Price the system, not the model. Network and idle capacity are real money.
  • If your traffic is bursty, prefer managed APIs or short-lived provisioned capacity.
  • If your traffic is steady, self-host only if you can keep batch size up and context small.
  • Put inference and retrieval in the same AZ and use PrivateLink to kill NAT.
  • Enforce token budgets. Long prompts silently destroy throughput and margins.
  • Sample logs and encrypt smartly. Observability can be your third largest line item.

If you want help

If you are staring at a growing LLM bill or fighting p95 while trying to keep batching, this is the kind of work I do. I help teams pick the right architecture, tune throughput, and remove the hidden AWS taxes before things break at scale.