The painful question I get every quarter
"We are spending a fortune on GPUs. Can we move inference to CPUs and cut cost without blowing up latency?"
I have walked into too many orgs where GPUs sit at 25 to 40 percent utilization while product teams argue about p95. The bill is ugly and nobody is sure what knob to turn first. Some teams swing to CPU autoscaling to save money, only to learn they traded dollars for tail latency and noisy autoscaler behavior.
This post is the reality check I wish more teams had before choosing hardware. Not the marketing slide. The operator’s view.
Where this goes wrong
You see the problem in three places:
- LLM inference with uneven traffic. Utilization never stabilizes, batching is timid, and tokenization eats CPU.
- Embeddings and rerankers mixed with web traffic. Burstiness turns GPU wins into GPU waste.
- Mid-size fine-tunes or adapters. The team tries to squeeze training on general-purpose CPUs and gets a week-long science project.
Why it happens:
- Hourly price gets compared, not tokens-per-dollar. FLOPs on paper get confused with end-to-end throughput once you add tokenization, KV cache, and IO.
- Poor batching and small sequence lengths neutralize GPU parallelism. That 80 GB card is loafing.
- Data movement kills you. PCIe, NUMA, tiny kernels, and cache misses quietly eat latency.
What most teams misunderstand:
- The GPU tax is not a tax if you actually fill the card. It is often the cheapest path per token once you cross a very modest TPS.
- CPUs do not scale linearly for transformer inference under strict latency SLOs. You hit vector width limits, memory bandwidth, and kernel launch overhead.
- Embeddings are not always a slam dunk for GPUs. It depends on batch size, sequence length, and SLO.
Technical deep dive
Think in tokens-per-dollar, not instance-per-hour
Pick a target SLO and compute:
- cost_per_1M_tokens = hourly_cost / (tokens_per_sec * 3600) * 1,000,000. Use generated tokens for LLMs, not prompt tokens.
- Include model load, warmup, and average idle time. Your real denominator is sustained tokens per second, not a microbenchmark peak.
If your GPU is idle 50 percent of the time, your cost per token roughly doubles. That is the lever that matters.
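The arithmetic is worth writing down once. A minimal sketch, with utilization folded into the denominator; the $4/hr price and 1,000 tok/s throughput are illustrative placeholders, not benchmarks:

```python
# Sketch: cost per 1M generated tokens at sustained load.
# Hourly price and throughput below are placeholders, not benchmarks.

def cost_per_1m_tokens(hourly_cost: float,
                       sustained_tps: float,
                       utilization: float = 1.0) -> float:
    """Dollars per 1M generated tokens.

    utilization: fraction of wall-clock time spent doing useful work.
    """
    effective_tps = sustained_tps * utilization
    return hourly_cost / (effective_tps * 3600) * 1_000_000

full = cost_per_1m_tokens(4.0, 1000)        # ~$1.11 per 1M tokens
half = cost_per_1m_tokens(4.0, 1000, 0.5)   # ~$2.22: 50% idle doubles it
```

Plug in your own soak-test numbers; the point is that utilization sits in the denominator, so idle time multiplies cost directly.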
Architecture pressure points
- Batching and sequence length. GPUs win big when batches are >1 and sequence lengths are non-trivial. CPUs fall apart once you need high concurrency with strict p95.
- KV cache and attention kernels. Modern libs like vLLM and TensorRT-LLM fuse kernels and use paged attention. CPUs rarely match this for 7B+ models.
- Tokenization and preprocessing. At low TPS, CPU tokenization dominates. At high TPS, put tokenization near the model host, pin memory, and stop copying payloads through extra services.
- Memory bandwidth, not just FLOPs. LLM decode is often memory bound. HBM on a GPU dwarfs CPU DRAM bandwidth. This is why 7B+ decode tends to favor GPUs.
- Data movement tax. PCIe hops, cross-socket NUMA, and chatty microservices quietly add 10 to 30 ms. Simple, co-located model servers beat cute service graphs.
Failure modes I see too often
- Under-batching to protect latency. You end up paying GPU rates for CPU-like throughput.
- Single giant GPU for mixed workloads. Head-of-line blocking punishes small requests. Use MIG or split pools by profile.
- Over-quantizing without accuracy gates. INT4 looks cheap until your product metrics drop.
- Wrong CPU family. Running transformer inference on instances without AVX512 or AMX is basically setting money on fire.
- PCIe saturation with cheap NICs. You save on instance type then lose to network jitter and tail latency.
When GPUs win, when CPUs win
Here is the practical view I give teams. Your numbers will vary, but the patterns hold.
GPUs generally win for
- LLM inference at 7B and up, sustained TPS above ~1 per replica, context length above ~1k tokens.
- Any workload that benefits from batching: batch size >1 most of the time.
- Embeddings at scale if you can batch, or if you must hold p95 under 100 ms with variable payload sizes.
- Training or fine-tuning anything non-trivial. You will hit a wall on CPUs.
CPUs can win for
- Bursty, low-QPS endpoints with strict cold start budgets where you cannot keep a GPU hot. Think cron-style, sporadic scoring.
- Tiny or distilled models under ~1B params, short sequences, batch size 1, moderate SLOs. Good candidates: rule-heavy classifiers, light rerankers.
- Constrained environments. Edge inference where a GPU is not feasible.
- Vector search. Often CPU-first unless you need massive throughput on high-dim dense vectors, then GPU-accelerated search helps.
My rule of thumb: I rarely recommend CPUs for 7B+ LLM inference beyond toy traffic unless you are truly under 0.5 TPS per model replica and cannot batch.
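That rule of thumb is mechanical enough to write down. The thresholds below mirror the prose; they are heuristics, not guarantees:

```python
# Toy encoding of the rule of thumb above. Thresholds are heuristics.

def cpu_viable_for_llm(params_b: float, tps_per_replica: float,
                       can_batch: bool) -> bool:
    """True if CPU inference is even worth testing for this LLM workload."""
    if params_b >= 7:
        # 7B+ only makes sense on CPU at truly sparse, unbatchable traffic.
        return tps_per_replica < 0.5 and not can_batch
    # Smaller models: judge against the CPU-win criteria listed above.
    return True

cpu_viable_for_llm(7, 0.3, can_batch=False)   # True: sparse, unbatchable
cpu_viable_for_llm(13, 2.0, can_batch=True)   # False: batchable 13B traffic
```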
Practical solutions and designs
1) Start with a cost curve, not a hunch
- Run a 2 to 4 hour soak test at realistic traffic. Measure sustained tokens per second and p95 end-to-end, not just model server latency.
- Profile with and without dynamic batching. Sweep batch sizes and max wait times. Latency budgets are usually looser than engineers think.
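When summarizing a soak run, compute sustained throughput and p95 from the raw per-request records rather than trusting the model server's own histograms. A minimal helper, assuming records shaped as (tokens_generated, end_to_end_latency_ms); adapt the shape to whatever your load generator emits:

```python
import math

# Summarize a soak run: sustained tokens/sec and nearest-rank p95 latency.
# The (tokens_generated, latency_ms) record shape is an assumption.

def summarize_soak(records, wall_seconds):
    total_tokens = sum(tokens for tokens, _ in records)
    latencies = sorted(ms for _, ms in records)
    idx = max(0, math.ceil(0.95 * len(latencies)) - 1)  # nearest-rank p95
    return total_tokens / wall_seconds, latencies[idx]

# 100 requests, 10 tokens each, latencies 1..100 ms, over a 10 s window.
records = [(10, ms) for ms in range(1, 101)]
tps, p95 = summarize_soak(records, 10)  # tps=100.0, p95=95
```

Feed it the whole soak, not a cherry-picked warm window, or the "sustained" number lies to you.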
2) Pick the right silicon for the job
- LLM decode heavy: H100 or A100 if you need serious throughput or long context. L4 or A10 for cost-sensitive 7B inference. T4 is dated unless your SLOs are kind.
- Embeddings: L4 often gives the best tokens-per-dollar unless you are extremely spiky, then CPUs might win. Try AMX-enabled Intel or latest EPYC with BF16 for small batches.
- CPUs: Choose AVX512 or AMX capable nodes for transformers. Otherwise, do not bother.
3) Partition and schedule correctly
- Separate pools: interactive LLMs, batch embeddings, and sporadic jobs should not share cards. Use MIG on A100/H100 or smaller GPUs for isolation.
- Batch aggressively on non-interactive work. Push batch sizes high with offline queues and target throughput per dollar, not p95.
4) Cut the data movement tax
- Co-locate tokenization and model runtime. Pin memory. Avoid extra network hops.
- Use paged KV cache and larger prefill batches. Most of your speedups come from better memory behavior, not new hardware.
5) Quantize with guardrails
- Start with FP8 or INT8 using proven kernels. Gate INT4 behind quality checks that matter for your product, not just perplexity.
- Keep a shadow canary on higher precision to catch regressions.
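A quality gate can be as simple as an agreement check between the quantized candidate and a higher-precision reference on a fixed eval set. The exact-match metric and the 0.98 floor below are illustrative assumptions; swap in whatever product metric actually gates your releases:

```python
# Sketch of a quantization quality gate. The exact-match metric and the
# 0.98 floor are illustrative assumptions; use your real product metric.

def passes_quality_gate(reference_outputs, candidate_outputs,
                        floor: float = 0.98) -> bool:
    assert len(reference_outputs) == len(candidate_outputs)
    agree = sum(r == c for r, c in zip(reference_outputs, candidate_outputs))
    return agree / len(reference_outputs) >= floor

ref = ["yes"] * 99 + ["no"]
ok  = ["yes"] * 99 + ["maybe"]     # 99% agreement -> passes a 0.98 floor
bad = ["yes"] * 96 + ["maybe"] * 4 # 96% agreement -> blocked
```

Run the gate on every quantization change, not just the first rollout; kernel and library upgrades can shift numerics too.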
6) Buy the utilization, not the GPU
- If you cannot keep a GPU hot, use smaller GPUs or serverless GPU inference for spiky endpoints.
- Consider spot capacity for batch workloads with checkpointing. Not for interactive LLMs unless you have seamless failover.
Cost and scaling impact you can actually plan for
- Utilization is the multiplier. 30 percent idle time increases your real cost per token by 43 percent. If you are idling at 60 percent, you are paying 2.5x.
- Batching moves cost curves the most. Going from batch 1 to batch 8 can yield 2 to 5x throughput improvements on the same card with a small p95 trade.
- Wrong instance family is a hidden 1.3 to 2x tax. AMX or AVX512 for CPU inference, HBM-rich GPUs for long-context decode.
- Tail latency blows up concurrency planning. A busy GPU with poorly tuned batching will produce timeouts that look like capacity issues. Fix batching before buying more silicon.
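The utilization numbers in the first bullet fall straight out of the multiplier 1 / (1 - idle_fraction):

```python
# Real cost multiplier from idle time: you pay for the whole hour, but
# only the busy fraction produces tokens.

def cost_multiplier(idle_fraction: float) -> float:
    return 1 / (1 - idle_fraction)

cost_multiplier(0.30)  # ~1.43x at 30% idle
cost_multiplier(0.60)  # 2.5x at 60% idle
```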
Quick decision cheatsheet
- You have an LLM >7B with p95 under 200 ms and TPS above 1 per replica: use GPUs with dynamic batching and vLLM or TensorRT-LLM.
- You have embeddings with traffic spikes but low average QPS: test CPUs with AMX and a small GPU pool for hot hours. Autoscale the GPU pool, not the CPUs.
- You have cron-style backfills: spot GPUs or CPUs depending on batchability. If you can batch to 64+, GPUs usually win.
- You have a strict budget and a quality floor: start with INT8. Do not jump to INT4 without a product metric gate.
Key takeaways
- Measure cost per 1M generated tokens at sustained load, not instance price per hour.
- GPUs are cheaper than they look once you batch and keep them hot.
- CPUs are a niche win for tiny models, batch 1, or bursty low QPS where keeping a GPU hot is unrealistic.
- Most savings come from better scheduling, batching, and data movement, not swapping hardware.
- Separate pools by workload profile. Isolation beats hero nodes.
If you need a hand
If you are staring at a GPU bill with 30 percent utilization and rising p95, you are not alone. I help teams map workloads to the right silicon, tune batching and memory behavior, and cut cost per token without gutting product metrics. This is exactly the kind of thing I fix when systems start breaking at scale.

