Your RAG is slow because it moves too much data, hops across too many services, and pays LLMs to read junk. It is expensive for the same reasons.
I see the same pattern on audits. p95 over 4 seconds at moderate load, context windows stuffed with 10 to 30 unrelated chunks, and a reranker running on everything because someone set top_k to 50 and called it a day. The sad part is that most of this is fixable without changing vendors.
Where this shows up and why it happens
- User-visible lag: search is instant, your “AI” takes 3 to 7 seconds. People bounce.
- Cloud bill creep: tokens in dwarf tokens out. Retrieval and rerankers add steady overhead.
- Incident noise: timeouts under concurrency, cold starts, and vector DB p95 spikes.
Why it happens in real systems:
- Over-fetching: top_k set high to mask poor recall. You push 10k to 40k tokens per request into the LLM.
- Misconfigured vector search: ef_search cranked up, IVF params at defaults, no prefiltering. Latency and RAM spike.
- Reranking in the hot path for every query. A cross-encoder on k=50 is 50 forward passes. On CPU that is rough.
- Cross-region calls: app in one region, vector DB in another, model in a third. 70 to 150 ms per hop adds up.
- Chunking by character length with no structure. You split mid-sentence, lose semantics, then compensate by widening k.
- No budgets or gates. Every request takes the most expensive path because there is no cheap path.
What teams misunderstand:
- Recall is not a function of k alone. Garbage chunking and weak filters cannot be fixed by top_k.
- Input tokens are the cost center. Rerankers and vector search are noise compared to the LLM reading 20k tokens.
- Vector DBs are not magic. Index type, filters, and shard layout decide p95, not the brand name.
- Multi-tenant everything looks clean until one noisy tenant blows cache locality and index search falls off a cliff.
Technical deep dive
Request path anatomy
A typical RAG request quietly does 7 to 12 things:
- Normalize and embed the query
- Vector search with metadata filters
- Optional keyword or BM25 search and merge
- Rerank candidates with a cross-encoder
- Pack and de-duplicate context
- Call the LLM
- Optional tool calls or follow-up retrieval
- Postprocess and stream
Every hop is a network call or model call. Serial dependencies punish latency. If any step is cross-region or cold, p95 explodes.
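Some of those hops are independent and can overlap. A sketch of running the two first-stage retrievers concurrently with asyncio (the retriever bodies are placeholders for real network calls):

```python
import asyncio

async def vector_search(q):
    await asyncio.sleep(0.05)   # stand-in for an ANN call
    return ["d1", "d2"]

async def bm25_search(q):
    await asyncio.sleep(0.04)   # stand-in for a keyword index call
    return ["d2", "d3"]

async def retrieve(q):
    # Dense and sparse retrieval do not depend on each other: run them
    # concurrently so the hop cost is max(), not sum().
    dense, sparse = await asyncio.gather(vector_search(q), bm25_search(q))
    return list(dict.fromkeys(dense + sparse))  # union, order-preserving dedupe

print(asyncio.run(retrieve("pricing")))  # ['d1', 'd2', 'd3']
```

The serial steps that remain (retrieve, rerank, generate) are the ones worth budgeting hardest, because they cannot be hidden behind each other.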
Vector search trade-offs that actually matter
- HNSW parameters: higher ef_search gives better recall but non-linear latency growth and memory pressure. ef_construction affects index build time and quality. Most teams leave both at defaults.
- IVF or PQ: IVF lowers search time by probing lists. If nprobe is too low, recall dies. If too high, you are doing a near full scan. PQ cuts memory but hurts recall unless you re-rank with exact distances on a small candidate set.
- Dimensionality: 1536-d embeddings are not free. Memory, bandwidth, and index size scale with dims. A good 384-d model is often enough for enterprise content.
- Filters before ANN: real acceleration comes from narrowing the candidate set by tenant, doc type, language, or section. If you filter after vector search, you are wasting cycles on the wrong neighborhood.
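A minimal pure-Python sketch of the filter-first idea, with brute-force cosine search standing in for ANN (the tenant field and the toy corpus are made up):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy corpus: (vector, metadata). In production this is your ANN index
# with metadata filters pushed down, not a Python list.
corpus = [
    ([0.9, 0.1], {"tenant": "acme", "doc": "pricing"}),
    ([0.8, 0.2], {"tenant": "acme", "doc": "faq"}),
    ([0.1, 0.9], {"tenant": "globex", "doc": "pricing"}),
]

def search(query_vec, tenant, k=2):
    # Filter FIRST: distance computation only touches this tenant's vectors.
    candidates = [(vec, meta) for vec, meta in corpus if meta["tenant"] == tenant]
    scored = sorted(candidates, key=lambda c: cosine(query_vec, c[0]), reverse=True)
    return [meta["doc"] for _, meta in scored[:k]]

print(search([1.0, 0.0], "acme"))  # only acme docs are ever scored
```

The same principle holds at scale: verify with your vector DB's query plan that the filter actually prunes the candidate set before the ANN traversal, not after.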
Rerankers
- Cross-encoders fix a lot, but they are not free. On CPU, 50 pairs can add 150 to 400 ms. On GPU you can get it down to tens of ms, but now you are capacity planning GPUs for a reranker.
- A tight top_k combined with a reranker is fine. An over-broad top_k with a reranker is waste. If your reranker runs on 50 because retrieval returns junk, fix retrieval.
Context packing is the silent killer
- Stuffing 20 chunks of 500 tokens each is 10k tokens per request. If your input price is the dominant cost, you just lit your budget on fire.
- Duplicates and near-duplicates sneak in when you split by length without structure or parent-child links.
- Irrelevant boilerplate like headers, nav, and legal disclaimers gets embedded and retrieved. The LLM reads it all.
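A sketch of defensive packing, assuming chunks carry doc IDs and rough token counts (the 3k ceiling and the chunk shape are illustrative):

```python
def pack_context(chunks, token_ceiling=3000):
    """Greedy packer: dedupe by (doc_id, section), then fill to the ceiling.

    chunks: dicts with doc_id, section, tokens, score, assumed sorted by
    retrieval score descending.
    """
    seen = set()
    packed, used = [], 0
    for ch in chunks:
        key = (ch["doc_id"], ch["section"])
        if key in seen:
            continue  # near-duplicate from overlapping windows
        if used + ch["tokens"] > token_ceiling:
            break  # hard budget: stop rather than blow the ceiling
        seen.add(key)
        packed.append(ch)
        used += ch["tokens"]
    return packed, used

chunks = [
    {"doc_id": "d1", "section": "s1", "tokens": 1200, "score": 0.92},
    {"doc_id": "d1", "section": "s1", "tokens": 1200, "score": 0.91},  # dupe
    {"doc_id": "d2", "section": "s3", "tokens": 1500, "score": 0.88},
    {"doc_id": "d3", "section": "s1", "tokens": 900,  "score": 0.85},  # over budget
]
packed, used = pack_context(chunks)
print(len(packed), used)  # 2 chunks, 2700 tokens
```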
Network and placement
- Cross-region trips are 70 to 150 ms each in the best case. Vector DB in region A, app in region B, model in region C means you are sunk at p95.
- Serverless cold starts for embedding or reranking add 100 to 800 ms. Under bursty traffic, you will see sawtooth p95.
Failure modes I keep seeing
- Full scans hiding as vector search because of a missing index on the filter column.
- Multi-tenant index without sharding. One tenant with 40 million vectors makes everyone slow.
- Async ingestion with no backpressure. Index quality flaps under load and recall degrades exactly when you need it.
- Prompt templates that drag entire section titles and citations for each chunk, inflating tokens by 30 to 60 percent.
Practical fixes that move the needle
Put hard budgets in the plan
- Retrieval + rerank budget: 250 to 400 ms p95
- LLM input tokens: 2k to 4k cap per request except on escalated paths
- End-to-end: 1.5 to 3.0 s p95 for interactive UX
Enforce with gates. If the plan would exceed the budget, take the cheaper route: reduce k, skip rerank, or ask a follow-up clarification.
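One way to make the budget a first-class object rather than a comment. The numbers match the targets above; the plan shape and cost estimates are illustrative:

```python
import time

class Budget:
    def __init__(self, total_ms=3000, tokens_in=4000):
        self.deadline = time.monotonic() + total_ms / 1000
        self.tokens_left = tokens_in

    def remaining_ms(self):
        return max(0.0, (self.deadline - time.monotonic()) * 1000)

    def can_afford(self, est_ms=0, est_tokens=0):
        return self.remaining_ms() >= est_ms and self.tokens_left >= est_tokens

    def spend_tokens(self, n):
        self.tokens_left -= n

b = Budget(total_ms=3000, tokens_in=4000)
plan = {"k": 20, "rerank": True}
# Gate: if the expensive plan does not fit, degrade instead of timing out.
if not b.can_afford(est_ms=400, est_tokens=8000):
    plan = {"k": 8, "rerank": False}  # cheaper route, still answers
print(plan)
```

The point is that degradation happens before the request is in flight, not as an exception handler after a timeout.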
Make retrieval precise before making it wide
- Chunk with structure. Sentence windowing with overlap, store parent-child links. Attach titles, headings, and IDs as metadata for filtering and for display, not inside the text.
- Filter first. Tenant, product, language, and section filters should narrow the ANN set. Verify with query plans.
- Hybrid retrieval. Sparse + dense reduces the need for massive k. Use sparse to catch exact terms, dense for semantics, then union and dedupe.
- Dynamic k. Start at k=8 to 12, bump only if confidence or coverage is low. Confidence proxy can be the average similarity or dispersion across hits.
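Dynamic k with a dispersion proxy might look like this; the thresholds are assumptions to tune against your eval set:

```python
def choose_k(similarities, base_k=8, max_k=16,
             mean_floor=0.55, spread_ceiling=0.08):
    """Widen k only when retrieval looks uncertain.

    similarities: scores of the top base_k hits, descending.
    A low mean or a flat spread between best and worst hit suggests
    the answer may sit outside the initial window.
    """
    top = similarities[:base_k]
    mean = sum(top) / len(top)
    spread = top[0] - top[-1]
    if mean < mean_floor or spread < spread_ceiling:
        return max_k  # low confidence: widen once, never unboundedly
    return base_k

print(choose_k([0.82, 0.80, 0.74, 0.71, 0.66, 0.63, 0.60, 0.58]))  # confident: 8
print(choose_k([0.52, 0.51, 0.51, 0.50, 0.50, 0.49, 0.49, 0.48]))  # flat: 16
```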
Rerank surgically
- Gate the reranker on score dispersion. If the top candidate clearly wins on similarity, skip rerank. If the spread is flat, rerank k=20 at most.
- Run rerankers on GPU only if you have volume. Otherwise, pick a lighter cross-encoder and run on CPU with batching.
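A sketch of that gate, assuming similarity scores from the first-stage retriever; the 0.05 threshold is a placeholder to calibrate offline:

```python
def should_rerank(scores, top_n=5, min_gap=0.05):
    """Skip the cross-encoder when the first stage is already decisive.

    scores: retrieval similarities, descending. If the best hit clearly
    separates from the rest of the head, reranking rarely changes order.
    """
    head = scores[:top_n]
    gap = head[0] - (sum(head[1:]) / len(head[1:]))
    return gap < min_gap  # flat head: ambiguous, worth reranking

def rerank_pipeline(scores, candidates, cross_encoder, cap=20):
    if should_rerank(scores):
        return cross_encoder(candidates[:cap])  # never more than cap pairs
    return candidates  # decisive first stage: keep retrieval order

print(should_rerank([0.91, 0.70, 0.68, 0.66, 0.65]))  # decisive: False
print(should_rerank([0.71, 0.70, 0.70, 0.69, 0.69]))  # flat: True
```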
Crush token waste
- Summarize chunks offline. Store 150 to 250 token abstracts with citations. Retrieve abstracts by default, escalate to full text only for tough questions.
- Pack intelligently. Deduplicate by doc ID, merge adjacent chunks, strip boilerplate. Enforce a strict input token ceiling.
- Smaller embeddings. Move from 768 to 384 dims if quality allows. Your index shrinks, caches are more effective, and ANN gets faster.
Collapse network hops
- Co-locate app, vector DB, and LLM endpoint in the same region and preferably the same AZ. This is not optional for p95.
- Keep warm. Long-lived connections for vector DB, connection pools for LLM, no per-request auth token fetches.
Cache like you mean it
- L0: prompt+context response cache with TTL and semantic dedupe of queries.
- L1: retrieval cache on normalized queries to candidate IDs. Invalidate on doc updates by doc ID.
- L2: embedding cache for repeated sentences and titles. It is surprising how often they repeat.
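A minimal L1 retrieval cache with doc-ID invalidation, assuming normalized query strings as keys (a production version adds TTLs and size bounds):

```python
from collections import defaultdict

class RetrievalCache:
    """L1: normalized query -> candidate doc IDs, invalidated per doc."""

    def __init__(self):
        self.by_query = {}                       # query -> list of doc IDs
        self.queries_for_doc = defaultdict(set)  # doc ID -> queries citing it

    def get(self, query):
        return self.by_query.get(query)

    def put(self, query, doc_ids):
        self.by_query[query] = doc_ids
        for d in doc_ids:
            self.queries_for_doc[d].add(query)

    def invalidate_doc(self, doc_id):
        # A doc update only evicts the queries that surfaced it.
        for q in self.queries_for_doc.pop(doc_id, ()):
            self.by_query.pop(q, None)

cache = RetrievalCache()
cache.put("reset password", ["doc_42", "doc_7"])
cache.put("billing cycle", ["doc_9"])
cache.invalidate_doc("doc_42")
print(cache.get("reset password"), cache.get("billing cycle"))  # None ['doc_9']
```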
Choose sane index settings
- HNSW: start with ef_search around 64 to 128 for online, tune based on recall tests. Keep ef_construction high at build time if you can afford it.
- IVF: choose nlist so that each list has 1k to 5k vectors for your scale. Set nprobe to hit 1 to 5 percent of lists. Validate with offline recall.
- Shard per tenant or per major corpus. You want hot tenants to hit hot shards.
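The IVF sizing rule above reduces to simple arithmetic; the targets are the 1k to 5k vectors per list and 1 to 5 percent probe rate from the text, and the defaults here are midpoints, not recommendations:

```python
def ivf_params(n_vectors, target_list_size=3000, probe_frac=0.03):
    """Back-of-envelope IVF sizing: nlist from desired list occupancy,
    nprobe as a fraction of lists. Validate with offline recall tests."""
    nlist = max(1, n_vectors // target_list_size)
    nprobe = max(1, round(nlist * probe_frac))
    return nlist, nprobe

# 10M vectors: ~3.3k lists, probe ~100 of them per query.
print(ivf_params(10_000_000))
```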
Add a cheap path and an expensive path
- Cheap path: k=8, no rerank, abstracts only, small model. Target sub-second.
- Expensive path: k=20, rerank, partial full text, larger model. Only if cheap path confidence is low or user insists.
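The two paths compose into a small router. Here `confidence` is whatever proxy you trust from the cheap path, and `user_escalated` is an explicit UI affordance; both names and the 0.6 threshold are illustrative:

```python
CHEAP = {"k": 8, "rerank": False, "context": "abstracts", "model": "small"}
EXPENSIVE = {"k": 20, "rerank": True, "context": "partial_full_text", "model": "large"}

def route(confidence, user_escalated=False, threshold=0.6):
    """Default to the cheap path; pay for the expensive one only on weak
    confidence or an explicit user request."""
    if user_escalated or confidence < threshold:
        return EXPENSIVE
    return CHEAP

print(route(0.8)["model"])                        # small
print(route(0.4)["model"])                        # large
print(route(0.9, user_escalated=True)["model"])   # large
```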
Observability that actually helps
Track per stage with tracing:
- vecdb_ms, rerank_ms, pack_ms, llm_queue_ms, llm_compute_ms
- k, tokens_in, tokens_out, distinct_docs, duplicates_removed
- Cache hit rates at L0, L1, L2
- Cost per request estimate at logging time
And offline eval sets with coverage and factuality scores so you tune without flying blind.
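The per-stage timings above can be captured with a lightweight tracer around each step; stage names mirror the list, and the sleep is a stand-in for real work:

```python
import time
from contextlib import contextmanager

class RequestTrace:
    def __init__(self):
        self.stages = {}    # stage name -> milliseconds
        self.counters = {}  # k, tokens_in, cache hits, ...

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # record even when the stage raises, so failed requests trace too
            self.stages[name] = (time.perf_counter() - start) * 1000

trace = RequestTrace()
with trace.stage("vecdb_ms"):
    time.sleep(0.01)  # stand-in for the vector search call
with trace.stage("pack_ms"):
    trace.counters["tokens_in"] = 2800
print(sorted(trace.stages), trace.counters)
```

Ship the dict to your tracing backend per request; the cost-per-request estimate falls out of `tokens_in` and `tokens_out` at logging time.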
Business impact with rough math
Let us compare two configurations at 100k requests per day.
Naive path:
- top_k=40, average 400 tokens per chunk, 16k input tokens
- cross-encoder rerank on 40 pairs
- p95 4.5 to 7.0 s under load
- Input tokens dominate 85 to 95 percent of cost
Optimized path:
- dynamic k=8 to 16, abstracts at 200 tokens, 1.6k to 3.2k input tokens
- rerank only when similarity spread is flat, at k=16 to 20
- p95 1.8 to 2.7 s
- 60 to 80 percent reduction in input tokens, similar accuracy after tuning
If your token cost is the majority line item, cutting input by 4x usually cuts total cost by 3x to 3.5x. Reranker gating trims a few more percent and stabilizes p95. Co-location and caching remove the silly 200 to 400 ms network tax per request.
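Sanity-checking that claim with rough numbers. The $3 per million input tokens and the $500 per day of fixed retrieval and infra spend are made-up assumptions; plug in your own:

```python
def daily_cost(requests, tokens_in_per_req, price_per_mtok=3.0, fixed_other=500.0):
    """Estimate daily spend when input tokens dominate.
    fixed_other: daily retrieval/rerank/infra cost that does not shrink
    when you cut input tokens."""
    input_cost = requests * tokens_in_per_req / 1_000_000 * price_per_mtok
    return input_cost + fixed_other

naive = daily_cost(100_000, 16_000)      # 16k input tokens per request
optimized = daily_cost(100_000, 4_000)   # input cut 4x
print(round(naive), round(optimized), round(naive / optimized, 1))  # 5300 1700 3.1
```

The fixed overhead is why a 4x token cut lands around a 3x total cut rather than 4x: the part of the bill that is not input tokens does not move.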
Scaling risk avoided:
- Without sharding and filters, every new tenant is a latency multiplier.
- Without budgets, product teams will keep adding extra steps until p95 fails SLOs.
- Without offline evals, teams widen k to hide recall problems and quietly triple costs.
Key takeaways
- Stop paying LLMs to read junk. Cap tokens in, summarize offline, and pack cleanly.
- Fix retrieval precision before widening top_k. Filters and chunking beat brute force.
- Rerank only when the candidate set is ambiguous. Do not default to expensive.
- Keep everything in one region and keep connections warm. Network tax is real.
- Add cheap and expensive paths with explicit budgets. Make the system say no.
- Measure at each stage. You cannot tune what you do not see.
If this resonates
If your RAG stack is slow or the unit economics look shaky, this is exactly the kind of work I do. I help teams set budgets, fix retrieval, choose sane index settings, and cut token waste without giving up quality. If you want a short audit that ends with a concrete action plan, reach out.

