The quiet failure that buries RAG systems
If your RAG pipeline works in staging but falls apart under real traffic, there is a decent chance your vector database is the reason. Not the model, not the prompt. The retrieval tier. I have seen teams swap models three times before admitting the vector store was the bottleneck all along. By then they had rising LLM costs, creeping latency, and a three-month reindex ahead of them.
This is avoidable. But the happy-path demos hide the hard parts.
Where the problem shows up and why
- p95 is fine, p99 explodes when filters are applied. Users think the app is flaky. It is your ANN index thrashing.
- You re-embed with a new model and discover a multi-week full rebuild with no safe rollback. Traffic has to share nodes with compaction and segment merges.
- ACL or multi-tenant filters crater recall because the DB does ANN first and filters second, or the filter bitmaps are too sparse.
- Memory doubles without warning. HNSW graph overhead grows with M and the number of layers, and it is easy to underestimate. You now need a bigger instance class, plus replicas.
- Write-heavy pipelines block reads. Online HNSW insertions compete with queries and cause tail spikes.
- Metadata joins force you out of the vector DB into your OLTP store. Now you have cross-system consistency and cold cache misses.
What most teams misunderstand: vector search is not just kNN. It is an indexing strategy, a filtering strategy, a memory layout, and an ingestion plan. Pick the wrong one and you inherit a failure mode you cannot patch with prompts.
Technical deep dive: architecture, trade-offs, failure modes
Retrieval tier shape
A production retrieval stack usually looks like this:
- Embedding producer: streaming or batch, often async from the app.
- Vector store: ANN index, metadata, partitions, replicas.
- Reranker: cross-encoder or LLM re-rank to fix approximate errors.
- Policy and ACL: enforced via filters or pre-materialized lists.
- Cache: result cache by query fingerprint and by top centroid.
The cracks form at the boundaries.
Index families and what they really cost
- HNSW in-memory
- Latency: excellent at low p99 if memory-resident.
- Memory: expensive. 768 dims x 4 bytes is ~3 KB per vector just for floats. Add HNSW graph overhead and metadata and you are near 4 to 6 KB per vector in practice. 100 million vectors is several hundred GB of RAM plus replicas. You feel this in your AWS bill.
- Inserts: online, but high write rates degrade p99. Long GC or merges show up as sawtooth latency.
- IVF, IVFPQ, DiskANN style
- Latency: good if well tuned and on fast NVMe. Slightly higher tail than in-memory but often acceptable with a reranker.
- Memory: much lower with PQ. You trade exactness for compression and speed on SSDs.
- Operationally friendlier for large scale and cheaper to replicate.
- Lucene based (OpenSearch, Elasticsearch with kNN)
- Strong metadata filtering and query planning. Good for hybrid BM25 + vector.
- Two-phase execution is common: pre-filter then ANN, or vice versa. If the planner guesses wrong under high selectivity, recall tanks or latency spikes.
- Segment merges can bite you during ingestion bursts.
- pgvector and friends
- Great for small to medium datasets, strong transactions, easy joins, simpler ops.
- Falls over under heavy ANN with complex filters. Parallel query helps but you will hit a wall.
Distance metrics and subtle mistakes
- Cosine vs dot vs L2 is not a footnote. If the store expects L2 but your embeddings assume cosine, recall losses of 20 to 30 percent are normal. L2 normalize at write time if you use cosine.
- Some stores fake cosine with dot product by embedding a norm term. Some do not. Check docs, then verify with an offline recall suite.
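The write-time fix is mechanical. A minimal sketch with NumPy: on unit vectors, squared L2 distance equals 2 - 2·cos, so normalizing once at ingestion makes L2 and dot-product stores return the cosine ordering.

```python
import numpy as np

def l2_normalize(vectors: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Normalize rows to unit length so L2 and dot-product rankings
    match cosine ranking. eps guards against zero vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)

# On unit vectors, squared L2 distance is 2 - 2*cos_sim, so the
# nearest-neighbor ordering is identical under both metrics.
a = l2_normalize(np.random.randn(4, 768).astype(np.float32))
q = l2_normalize(np.random.randn(1, 768).astype(np.float32))

cos_order = np.argsort(-(a @ q.T).ravel())
l2_order = np.argsort(np.linalg.norm(a - q, axis=1))
assert (cos_order == l2_order).all()
```

This is also the core of an offline recall suite: run both orderings against your real store and diff them before trusting the docs.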
Filtering and ACLs
Filters are what kill recall in production. A lot of vector DBs do ANN first, then filter. If your filter selectivity is tight, you get empty top-k and a fallback to slow exhaustive search. Multi-tenant data with per-user ACL pushes you into bitmap hell unless the engine is filter-aware at index time or supports pre-filtered postings.
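If you are stuck on a post-filtering engine, adaptive oversampling is the usual stopgap: fetch more ANN candidates until the filter leaves enough survivors. A sketch, where `ann_search` is a hypothetical stand-in for your store's query call:

```python
def filtered_search(ann_search, allowed_ids, k, max_k=2048):
    """Post-filter with adaptive oversampling: a stopgap when the
    engine filters after ANN. ann_search(fetch_k) -> candidate id
    list is a stand-in for your store's query call (hypothetical)."""
    fetch_k = k * 4  # initial oversample; tune to filter selectivity
    while True:
        hits = [i for i in ann_search(fetch_k) if i in allowed_ids]
        if len(hits) >= k or fetch_k >= max_k:
            return hits[:k]
        fetch_k = min(fetch_k * 2, max_k)  # widen and retry
```

Note the cost: at 1 percent selectivity you are fetching roughly 100x the candidates, which is exactly the tail-latency spike users report.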
Sharding and replication
- Global HNSW is hard to shard. If your store does per-shard ANN then merges results, make sure the shard fanout does not dominate tail latency.
- Cross-region replication of large ANN indexes is slow and costly. If you need multi-region RTO under an hour, plan a staging index plus WAL shipping or object-store based snapshots.
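The scatter-gather merge itself is cheap; the tail is not. A minimal sketch of merging per-shard top-k lists, assuming higher score is better:

```python
import heapq

def merge_shard_results(per_shard, k):
    """Merge per-shard (score, doc_id) hit lists into a global top-k.
    The merge is O(total hits); the real cost is that the query is
    only as fast as the slowest shard, so fanout sets your p99."""
    return heapq.nlargest(k, (hit for shard in per_shard for hit in shard))
```

With 20-way fanout, p99 of the query is roughly p99.95 of a single shard, which is why wide fanout quietly dominates tail latency.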
Ingestion and rebuilds
- Online insert into HNSW or Lucene looks fine until you ramp. Background merges steal IO and CPU from queries and cause tail spikes.
- Model upgrades without dual indexing are the classic trap. New embeddings shift the geometry, so the old index is now wrong, and a full rebuild while serving traffic is where incidents are born.
Practical ways to not shoot yourself in the foot
1. Choose based on the workload, not a benchmark chart
- Small dataset, strong transactional needs, moderate filters: start with Postgres + pgvector. Simpler ops and easy rollback.
- Heavy filters, need BM25 + vector, search-style relevance: Lucene based. Accept segment merge management and plan capacity for it.
- Massive corpus, low-latency, cost sensitive: IVF or DiskANN on NVMe with PQ, plus a reranker. HNSW only if you truly need sub-20 ms p99 at scale and can afford the RAM.
2. Budget memory like an adult
- Estimate per-vector memory: vector_size_bytes + index_overhead + metadata.
- For 768-d float32 vectors: ~3 KB base. Add 1 to 3 KB index overhead depending on M, efConstruction, and implementation. Round to 5 KB. Multiply by vectors and replicas. Do not forget hot spares.
- If the math gives you 1 TB RAM, you are in compressed or disk ANN territory. Pretending otherwise only delays the pain.
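The math above fits in a few lines. A back-of-envelope sketch, with the graph overhead approximated from M (treat the result as a floor, not a quote):

```python
def hnsw_ram_gb(n_vectors, dims, m=16, replicas=2,
                metadata_bytes=256, bytes_per_dim=4):
    """Back-of-envelope RAM for an in-memory HNSW deployment.
    Graph overhead approximated as ~2*M neighbor ids (4 bytes each)
    per vector plus per-node slack; real implementations vary."""
    vector_b = dims * bytes_per_dim            # raw float storage
    graph_b = 2 * m * 4 + 64                   # link lists + per-node slack
    per_vector = vector_b + graph_b + metadata_bytes
    return n_vectors * per_vector * replicas / 1024**3

# 100M 768-d vectors with 2 replicas: roughly 650 GB under these assumptions.
estimate = hnsw_ram_gb(100_000_000, 768)
```

If that number is over your comfortable node size, you already know the answer: PQ or disk ANN, not a bigger instance.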
3. Make filters first-class
- If you have ACL or high-cardinality filters, pick a store that can push filters into the ANN stage or maintain filter-aware postings per list.
- Precompute tenant or ACL partitions. Co-locate data by tenant to keep filters selective and shard-local.
- Measure filter selectivity distributions and test under the 5th and 95th percentile. That is where p99 lives.
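Measuring selectivity is a one-liner once you log per-query filter match counts. A sketch using the standard library:

```python
import statistics

def selectivity_percentiles(filter_match_counts, corpus_size):
    """Given per-query counts of documents passing the filter,
    return 5th/50th/95th percentile selectivity. Load-test ANN
    recall and p99 at the tails, not just the median."""
    sel = sorted(c / corpus_size for c in filter_match_counts)
    q = statistics.quantiles(sel, n=20)  # 19 cut points at 5% steps
    return {"p5": q[0], "p50": q[9], "p95": q[18]}
```

The p5 case (tightest filters) is where post-filtering returns empty top-k; the p95 case is where filter bitmaps get expensive.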
4. Treat model changes as schema changes
- Version embeddings. Write new vectors to a parallel index or collection.
- Shadow traffic to the new index and compare recall@k and downstream answer quality for a week.
- Cut over behind a flag. Keep the old index for rollback until error budgets stabilize.
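The dual-index pattern is small enough to sketch. Assuming a hypothetical index interface with `embed`, `upsert`, and `search`, the router dual-writes during migration and flips reads behind a flag:

```python
class EmbeddingRouter:
    """Route between embedding versions behind a cutover flag.
    Writes go to both indexes during migration so either side can
    serve; reads follow the flag, leaving the old index warm for
    rollback. Index interface (embed/upsert/search) is hypothetical."""
    def __init__(self, live, candidate, cutover=False):
        self.live, self.candidate, self.cutover = live, candidate, cutover

    def write(self, doc_id, text):
        for idx in (self.live, self.candidate):  # dual-write
            idx.upsert(doc_id, idx.embed(text))

    def query(self, text, k=10):
        primary = self.candidate if self.cutover else self.live
        return primary.search(primary.embed(text), k)
```

Shadow traffic is the same router with reads sent to both sides and only the live result returned, logging the diff.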
5. Isolate ingestion from serving
- Use a write buffer and async indexers. Readers should not share the same process pool as heavy builders.
- Time partition your index so compactions are local. Write to the newest segment, query across N recent segments, and roll older segments to read-mostly hardware.
6. Expect to rerank
- Do not spend 2x on RAM to chase perfect recall when a small cross-encoder or a re-ranking pass fixes top-k noise.
- Design for k=100 to 200, then rerank to top-10. This keeps ANN cheap and quality stable.
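The oversample-then-rerank shape is a few lines. A sketch where `ann_search` and `score_pair` (the cross-encoder scoring call) are hypothetical stand-ins for your stack:

```python
def retrieve_and_rerank(ann_search, score_pair, query, k_final=10, k_ann=150):
    """Fetch a generous ANN candidate set, then let a reranker
    (score_pair(query, doc) -> float, hypothetical) pick the final
    top-k. ANN stays cheap and approximate; the reranker absorbs
    the top-k noise that perfect recall would cost 2x RAM to remove."""
    candidates = ann_search(query, k_ann)
    scored = sorted(candidates, key=lambda d: score_pair(query, d), reverse=True)
    return scored[:k_final]
```

The reranker scores 150 pairs per query, which a small cross-encoder handles in tens of milliseconds on modest hardware.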
7. Normalize and validate, every time
- L2 normalize if you use cosine. Enforce at write time.
- Lock distance metric by collection. Reject writes that do not match expected dim or metric.
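Enforcement belongs in the write path, not in a runbook. A minimal sketch with a per-collection spec (collection name and schema here are illustrative):

```python
import numpy as np

# Illustrative per-collection contract: dim and metric are locked.
EXPECTED = {"docs_v2": {"dim": 768, "metric": "cosine"}}

def validate_write(collection, vector):
    """Reject writes that do not match the collection's declared
    dim; normalize at write time when the metric is cosine."""
    spec = EXPECTED[collection]
    v = np.asarray(vector, dtype=np.float32)
    if v.shape != (spec["dim"],):
        raise ValueError(f"dim mismatch: got {v.shape}, want ({spec['dim']},)")
    if spec["metric"] == "cosine":
        n = np.linalg.norm(v)
        if n == 0:
            raise ValueError("zero vector cannot be cosine-normalized")
        v = v / n
    return v
```

A rejected write at ingestion is a log line; a mismatched metric discovered in production is a silent recall regression.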
8. Observability for retrieval, not just API latency
- Track recall@k on a fixed eval set, empty-hit rate under filters, and tail latency with and without ingestion running.
- Emit filter selectivity and shard fanout as first-class metrics.
- Alert on sudden drops in average inner product or cosine of retrieved items against known gold sets. That catches silent embedding drift.
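The core metric is simple to compute on a fixed gold set. A sketch:

```python
def recall_at_k(retrieved, relevant, k=10):
    """recall@k over a fixed gold set: the fraction of known-relevant
    ids that appear in the top-k. Track this per deploy; a sudden
    drop with stable latency usually means silent embedding drift
    or a filter regression, not an infrastructure problem."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0
```

Run it on a few hundred held-out queries in CI and after every index or model change; it is the cheapest alarm you will ever wire up.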
9. Cache the right thing
- Cache query fingerprints to top-k IDs, not raw vectors. Invalidate on partition roll or ACL change.
- Layer a short TTL cache at the reranker stage. It saves both ANN QPS and LLM tokens on repeated tasks.
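The fingerprint-to-IDs cache is small enough to sketch in full. The version bump makes ACL or partition invalidation O(1) instead of a scan:

```python
import hashlib
import json
import time

class TopKCache:
    """Cache query fingerprint -> top-k doc ids with a short TTL.
    Storing ids, not vectors, keeps entries tiny, and folding a
    version counter into the key makes invalidation on ACL or
    partition changes a single increment."""
    def __init__(self, ttl_s=60):
        self.ttl_s, self.version, self._store = ttl_s, 0, {}

    def _key(self, query, filters):
        raw = json.dumps({"q": query, "f": filters, "v": self.version},
                         sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, query, filters):
        entry = self._store.get(self._key(query, filters))
        if entry and time.monotonic() - entry[1] < self.ttl_s:
            return entry[0]
        return None

    def put(self, query, filters, top_k_ids):
        self._store[self._key(query, filters)] = (top_k_ids, time.monotonic())

    def invalidate_all(self):
        self.version += 1  # old keys become unreachable; GC them lazily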
Business impact that shows up on your bill and roadmap
- Cost: in-memory HNSW at 100M items with 2x replication often means several high-mem nodes. That is easily tens of thousands per month before you pay for LLM inference. PQ on NVMe can cut memory 5 to 10x at the cost of slightly more CPU and a reranker.
- Performance: filter-aware indexes reduce p99 by avoiding post-filtering misses. That shortens the end-to-end path and saves tokens because fewer retries and fallbacks hit the LLM.
- Scaling risk: a full reindex for a model change can take days. If your DB cannot build offline or you cannot run dual indexes, your roadmap will slip. This is the hidden lock-in that hurts more than API compatibility.
Key takeaways
- Pick the index family for your filter profile and scale, not for a pretty top-1 recall chart.
- Budget RAM with real per-vector math. If it is huge, move to PQ or disk ANN plus reranking.
- Version embeddings and cut over behind flags. Never in-place upgrade vectors.
- Make filters shard-local and push them into ANN where possible.
- Isolate ingestion from serving. Segment merges and online HNSW inserts will wreck your tail.
- Monitor recall, empty hits, and filter selectivity, not just p95 latency.
- Normalize embeddings and match the metric. Cosine vs L2 mistakes are silent killers.
If this sounds familiar
If you are fighting tail latency under filters, planning a re-embed, or staring at an index that will not finish building, I have been there. This is exactly the kind of thing I help teams fix when systems start breaking at scale. Happy to review an architecture or run a focused drill on your retrieval tier.

