{"id":31,"date":"2025-03-22T14:07:53","date_gmt":"2025-03-22T14:07:53","guid":{"rendered":"https:\/\/angirash.in\/blog\/2025\/03\/22\/vector-db-choice-can-kill-your-system\/"},"modified":"2025-03-22T14:07:53","modified_gmt":"2025-03-22T14:07:53","slug":"vector-db-choice-can-kill-your-system","status":"publish","type":"post","link":"https:\/\/angirash.in\/blog\/2025\/03\/22\/vector-db-choice-can-kill-your-system\/","title":{"rendered":"Why vector DB choice can kill your system"},"content":{"rendered":"<h2>The quiet failure that buries RAG systems<\/h2>\n<p>If your RAG works in staging but falls apart under real traffic, there is a decent chance your vector database is the reason. Not the model, not the prompt. The retrieval tier. I have seen teams swap models three times before admitting the vector store was the bottleneck all along. By then they had rising LLM costs, creeping latency, and a 3 month reindex ahead of them.<\/p>\n<p>This is avoidable. But the happy-path demos hide the hard parts.<\/p>\n<h2>Where the problem shows up and why<\/h2>\n<ul>\n<li>p95 is fine, p99 explodes when filters are applied. Users think the app is flaky. It is your ANN index thrashing.<\/li>\n<li>You re-embed with a new model and discover a multi-week full rebuild with no safe rollback. Traffic has to share nodes with compaction and segment merges.<\/li>\n<li>ACL or multi-tenant filters crater recall because the DB does ANN first and filters second, or the filter bitmaps are too sparse.<\/li>\n<li>Memory doubles without warning. HNSW overhead scales non-linearly with parameters and levels. You now need a bigger instance class, plus replicas.<\/li>\n<li>Write-heavy pipelines block reads. Online HNSW insertions compete with queries and cause tail spikes.<\/li>\n<li>Metadata joins force you out of the vector DB into your OLTP store. Now you have cross-system consistency and cold cache misses.<\/li>\n<\/ul>\n<p>What most teams misunderstand: vector search is not just kNN. 
It is an indexing strategy, a filtering strategy, a memory layout, and an ingestion plan. Pick the wrong one and you inherit a failure mode you cannot patch with prompts.<\/p>\n<h2>Technical deep dive: architecture, trade-offs, failure modes<\/h2>\n<h3>Retrieval tier shape<\/h3>\n<p>A production retrieval stack usually looks like this:<\/p>\n<ul>\n<li>Embedding producer: streaming or batch, often async from the app.<\/li>\n<li>Vector store: ANN index, metadata, partitions, replicas.<\/li>\n<li>Reranker: cross-encoder or LLM re-rank to fix approximate errors.<\/li>\n<li>Policy and ACL: enforced via filters or pre-materialized lists.<\/li>\n<li>Cache: result cache by query fingerprint and by top centroid.<\/li>\n<\/ul>\n<p>The cracks form at the boundaries.<\/p>\n<h3>Index families and what they really cost<\/h3>\n<ul>\n<li>HNSW in-memory\n<ul>\n<li>Latency: excellent at low p99 if memory-resident.<\/li>\n<li>Memory: expensive. 768 dims x 4 bytes is ~3 KB per vector just for floats. Add HNSW graph overhead and metadata and you are near 4 to 6 KB per vector in practice. 100 million vectors is several hundred GB of RAM plus replicas. You feel this in your AWS bill.<\/li>\n<li>Inserts: online, but high write rates degrade p99. Long GC or merges show up as sawtooth latency.<\/li>\n<\/ul>\n<\/li>\n<li>IVF, IVFPQ, DiskANN style\n<ul>\n<li>Latency: good if well tuned and on fast NVMe. Slightly higher tail than in-memory but often acceptable with a reranker.<\/li>\n<li>Memory: much lower with PQ. You trade exactness for compression and speed on SSDs.<\/li>\n<li>Operationally friendlier for large scale and cheaper to replicate.<\/li>\n<\/ul>\n<\/li>\n<li>Lucene based (OpenSearch, Elasticsearch with kNN)\n<ul>\n<li>Strong metadata filtering and query planning. Good for hybrid BM25 + vector.<\/li>\n<li>Two-phase exec is common: pre-filter then ANN or vice versa. 
If the planner guesses wrong under high selectivity, recall tanks or latency spikes.<\/li>\n<li>Segment merges can bite you during ingestion bursts.<\/li>\n<\/ul>\n<\/li>\n<li>pgvector and friends\n<ul>\n<li>Great for small to medium datasets, strong transactions, easy joins, simpler ops.<\/li>\n<li>Falls over under heavy ANN with complex filters. Parallel query helps but you will hit a wall.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Distance metrics and subtle mistakes<\/h3>\n<ul>\n<li>Cosine vs dot vs L2 is not a footnote. If the store expects L2 but your embeddings assume cosine, recall losses of 20 to 30 percent are normal. L2 normalize at write time if you use cosine.<\/li>\n<li>Some stores fake cosine with dot product by embedding a norm term. Some do not. Check docs, then verify with an offline recall suite.<\/li>\n<\/ul>\n<h3>Filtering and ACLs<\/h3>\n<p>Filters are what kill recall in production. A lot of vector DBs do ANN first, then filter. If your filter selectivity is tight, you get empty top-k and a fallback to slow exhaustive search. Multi-tenant data with per-user ACL pushes you into bitmap hell unless the engine is filter-aware at index time or supports pre-filtered postings.<\/p>\n<h3>Sharding and replication<\/h3>\n<ul>\n<li>Global HNSW is hard to shard. If your store does per-shard ANN then merges results, make sure the shard fanout does not dominate tail latency.<\/li>\n<li>Cross-region replication of large ANN indexes is slow and costly. If you need multi-region RTO under an hour, plan a staging index plus WAL shipping or object-store based snapshots.<\/li>\n<\/ul>\n<h3>Ingestion and rebuilds<\/h3>\n<ul>\n<li>Online insert into HNSW or Lucene looks fine until you ramp. Background merges steal IO and CPU from queries and cause tail spikes.<\/li>\n<li>A model upgrade without dual indexing is the classic trap. New embeddings shift geometry. Old index is now wrong. 
Full rebuild while serving traffic is where incidents are born.<\/li>\n<\/ul>\n<h2>Practical ways to not shoot yourself in the foot<\/h2>\n<h3>1. Choose based on the workload, not a benchmark chart<\/h3>\n<ul>\n<li>Small dataset, strong transactional needs, moderate filters: start with Postgres + pgvector. Simpler ops and easy rollback.<\/li>\n<li>Heavy filters, need BM25 + vector, search-style relevance: Lucene based. Accept segment merge management and plan capacity for it.<\/li>\n<li>Massive corpus, low-latency, cost sensitive: IVF or DiskANN on NVMe with PQ, plus a reranker. HNSW only if you truly need sub-20 ms p99 at scale and can afford the RAM.<\/li>\n<\/ul>\n<h3>2. Budget memory like an adult<\/h3>\n<ul>\n<li>Estimate per-vector memory: vector_size_bytes + index_overhead + metadata.<\/li>\n<li>For 768-d float vectors: ~3 KB base. Add 1 to 3 KB index overhead depending on M, efConstruction, and implementation. Round to 5 KB. Multiply by vectors and replicas. Do not forget hot spares.<\/li>\n<li>If the math gives you 1 TB RAM, you are in compressed or disk ANN territory. Pretending otherwise only delays the pain.<\/li>\n<\/ul>\n<h3>3. Make filters first-class<\/h3>\n<ul>\n<li>If you have ACL or high-cardinality filters, pick a store that can push filters into the ANN stage or maintain filter-aware postings per list.<\/li>\n<li>Precompute tenant or ACL partitions. Co-locate data by tenant to keep filters selective and shard-local.<\/li>\n<li>Measure filter selectivity distributions and test under the 5th and 95th percentile. That is where p99 lives.<\/li>\n<\/ul>\n<h3>4. Treat model changes as schema changes<\/h3>\n<ul>\n<li>Version embeddings. Write new vectors to a parallel index or collection.<\/li>\n<li>Shadow traffic to the new index and compare recall@k and downstream answer quality for a week.<\/li>\n<li>Cut over behind a flag. Keep the old index for rollback until error budgets stabilize.<\/li>\n<\/ul>\n<h3>5. 
Isolate ingestion from serving<\/h3>\n<ul>\n<li>Use a write buffer and async indexers. Readers should not share the same process pool as heavy builders.<\/li>\n<li>Time partition your index so compactions are local. Write to the newest segment, query across N recent segments, and roll older segments to read-mostly hardware.<\/li>\n<\/ul>\n<h3>6. Expect to rerank<\/h3>\n<ul>\n<li>Do not spend 2x on RAM to chase perfect recall when a small cross-encoder or a re-ranking pass fixes top-k noise.<\/li>\n<li>Design for k=100 to 200, then rerank to top-10. This keeps ANN cheap and quality stable.<\/li>\n<\/ul>\n<h3>7. Normalize and validate, every time<\/h3>\n<ul>\n<li>L2 normalize if you use cosine. Enforce at write time.<\/li>\n<li>Lock distance metric by collection. Reject writes that do not match expected dim or metric.<\/li>\n<\/ul>\n<h3>8. Observability for retrieval, not just API latency<\/h3>\n<ul>\n<li>Track recall@k on a fixed eval set, empty-hit rate under filters, and tail latency with and without ingestion running.<\/li>\n<li>Emit filter selectivity and shard fanout as first-class metrics.<\/li>\n<li>Alert on sudden drops in average inner product or cosine of retrieved items against known gold sets. That catches silent embedding drift.<\/li>\n<\/ul>\n<h3>9. Cache the right thing<\/h3>\n<ul>\n<li>Cache query fingerprints to top-k IDs, not raw vectors. Invalidate on partition roll or ACL change.<\/li>\n<li>Layer a short TTL cache at the reranker stage. It saves both ANN QPS and LLM tokens on repeated tasks.<\/li>\n<\/ul>\n<h2>Business impact that shows up on your bill and roadmap<\/h2>\n<ul>\n<li>Cost: in-memory HNSW at 100M items with 2x replication often means several high-mem nodes. That is easily tens of thousands per month before you pay for LLM inference. PQ on NVMe can cut memory 5 to 10x at the cost of slightly more CPU and a reranker.<\/li>\n<li>Performance: filter-aware indexes reduce p99 by avoiding post-filtering misses. 
That shortens the end-to-end path and saves tokens because fewer retries and fallbacks hit the LLM.<\/li>\n<li>Scaling risk: a full reindex for a model change can take days. If your DB cannot build offline or you cannot run dual indexes, your roadmap will slip. This is the hidden lock-in that hurts more than API compatibility.<\/li>\n<\/ul>\n<h2>Key takeaways<\/h2>\n<ul>\n<li>Pick the index family for your filter profile and scale, not for a pretty top-1 recall chart.<\/li>\n<li>Budget RAM with real per-vector math. If it is huge, move to PQ or disk ANN plus reranking.<\/li>\n<li>Version embeddings and cut over behind flags. Never in-place upgrade vectors.<\/li>\n<li>Make filters shard-local and push them into ANN where possible.<\/li>\n<li>Isolate ingestion from serving. Segment merges and online HNSW inserts will wreck your tail.<\/li>\n<li>Monitor recall, empty hits, and filter selectivity, not just p95 latency.<\/li>\n<li>Normalize embeddings and match the metric. Cosine vs L2 mistakes are silent killers.<\/li>\n<\/ul>\n<h2>If this sounds familiar<\/h2>\n<p>If you are fighting tail latency under filters, planning a re-embed, or staring at an index that will not finish building, I have been there. This is exactly the kind of thing I help teams fix when systems start breaking at scale. Happy to review an architecture or run a focused drill on your retrieval tier.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The quiet failure that buries RAG systems If your RAG works in staging but falls apart under real traffic, there is a decent chance your vector database is the reason&#8230;. 
<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[12],"tags":[17,16,18],"class_list":["post-31","post","type-post","status-publish","format-standard","hentry","category-ai-failures","tag-ai-cost","tag-ai-scalability","tag-vector-db"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/31","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/comments?post=31"}],"version-history":[{"count":0,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/31\/revisions"}],"wp:attachment":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/media?parent=31"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/categories?post=31"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/tags?post=31"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}