{"id":68,"date":"2025-10-02T10:23:58","date_gmt":"2025-10-02T10:23:58","guid":{"rendered":"https:\/\/angirash.in\/blog\/2025\/07\/14\/hybrid-search-vs-vector-search-production\/"},"modified":"2026-04-09T23:27:29","modified_gmt":"2026-04-09T23:27:29","slug":"hybrid-search-vs-vector-search-production","status":"publish","type":"post","link":"https:\/\/angirash.in\/blog\/2025\/10\/02\/hybrid-search-vs-vector-search-production\/","title":{"rendered":"Hybrid search vs vector search: what actually works in production"},"content":{"rendered":"<h2>The painful pattern<\/h2>\n<p>The vector-only demo looks great in a sandbox. Then you ship and support tickets pile up. Acronyms don\u2019t resolve, filters don\u2019t filter, legal asks for deterministic behavior, and your latency SLO gets wrecked by a reranker bolted on at the last minute. I\u2019ve seen teams rip out a fresh vector stack and crawl back to BM25 out of sheer pain.<\/p>\n<p>This is avoidable. Most production systems should be hybrid. The trick is knowing when pure vector is enough, and how to build a hybrid stack that doesn\u2019t turn into a Franken-search.<\/p>\n<h2>Where it breaks and why<\/h2>\n<p>These are the recurring hotspots:<\/p>\n<ul>\n<li>Access control and filters\n<ul>\n<li>Vector-only retrieval often ignores ACLs and structured filters, or applies them post hoc which kills recall or performance.<\/li>\n<\/ul>\n<\/li>\n<li>Long-tail entity recall\n<ul>\n<li>Users type exact codes, SKUs, citations, or log signatures. Dense embeddings blur those edges. Sparse signals win here.<\/li>\n<\/ul>\n<\/li>\n<li>Language mix and abbreviations\n<ul>\n<li>Medical, finance, internal jargon. Lexical matching catches what generic embeddings miss.<\/li>\n<\/ul>\n<\/li>\n<li>Evaluation drift\n<ul>\n<li>Embedding model updates shift nearest neighbors. 
Teams discover this in production because they skipped regression suites.<\/li>\n<\/ul>\n<\/li>\n<li>Cost and tail latency\n<ul>\n<li>ANN plus reranking can be cheap at p50 and awful at p95. Add multi-tenant filters and watch it spike.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>What teams misunderstand:<\/p>\n<ul>\n<li>\u201cVector databases handle everything.\u201d They handle nearest neighbor math. They don\u2019t solve query understanding, ACLs, or ranking policy.<\/li>\n<li>\u201cRerankers fix bad recall.\u201d Rerankers only rearrange what you already fetched. If your candidate set misses, you lose.<\/li>\n<li>\u201cChunk smaller for better recall.\u201d Over-chunking bloats indexes, hurts ranking, and increases hallucination risk in RAG.<\/li>\n<\/ul>\n<h2>Technical deep dive: architectures and trade-offs<\/h2>\n<h3>Vector-only retrieval<\/h3>\n<p>When it works:<br \/>\n&#8211; Small to medium corpora without complex filters<br \/>\n&#8211; Semantically fuzzy queries where exact match is rare<br \/>\n&#8211; You can tolerate occasional misses on exact strings<\/p>\n<p>Pitfalls:<br \/>\n&#8211; ACL prefiltering not supported or too slow<br \/>\n&#8211; Model drift changes neighbors silently<br \/>\n&#8211; Poor performance on code, formulas, identifiers<\/p>\n<h3>Classic hybrid: sparse + dense with rerank<\/h3>\n<p>What I recommend for most production search and RAG:<br \/>\n&#8211; Stage 1 candidates: union of<br \/>\n  &#8211; Sparse search (BM25 or learned sparse) with filters applied<br \/>\n  &#8211; ANN vector search with the same filters where supported<br \/>\n&#8211; Merge: simple linear fusion or Reciprocal Rank Fusion<br \/>\n&#8211; Stage 2 rerank: cross-encoder reranker on top N candidates<\/p>\n<p>Trade-offs:<br \/>\n&#8211; More moving parts, but easier to reason about failures<br \/>\n&#8211; Predictable recall on exact terms, strong coverage on fuzzy queries<\/p>\n<h3>Learned sparse and multi-vector models<\/h3>\n<p>Worth considering when you want 
hybrid benefits without two indexes:<br \/>\n&#8211; Learned sparse encoders: SPLADE, ELSER, Jina V2 sparse<br \/>\n&#8211; Multi-vector dense: ColBERT, bge-m3<\/p>\n<p>Pros:<br \/>\n&#8211; Strong lexical recall without managing synonyms<br \/>\n&#8211; Good zero-shot domain adaptation compared to plain BM25<\/p>\n<p>Cons:<br \/>\n&#8211; Index size grows fast<br \/>\n&#8211; Operational complexity for multi-vector storage and query plans<\/p>\n<h3>Filters and ACLs<\/h3>\n<ul>\n<li>Prefer prefiltering at retrieval time. Postfiltering after ANN can nuke recall and inflate latency.<\/li>\n<li>Choose engines that support filtered ANN well: Vespa, Elasticsearch\/OpenSearch kNN with filters, Weaviate hybrid with alpha, Qdrant sparse+dense, Milvus with scalar filters. Postgres pgvector can work with pg_trgm plus unions but needs careful tuning.<\/li>\n<\/ul>\n<h2>Failure modes you will hit<\/h2>\n<ul>\n<li>Embedding collapse after model update\n<ul>\n<li>Suddenly everything looks similar. Keep a frozen baseline model for A\/B.<\/li>\n<\/ul>\n<\/li>\n<li>Localization and multi-lingual mismatch\n<ul>\n<li>English-trained embeddings underperform on mixed-language corpora.<\/li>\n<\/ul>\n<\/li>\n<li>SKU and code queries\n<ul>\n<li>Dense-only misses exact strings like \u201cAB-1249-7C.\u201d Add a sparse leg.<\/li>\n<\/ul>\n<\/li>\n<li>Filtered tenant queries\n<ul>\n<li>ANN graphs degrade with hard filters. If your engine does not support filtered ANN natively, precompute per-tenant shards or use early prefilters.<\/li>\n<\/ul>\n<\/li>\n<li>Chunk abuse\n<ul>\n<li>100-token shards with 50-token overlaps balloon your index and hurt rerank quality. 
Use 200-500 tokens with light overlap unless you have very atomic facts.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>A practical build recipe that holds up<\/h2>\n<p>This is what I deploy when a team is moving from PoC to prod.<\/p>\n<h3>Indexing pipeline<\/h3>\n<ul>\n<li>Normalization\n<ul>\n<li>Detect language, strip boilerplate, extract fields, compute fingerprints for dedup.<\/li>\n<\/ul>\n<\/li>\n<li>Chunking\n<ul>\n<li>250-400 tokens, 10-20% overlap. Preserve section hierarchy and titles in metadata.<\/li>\n<\/ul>\n<\/li>\n<li>Embeddings\n<ul>\n<li>One dense model per language or a strong multilingual model. Cache per document hash. Version everything.<\/li>\n<\/ul>\n<\/li>\n<li>Sparse\n<ul>\n<li>BM25 in Elasticsearch\/OpenSearch or learned sparse if you can afford the index size.<\/li>\n<\/ul>\n<\/li>\n<li>Metadata\n<ul>\n<li>Store ACLs, tenants, timestamps, doc type, and source. Keep them queryable and indexable.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Query processing<\/h3>\n<ul>\n<li>Lightweight normalization\n<ul>\n<li>Lowercase, trim, handle quotes for exact modes, detect code-like tokens.<\/li>\n<\/ul>\n<\/li>\n<li>Route\n<ul>\n<li>If the query looks like an identifier or exact phrase, upweight sparse. If it is a natural question, upweight dense.<\/li>\n<\/ul>\n<\/li>\n<li>Caching\n<ul>\n<li>Cache embeddings for frequent queries. Cache top-K results keyed by normalized query + filter signature for a short TTL.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Retrieval<\/h3>\n<ul>\n<li>Run sparse and dense in parallel with the same filters<\/li>\n<li>Fetch K1 from sparse, K2 from dense. Typical start: K1=200, K2=200 for union size under 300 after dedup<\/li>\n<li>Merge with Reciprocal Rank Fusion or weighted score sum. Start with 0.5\/0.5 and tune per dataset<\/li>\n<\/ul>\n<h3>Reranking<\/h3>\n<ul>\n<li>Cross-encoder on top 50-100 candidates<\/li>\n<li>If GPU, run a small fast reranker. 
If CPU-only, use an ONNX-optimized model or a hosted rerank API<\/li>\n<li>For RAG, pass top 10-20 to the LLM. Do not feed the LLM 100 chunks and hope it sorts it out<\/li>\n<\/ul>\n<h3>Guardrails and observability<\/h3>\n<ul>\n<li>Log query traces with candidate sources, scores, filters, and final selection<\/li>\n<li>Track recall proxy\n<ul>\n<li>Compute fraction of answers where sparse-only or dense-only would have missed the final answer<\/li>\n<\/ul>\n<\/li>\n<li>Maintain a labeled eval set\n<ul>\n<li>At least a few hundred queries per tenant or product area, refreshed quarterly<\/li>\n<\/ul>\n<\/li>\n<li>Canary model updates with shadow traffic, compare NDCG and recall@50, watch p95 latency<\/li>\n<\/ul>\n<h2>Cost and performance math that matters<\/h2>\n<ul>\n<li>Storage\n<ul>\n<li>Hybrid means larger indexes. Dense 768-d float16 vectors are ~1.5 KB per chunk. Learned sparse can add several KB. Budget for 2 to 5x growth vs raw text.<\/li>\n<\/ul>\n<\/li>\n<li>Latency\n<ul>\n<li>Target retrieval under 100 ms p95 with filters. Reranker adds 15 to 60 ms depending on model and batch size. Keep total pre-LLM under 150 ms if your app is interactive.<\/li>\n<\/ul>\n<\/li>\n<li>Compute\n<ul>\n<li>Rerankers dominate cost at scale. Batch within a single query. Consider a small reranker for online, large one for nightly reindex or precomputation of static ranks.<\/li>\n<\/ul>\n<\/li>\n<li>Token costs\n<ul>\n<li>Better recall reduces hallucinations and shrinks context. Cutting 1 to 2 chunks per request at GPT-4 class pricing pays for your reranker many times over.<\/li>\n<\/ul>\n<\/li>\n<li>Multi-tenant isolation\n<ul>\n<li>Cross-tenant recall bugs are expensive. Prefer per-tenant collections or tight filters that the engine can push down to ANN. Sharding by tenant reduces tail latency variance.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>What to choose when<\/h2>\n<ul>\n<li>Pure vector\n<ul>\n<li>Small corpus, low filter complexity, Q&amp;A style queries, fast iteration. 
Expect some misses on exact strings. Cheap to operate.<\/li>\n<\/ul>\n<\/li>\n<li>Hybrid sparse + dense with rerank\n<ul>\n<li>Default for enterprise search and production RAG. Handles filters, ACL, jargon, and long-tail. Slightly higher operational overhead, much better reliability.<\/li>\n<\/ul>\n<\/li>\n<li>Learned sparse or multi-vector\n<ul>\n<li>When you want hybrid behavior with fewer moving parts and can handle larger indexes and more complex query plans.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>Key takeaways<\/h2>\n<ul>\n<li>Most production systems need hybrid retrieval. Vector-only is a demo default, not a production default.<\/li>\n<li>Retrieval quality lives and dies on candidate recall. Rerankers cannot rescue missing candidates.<\/li>\n<li>Apply filters at retrieval time. Postfiltering ANN is a recall and latency trap.<\/li>\n<li>Keep chunks reasonable and hierarchical. Over-chunking increases cost and hurts answers.<\/li>\n<li>Treat embeddings as versioned models. Shadow test updates, measure, then roll.<\/li>\n<li>Track recall proxies and per-tenant metrics. You need observability to avoid silent regressions.<\/li>\n<\/ul>\n<h2>If you want a sanity check<\/h2>\n<p>If you\u2019re sitting on a vector-only stack that works fine in staging and falls apart with real users, you\u2019re not alone. I help teams move to hybrid retrieval without blowing up latency or cost. If you want a quick architecture review or a hands-on fix, reach out. This is exactly the kind of thing I debug when systems start breaking at scale.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The painful pattern The vector-only demo looks great in a sandbox. Then you ship and support tickets pile up. 
Acronyms don\u2019t resolve, filters don\u2019t filter, legal asks for deterministic behavior,&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[3],"tags":[17,13,18],"class_list":["post-68","post","type-post","status-publish","format-standard","hentry","category-ai-architecture","tag-ai-cost","tag-rag","tag-vector-db"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/68","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/comments?post=68"}],"version-history":[{"count":1,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/68\/revisions"}],"predecessor-version":[{"id":88,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/68\/revisions\/88"}],"wp:attachment":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/media?parent=68"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/categories?post=68"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/tags?post=68"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}