The painful pattern
The vector-only demo looks great in a sandbox. Then you ship and support tickets pile up. Acronyms don’t resolve, filters don’t filter, legal asks for deterministic behavior, and your latency SLO gets wrecked by a reranker bolted on at the last minute. I’ve seen teams rip out a fresh vector stack and crawl back to BM25 out of sheer pain.
This is avoidable. Most production systems should be hybrid. The trick is knowing when pure vector is enough, and how to build a hybrid stack that doesn’t turn into a Franken-search.
Where it breaks and why
These are the recurring hotspots:
- Access control and filters
- Vector-only retrieval often ignores ACLs and structured filters, or applies them post hoc, which kills recall or performance.
- Long-tail entity recall
- Users type exact codes, SKUs, citations, or log signatures. Dense embeddings blur those edges. Sparse signals win here.
- Language mix and abbreviations
- Medical, finance, internal jargon. Lexical matching catches what generic embeddings miss.
- Evaluation drift
- Embedding model updates shift nearest neighbors. Teams discover this in production because they skipped regression suites.
- Cost and tail latency
- ANN plus reranking can be cheap at p50 and awful at p95. Add multi-tenant filters and watch it spike.
What teams misunderstand:
- “Vector databases handle everything.” They handle nearest neighbor math. They don’t solve query understanding, ACLs, or ranking policy.
- “Rerankers fix bad recall.” Rerankers only rearrange what you already fetched. If your candidate set misses, you lose.
- “Chunk smaller for better recall.” Over-chunking bloats indexes, hurts ranking, and increases hallucination risk in RAG.
Technical deep dive: architectures and trade-offs
Vector-only retrieval
When it works:
- Small to medium corpora without complex filters
- Semantically fuzzy queries where exact match is rare
- You can tolerate occasional misses on exact strings
Pitfalls:
- ACL prefiltering not supported or too slow
- Model drift changes neighbors silently
- Poor performance on code, formulas, identifiers
Classic hybrid: sparse + dense with rerank
What I recommend for most production search and RAG:
- Stage 1 candidates: union of
- Sparse search (BM25 or learned sparse) with filters applied
- ANN vector search with the same filters where supported
- Merge: simple linear fusion or Reciprocal Rank Fusion
- Stage 2 rerank: cross-encoder reranker on top N candidates
Trade-offs:
- More moving parts, but easier to reason about failures
- Predictable recall on exact terms, strong coverage on fuzzy queries
Learned sparse and multi-vector models
Worth considering when you want hybrid benefits without two indexes:
- Learned sparse encoders: SPLADE, ELSER, Jina V2 sparse
- Multi-vector dense: ColBERT, bge-m3
Pros:
- Strong lexical recall without managing synonyms
- Good zero-shot domain adaptation compared to plain BM25
Cons:
- Index size grows fast
- Operational complexity for multi-vector storage and query plans
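Scoring with a learned sparse model reduces to a weighted sparse dot product over expanded terms. A minimal sketch, assuming the encoder outputs `{term: weight}` dicts; the dicts below are made up for illustration, while real models like SPLADE or ELSER produce these weights from a transformer:

```python
# Hypothetical learned-sparse scoring. Each encoder output is a
# {term: weight} dict; relevance is the sparse dot product over
# shared terms. The weights here are invented for illustration.

def sparse_dot(query_vec: dict, doc_vec: dict) -> float:
    # Iterate over the smaller dict and look up terms in the larger one.
    small, large = sorted((query_vec, doc_vec), key=len)
    return sum(w * large[t] for t, w in small.items() if t in large)

query_vec = {"sku": 1.8, "ab-1249-7c": 2.4, "part": 0.6}
doc_vecs = {
    "doc_a": {"ab-1249-7c": 2.1, "replacement": 0.9, "part": 0.7},
    "doc_b": {"widget": 1.2, "catalog": 0.8},
}
scores = {d: sparse_dot(query_vec, v) for d, v in doc_vecs.items()}
print(max(scores, key=scores.get))  # doc_a wins on the exact identifier
```

Note how the exact identifier dominates the score, which is precisely the lexical recall benefit over a dense-only leg.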
Filters and ACLs
- Prefer prefiltering at retrieval time. Postfiltering after ANN can nuke recall and inflate latency.
- Choose engines that support filtered ANN well: Vespa, Elasticsearch/OpenSearch kNN with filters, Weaviate hybrid with alpha, Qdrant sparse+dense, Milvus with scalar filters. Postgres pgvector can work with pg_trgm plus unions but needs careful tuning.
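A toy brute-force example makes the prefilter-vs-postfilter trap concrete: if the ANN leg fetches top-k first and filters afterwards, docs the caller cannot see can crowd out the ones they can. The tenants and vectors below are contrived to force the failure:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy corpus: (doc_id, vector, tenant). The caller may only see tenant "a".
docs = [
    ("d0", [1.0, 0.0], "b"),
    ("d1", [0.9, 0.1], "b"),
    ("d2", [0.8, 0.2], "b"),
    ("d3", [0.1, 0.9], "a"),  # the only doc visible to tenant "a"
]
query = [1.0, 0.05]

def post_filter(query, docs, tenant, k):
    # ANN first (here: exact top-k), filter afterwards -- the trap.
    top = sorted(docs, key=lambda d: cosine(query, d[1]), reverse=True)[:k]
    return [d[0] for d in top if d[2] == tenant]

def pre_filter(query, docs, tenant, k):
    # Filter first, then rank only eligible docs.
    eligible = [d for d in docs if d[2] == tenant]
    top = sorted(eligible, key=lambda d: cosine(query, d[1]), reverse=True)[:k]
    return [d[0] for d in top]

print(post_filter(query, docs, "a", k=3))  # [] -- tenant docs crowded out
print(pre_filter(query, docs, "a", k=3))   # ['d3']
```

Real ANN engines add graph traversal on top, but the recall failure mode is the same: postfiltering a truncated candidate list can return nothing at all.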
Failure modes you will hit
- Embedding collapse after model update
- Suddenly everything looks similar. Keep a frozen baseline model for A/B.
- Localization and multi-lingual mismatch
- English-trained embeddings underperform on mixed-language corpora.
- SKU and code queries
- Dense-only misses exact strings like “AB-1249-7C.” Add a sparse leg.
- Filtered tenant queries
- ANN graphs degrade with hard filters. If your engine does not support filtered ANN natively, precompute per-tenant shards or use early prefilters.
- Chunk abuse
- 100-token chunks with 50-token overlaps balloon your index and hurt rerank quality. Use 200-500 tokens with light overlap unless you have very atomic facts.
A practical build recipe that holds up
This is what I deploy when a team is moving from PoC to prod.
Indexing pipeline
- Normalization
- Language detect, strip boilerplate, extract fields, compute fingerprints for dedup.
- Chunking
- 250-400 tokens, 10-20% overlap. Preserve section hierarchy and titles in metadata.
- Embeddings
- One dense model per language or a strong multilingual model. Cache per document hash. Version everything.
- Sparse
- BM25 in Elasticsearch/OpenSearch or learned sparse if you can afford the index size.
- Metadata
- Store ACLs, tenants, timestamps, doc type, and source. Keep them queryable and indexable.
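The chunking step above is simple to get right with a fixed window and fractional overlap. A minimal sketch, assuming tokenization has already happened upstream; `size=300` and `overlap=0.15` sit in the 250-400 token, 10-20% band suggested:

```python
def chunk_tokens(tokens, size=300, overlap=0.15):
    """Split a token list into fixed-size chunks with fractional overlap.

    Tokenization itself is out of scope here -- any whitespace or
    model tokenizer works upstream of this function.
    """
    step = max(1, int(size * (1 - overlap)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the tail chunk absorbed the rest; avoid tiny stubs
    return chunks

doc = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(doc)
print(len(chunks), len(chunks[0]))  # 4 chunks, first is 300 tokens
```

In production you would also carry the section hierarchy and title into each chunk's metadata, as noted above, rather than emitting bare token windows.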
Query processing
- Lightweight normalization
- Lowercase, trim, handle quotes for exact modes, detect code-like tokens.
- Route
- If query looks like an identifier or exact phrase, upweight sparse. If it is a natural question, upweight dense.
- Caching
- Cache embeddings for frequent queries. Cache top-K results keyed by normalized query + filter signature for a short TTL.
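The routing and caching steps can be sketched in a few lines. The weights and the identifier heuristic below are illustrative starting points, not tuned values:

```python
import hashlib
import json

def looks_like_code(tok: str) -> bool:
    # Identifier heuristic: digits plus a separator, e.g. "AB-1249-7C".
    has_digit = any(c.isdigit() for c in tok)
    has_sep = any(s in tok for s in "-_.")
    return has_digit and has_sep

def route(query: str) -> dict:
    # Quoted phrases and code-like tokens lean sparse; questions lean dense.
    if '"' in query or any(looks_like_code(t) for t in query.split()):
        return {"sparse_weight": 0.8, "dense_weight": 0.2}
    return {"sparse_weight": 0.4, "dense_weight": 0.6}

def cache_key(query: str, filters: dict) -> str:
    # Normalized query + filter signature, usable as a short-TTL cache key.
    norm = " ".join(query.lower().split())
    sig = json.dumps(filters, sort_keys=True)
    return hashlib.sha256(f"{norm}|{sig}".encode()).hexdigest()

print(route("AB-1249-7C replacement part"))   # sparse-heavy
print(route("how do I rotate api keys"))      # dense-heavy
```

The point of hashing the filter signature into the cache key is that the same query under different ACLs must never share a cache entry.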
Retrieval
- Run sparse and dense in parallel with the same filters
- Fetch K1 from sparse, K2 from dense. Typical start: K1=200, K2=200 for union size under 300 after dedup
- Merge with Reciprocal Rank Fusion or weighted score sum. Start with 0.5/0.5 and tune per dataset
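Reciprocal Rank Fusion is a few lines and needs no score calibration between legs, which is why it is a safe default for the merge step. A minimal sketch; `k=60` is the constant from the original RRF paper and a common default:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).

    `rankings` is a list of ranked doc-id lists, e.g. the sparse and
    dense legs. Ranks are 1-based. Returns doc ids best-first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["d3", "d1", "d7"]   # BM25 leg
dense = ["d1", "d9", "d3"]    # ANN leg
print(rrf([sparse, dense]))   # d1 first: it ranks well in both legs
```

Because RRF only uses ranks, it sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales; weighted score sums need per-dataset normalization first.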
Reranking
- Cross-encoder on top 50-100 candidates
- If GPU, run a small fast reranker. If CPU-only, use an ONNX-optimized model or a hosted rerank API
– For RAG, pass the top 10-20 to the LLM. Do not feed the LLM 100 chunks and hope it sorts them out
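The two-stage flow above can be sketched as follows. `score_pair` stands in for a cross-encoder forward pass; here it is a trivial token-overlap placeholder so the flow is runnable, not a real relevance model:

```python
# Two-stage rerank sketch: score only the top_n fused candidates with
# the expensive pairwise model, then hand the LLM at most `keep` passages.

def score_pair(query: str, passage: str) -> float:
    # Placeholder for a cross-encoder score; real systems would run a
    # model forward pass on the (query, passage) pair instead.
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / max(1, len(q))

def rerank(query, candidates, top_n=100, keep=10):
    pool = candidates[:top_n]
    scored = sorted(pool, key=lambda p: score_pair(query, p), reverse=True)
    return scored[:keep]

cands = [
    "rotate api keys quarterly",
    "office seating chart",
    "api key rotation policy",
]
print(rerank("how to rotate api keys", cands, keep=2))
```

The shape is what matters: the candidate list comes from the fused retrieval stage, the pairwise scorer only sees a bounded pool, and `keep` caps the LLM context regardless of pool size.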
Guardrails and observability
- Log query traces with candidate sources, scores, filters, and final selection
- Track recall proxy
- Compute the fraction of queries where sparse-only or dense-only retrieval would have missed the final answer
- Maintain a labeled eval set
- At least a few hundred queries per tenant or product area, refreshed quarterly
- Canary model updates with shadow traffic, compare NDCG and recall@50, watch p95 latency
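The recall proxy above is cheap to compute from logged traces. A minimal sketch, assuming each eval entry records the gold doc id plus the top-K ids each leg returned on its own; the data below is invented for illustration:

```python
# Recall-proxy sketch: over a labeled eval set, measure how often each
# single leg would have missed the gold document.
# `eval_set` entries are (gold_doc_id, sparse_topk_ids, dense_topk_ids).

def leg_miss_rates(eval_set):
    sparse_miss = dense_miss = 0
    for gold, sparse_ids, dense_ids in eval_set:
        if gold not in sparse_ids:
            sparse_miss += 1
        if gold not in dense_ids:
            dense_miss += 1
    n = len(eval_set)
    return {"sparse_miss": sparse_miss / n, "dense_miss": dense_miss / n}

eval_set = [
    ("d1", ["d1", "d4"], ["d9", "d1"]),  # both legs find it
    ("d2", ["d7", "d8"], ["d2", "d3"]),  # sparse alone would miss
    ("d3", ["d3", "d5"], ["d6", "d7"]),  # dense alone would miss
]
print(leg_miss_rates(eval_set))
```

If either miss rate trends toward zero, that leg is carrying little marginal recall and you can revisit whether you need it; if both are well above zero, the hybrid is earning its keep.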
Cost and performance math that matters
- Storage
- Hybrid means larger indexes. Dense 768-d float16 vectors are ~1.5 KB per chunk. Learned sparse can add several KB. Budget for 2 to 5x growth vs raw text.
- Latency
- Target retrieval under 100 ms p95 with filters. Reranker adds 15 to 60 ms depending on model and batch size. Keep total pre-LLM under 150 ms if your app is interactive.
- Compute
- Rerankers dominate cost at scale. Batch within a single query. Consider a small reranker for online, large one for nightly reindex or precomputation of static ranks.
- Token costs
- Better recall reduces hallucinations and shrinks context. Cutting 1 to 2 chunks per request at GPT-4 class pricing pays for your reranker many times over.
- Multi-tenant isolation
- Cross-tenant recall bugs are expensive. Prefer per-tenant collections or tight filters that the engine can push down to ANN. Sharding by tenant reduces tail latency variance.
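The storage arithmetic above is worth making explicit: a 768-d float16 vector is 768 × 2 bytes = 1,536 bytes, about 1.5 KB per chunk before index overhead. A back-of-envelope sketch; the average chunk size and per-chunk sparse footprint below are assumptions you should replace with your own measurements:

```python
# Back-of-envelope storage math for a hybrid index. avg_chunk_chars and
# sparse_kb_per_chunk are illustrative assumptions, not measured values.

def dense_storage_bytes(n_chunks, dims=768, bytes_per_dim=2):
    # float16 = 2 bytes per dimension; no index overhead included.
    return n_chunks * dims * bytes_per_dim

def growth_factor(n_chunks, avg_chunk_chars=1200, sparse_kb_per_chunk=3.0):
    raw = n_chunks * avg_chunk_chars                 # raw text bytes
    dense = dense_storage_bytes(n_chunks)            # dense vectors
    sparse = n_chunks * sparse_kb_per_chunk * 1024   # learned sparse
    return (raw + dense + sparse) / raw

# 10M chunks: dense vectors alone are ~15.4 GB before index overhead.
print(dense_storage_bytes(10_000_000) / 1e9)
print(round(growth_factor(10_000_000), 2))
```

Under these assumptions a 10M-chunk corpus lands near the top of the 2 to 5x growth band quoted above, which is why learned sparse is a deliberate trade rather than a free upgrade.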
What to choose when
- Pure vector
- Small corpus, low filter complexity, Q&A style queries, fast iteration. Expect some misses on exact strings. Cheap to operate.
- Hybrid sparse + dense with rerank
- Default for enterprise search and production RAG. Handles filters, ACL, jargon, and long-tail. Slightly higher operational overhead, much better reliability.
- Learned sparse or multi-vector
- When you want hybrid behavior with fewer moving parts and can handle larger indexes and more complex query plans.
Key takeaways
- Most production systems need hybrid retrieval. Vector-only is a demo default, not a production default.
- Retrieval quality lives and dies on candidate recall. Rerankers cannot rescue missing candidates.
- Apply filters at retrieval time. Postfiltering ANN is a recall and latency trap.
- Keep chunks reasonable and hierarchical. Over-chunking increases cost and hurts answers.
- Treat embeddings as versioned models. Shadow test updates, measure, then roll.
- Track recall proxies and per-tenant metrics. You need observability to avoid silent regressions.
If you want a sanity check
If you’re sitting on a vector-only stack that works fine in staging and falls apart with real users, you’re not alone. I help teams move to hybrid retrieval without blowing up latency or cost. If you want a quick architecture review or a hands-on fix, reach out. This is exactly the kind of thing I debug when systems start breaking at scale.

