Why Your RAG System Retrieves the Wrong Data (and How to Fix It)

The painful symptom

You ask your help-bot about Product A’s refund policy and it cites Product B. Your sales assistant quotes a deprecated price sheet. Your internal search keeps pulling marketing pages instead of the compliance PDF your auditors care about. The LLM looks guilty, but the culprit is usually upstream: retrieval.

I’ve watched teams ship “RAG 1.0” that looks fine in a demo, then crumble with real traffic. The pattern is consistent: retrieval doesn’t return the right candidates, and the LLM politely hallucinates around the wrong context.

Where this shows up and why

  • Customer support copilots return answers from a sibling product because tenant or product filters are loose.
  • Enterprise search surfaces a glossary entry when the user needs a troubleshooting runbook. Short queries amplify this.
  • Policy and legal answers cite obsolete documents due to stale indexes or weak recency signals.
  • Sales assistants prefer blog posts over price sheets because embeddings overweight narrative prose and you chunked tables badly.

What most teams misunderstand:

  • “Better LLM” does not fix bad retrieval. Garbage in, eloquent garbage out.
  • One embedding model to rule them all is a trap. Domain and format matter.
  • Rerankers cannot rescue missing candidates. If the right chunk never makes top-k, it never gets seen.

How retrieval actually breaks

Think in pipeline terms. A typical RAG stack looks like this:
1) Query comes in
2) Optional query transform or expansion
3) Candidate generation from vector DB and sometimes BM25
4) Reranking and filtering
5) Context assembly
6) Generation
7) Caching, logging, feedback

Failures hide in each step.
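To make the failure points concrete, here is a toy skeleton of those stages wired together. Every function is a deliberately naive stand-in (keyword-overlap retrieval, length-based reranking); none of this is a real implementation, just the shape of the pipeline.

```python
# Toy end-to-end skeleton: each stage is a stub so the wiring is visible.
def transform(query):                      # 2) keep rewrites minimal
    return query.strip().lower()

def retrieve(query, corpus, k=3):          # 3) stand-in for vector/BM25 search
    scored = [(sum(w in doc for w in query.split()), doc) for doc in corpus]
    return [doc for score, doc in sorted(scored, reverse=True)[:k] if score > 0]

def rerank(query, candidates):             # 4) a real system uses a cross-encoder
    return sorted(candidates, key=len)     # placeholder ordering only

def assemble(candidates, budget=200):      # 5) respect the context window
    context, used = [], 0
    for c in candidates:
        if used + len(c) > budget:
            break
        context.append(c)
        used += len(c)
    return "\n".join(context)

def answer(query, corpus):
    q = transform(query)                   # 1-2) query in, optional transform
    context = assemble(rerank(q, retrieve(q, corpus)))
    return context                         # 6) a real system calls llm.generate(q, context)
```

Each stub is a seam where the real system can go wrong, which is the point of the sections below.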

1) Query construction drift

  • Aggressive query rewriting (HyDE, synonym explosion) can push intent off-target. I’ve seen a “pricing” question rewritten into “discount policy,” which flips the result set.
  • Language mismatch. Users type Spanish, your corpus is English, your embedder is multilingual, but you didn’t normalize. Cosine scores look fine; the semantics are not.

2) Embedding mismatch and version drift

  • You reindexed with a new embedding model but forgot to re-embed queries in the same model at runtime. Similarity collapses silently.
  • Mixed normalization. Some data was lowercased and punctuation-stripped, some wasn’t. Tiny but real score shifts that reorder top-10.
  • Wrong similarity metric. Using dot product with a model trained for cosine without normalization changes ranking.
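The metric mismatch is easy to demonstrate. A minimal sketch with two hand-picked 2-D vectors: cosine prefers the well-aligned document, while raw dot product prefers the one with the larger norm, so the two metrics disagree on the ranking.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query     = [1.0, 0.0]
doc_short = [0.9, 0.1]   # well aligned with the query, small norm
doc_long  = [2.0, 2.0]   # less aligned, large norm

# Cosine ranks the aligned doc first; raw dot product ranks the long one first.
print(cosine(query, doc_short) > cosine(query, doc_long))  # True
print(dot(query, doc_short) > dot(query, doc_long))        # False
```

If your embedder was trained for cosine, either normalize vectors at index and query time or configure the index for cosine explicitly.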

3) Chunking and structure loss

  • Blind fixed-size chunks cut headers from content. The chunk that ranks top has no clue it belongs to “Product B Policies.”
  • Tables and code blocks get flattened. Prices or parameters lose headers and units, so lexical and semantic signals weaken.
  • Overlap set too small. You miss cross-sentence semantics that anchor meaning.

4) Metadata and permission filters

  • Filters applied post-retrieval instead of pre-retrieval. You retrieve 100 global items, then drop 95 for permissions, leaving 5 weak matches.
  • Multi-tenant or multi-product scope not embedded into the index or the query filter. Leakage is one bug away.

5) Candidate generation tuning and index health

  • HNSW parameters mis-tuned. ef_search set too low means fast queries but wrong neighbors.
  • Over-compressed vectors with PQ or scalar quantization degrade recall more than your UX can tolerate.
  • Stale replicas. Indexing lag makes new docs invisible for hours. Support teams assume the bot is “lying.”

6) Reranking misuse

  • Cross-encoders truncating long chunks. What they score is the first 512 tokens, not the part with the answer.
  • Reranking from too small a candidate set. If you only take top-10 from ANN then rerank, true positives that were rank 25 never get a chance.
  • No diversity control. You return 8 variants of the same marketing paragraph instead of covering different sections.
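The diversity failure has a standard remedy, Maximal Marginal Relevance: greedily pick the next candidate by trading relevance against redundancy with what is already selected. A minimal sketch, where query_sim and doc_sim are precomputed similarity lookups I'm assuming for illustration, not a library API:

```python
def mmr(query_sim, doc_sim, candidates, k=5, lam=0.7):
    """Greedy MMR. query_sim[d]: relevance of doc d to the query.
    doc_sim[frozenset((a, b))]: similarity between two docs.
    lam near 1.0 favors relevance; lower values favor diversity."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def score(d):
            redundancy = max((doc_sim[frozenset((d, s))] for s in selected),
                             default=0.0)
            return lam * query_sim[d] - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```

With two near-duplicate top candidates, MMR keeps one and reaches for a different section instead of returning both.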

7) Context assembly and caching

  • Picking top-k by score only. You exceed the model’s context with redundant chunks and cut the one section with the answer.
  • Cache poisoning. Query-level caches that ignore tenant or locale echo a previous user’s context into a different session.
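The cache fix is to scope the key by everything that changes the retrieved context, not just the query text. A minimal sketch:

```python
import hashlib

def cache_key(query, tenant_id, locale, index_version):
    # Scope by tenant, locale, and index version, not just the query text,
    # or one tenant's cached context leaks into another tenant's session.
    raw = "|".join([query.strip().lower(), tenant_id, locale, index_version])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Bumping index_version on reindex also gives you cache invalidation for free.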

Practical fixes that actually move the needle

This is the checklist I use when a team calls about “wrong answers.” Start at the top and stop when metrics stabilize.

Quick triage

  • Verify the embedder used for data equals the one used for queries. Log model name and version with every retrieval.
  • Confirm similarity metric and vector normalization settings match the embedder’s training assumptions.
  • Pull 50 real queries. Inspect top-10 candidates with scores and metadata. If it feels wrong to a human, it is.

Structure-aware chunking

  • Chunk by document structure, not character count. Preserve section path like Product -> Policy -> Refunds.
  • Include headings and breadcrumbs with each chunk. Store numeric ranges and table headers explicitly.
  • Use sentence-aware windows with 20 to 30 percent overlap. Avoid splitting entities across chunks.
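A minimal sketch of breadcrumb-carrying chunking. Character windows stand in here for proper sentence-aware splitting; the point is that every chunk carries its section path, so a top-ranked chunk knows it belongs to “Product B > Policies > Refunds”.

```python
def chunk_with_breadcrumbs(sections, max_chars=500, overlap=0.25):
    """Each section is (path, text), e.g. (['Product B', 'Policies', 'Refunds'], ...).
    Emits overlapping chunks prefixed with the section breadcrumb."""
    chunks = []
    for path, text in sections:
        crumb = " > ".join(path)
        step = max(1, int(max_chars * (1 - overlap)))  # ~25% overlap
        for start in range(0, len(text), step):
            body = text[start:start + max_chars]
            chunks.append(f"[{crumb}]\n{body}")
            if start + max_chars >= len(text):
                break
    return chunks
```

In a real pipeline you would also store the breadcrumb as structured metadata, not only inline text, so it can drive filters.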

Get filters right at retrieval time

  • Enforce tenant, product, and permission filters inside the vector query. Do not retrieve globally then filter.
  • Store scope as discrete fields. Avoid fuzzy text filtering for access control.
  • Add unit tests for permission leakage with synthetic tenants and canaries.
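To show the shape of pre-retrieval filtering, here is a brute-force sketch: the scope filter runs before scoring, so top-k is computed within the tenant’s slice. In production your vector DB’s metadata filter clause does this inside the index; this is only the semantics you want it to have.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search_scoped(index, query_vec, tenant_id, k=5):
    # Filter BEFORE scoring: top-k is computed within the tenant's scope,
    # instead of retrieving globally and discarding out-of-scope hits.
    in_scope = (item for item in index if item["tenant"] == tenant_id)
    return sorted(in_scope, key=lambda it: dot(query_vec, it["vec"]),
                  reverse=True)[:k]
```

The leakage unit test then becomes one assertion: no result’s tenant field ever differs from the requested tenant.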

Hybrid retrieval by default, not by faith

  • Combine dense vectors with BM25 or a learned sparse model. Weighting: start 0.6 dense, 0.4 lexical, then tune.
  • Short or head queries often need stronger lexical weight. Long questions lean dense.
  • Use MMR or group-by-document to ensure diversity across candidates.
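A minimal sketch of weighted score fusion under the assumption that both retrievers return doc-to-score maps. Min-max normalization puts dense and lexical scores on the same scale before blending; reciprocal rank fusion is a common alternative that skips normalization entirely.

```python
def normalize(scores):
    # Min-max normalize so dense cosine scores and raw BM25 scores are comparable.
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid(dense, lexical, w_dense=0.6):
    dense, lexical = normalize(dense), normalize(lexical)
    docs = set(dense) | set(lexical)
    fused = {d: w_dense * dense.get(d, 0.0)
                + (1 - w_dense) * lexical.get(d, 0.0)
             for d in docs}
    return sorted(fused, key=fused.get, reverse=True)
```

The 0.6/0.4 starting split from above is the default here; tune w_dense per route against your golden set rather than globally.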

Reranker that fits your domain

  • Rerank top-50 or top-100, not top-10. Measure the latency impact and cut if it hurts SLAs.
  • Truncate smartly. Feed the reranker focused windows around likely answer spans, not raw long chunks.
  • Consider a domain-tuned cross-encoder if you live on tables, code, or legal text. Off-the-shelf models underweight these.
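The smart-truncation bullet can be sketched as windowed reranking: score overlapping windows of each long chunk and let the chunk inherit its best window’s score, so an answer span past the truncation limit still gets seen. The score function here is a stub standing in for a cross-encoder call.

```python
def rerank_long_chunks(query, chunks, score, size=400, stride=200):
    """score(query, passage) stands in for a cross-encoder.
    Overlapping windows keep the answer span from being lost to truncation;
    each chunk is ranked by its best-scoring window."""
    def windows(text):
        if len(text) <= size:
            return [text]
        return [text[i:i + size] for i in range(0, len(text) - stride, stride)]
    best = {c: max(score(query, w) for w in windows(c)) for c in chunks}
    return sorted(chunks, key=best.get, reverse=True)
```

A chunk whose answer sits 700 characters in would score zero under naive truncation; here the later window catches it.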

Keep indexes healthy and consistent

  • Reindex the full corpus when you change embedding models. No partial mixes.
  • Tune HNSW: increase ef_search until recall@20 plateaus under your latency budget. Document the setpoint.
  • Avoid aggressive PQ on critical collections. If you must compress, measure recall drop, not just storage savings.
  • Track indexing lag. Expose last_indexed_at per doc and alert if lag exceeds your freshness SLO.
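The ef_search tuning loop needs a recall metric to plateau against. A minimal recall@k, with the sweep itself sketched in comments since the search call depends on your vector DB:

```python
def recall_at_k(retrieved, relevant, k=20):
    """Fraction of known-relevant chunk IDs present in the top-k retrieved."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

# Hypothetical sweep; search(q, ef) is your ANN query at a given ef_search.
# for ef in (32, 64, 128, 256, 512):
#     r = mean(recall_at_k(search(q, ef), gold[q]) for q in golden_queries)
#     record (ef, r, p95_latency); stop raising ef once r plateaus
#     within the latency budget, and document that setpoint.
```

The same function, run before and after enabling PQ, gives you the recall drop to weigh against the storage savings.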

Query transforms with guardrails

  • Keep expansions minimal. Log original and transformed queries side by side.
  • Entity normalization is usually safe. Full semantic rewriting is where drift appears. Make it opt-in per route.

Observability and evaluation that isolates retrieval

  • Log every retrieval with: query text, query vector hash, index version, embedder version, filters used, top-k IDs and scores.
  • Maintain a golden set of query->relevant chunk IDs sourced from tickets, search logs, and SME annotations.
  • Track retrieval recall@k and NDCG separately from generation metrics. If retrieval dips, stop blaming the LLM.
  • Monitor score distribution shifts. If cosine similarities compress or spread suddenly, something changed in content or embeddings.
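A minimal sketch of the per-retrieval log record with the fields listed above; the version strings in the test are made-up placeholders, not real model names.

```python
import hashlib
import json
import time
from dataclasses import asdict, dataclass, field

@dataclass
class RetrievalLog:
    query: str
    embedder_version: str     # which model embedded the query at runtime
    index_version: str        # which build of the index served it
    filters: dict             # tenant/product/permission scope actually applied
    top_ids: list
    top_scores: list
    ts: float = field(default_factory=time.time)

    @property
    def query_hash(self):
        # Stable hash so queries can be joined across logs without storing PII twice.
        return hashlib.sha256(self.query.encode("utf-8")).hexdigest()[:16]

    def to_json(self):
        return json.dumps({**asdict(self), "query_hash": self.query_hash})
```

With embedder_version and index_version on every record, the “queries embedded with the wrong model” failure from earlier becomes a one-line log query instead of a week of guessing.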

Trade-offs you should decide explicitly

  • Latency vs recall: Higher ef_search and bigger rerank pools improve recall but hit p95. Choose based on the product surface that consumes the answer.
  • Freshness vs cost: Real-time indexing pipelines cost more than nightly batches. If your domain changes hourly, pay it. If not, don’t.
  • Compression vs accuracy: PQ can cut storage 4x and drop recall 3 to 7 points. For policy and legal, that is usually too expensive.

Business impact you can measure

  • Wrong retrieval inflates token spend. The LLM babbles longer to compensate and still gets it wrong.
  • Latency creep. Cranking up k to hide poor recall adds hundreds of milliseconds, then seconds under load.
  • Trust erosion. One cross-tenant leakage or outdated legal answer costs more than any vector DB bill.
  • Scale risks. At 10x documents, your current HNSW settings and reranker throughput may collapse p95 from 800 ms to 3 s. Customers won’t wait.

Key takeaways

  • Retrieval quality is an engineering problem, not a prompt problem.
  • Use structure-aware chunking and strict pre-retrieval filters. Most leaks and misfires die there.
  • Hybrid search with a tuned reranker beats dense-only in real traffic.
  • Version and log everything: embedder, index, filters, top-k. You cannot fix what you cannot see.
  • Reindex on embedder changes. No half-migrations.
  • Set explicit latency and recall targets. Tune to the target, not vibes.

If you need help

If this sounds familiar, you’re not alone. I help teams untangle retrieval pipelines, put real evaluations in place, and hit reliability targets without lighting money on fire. If your RAG stack is giving the wrong answers, this is the kind of work I do when systems start breaking at scale.