{"id":69,"date":"2025-05-08T10:32:18","date_gmt":"2025-05-08T10:32:18","guid":{"rendered":"https:\/\/angirash.in\/blog\/2025\/07\/14\/when-not-to-use-rag\/"},"modified":"2026-04-09T23:27:00","modified_gmt":"2026-04-09T23:27:00","slug":"when-not-to-use-rag","status":"publish","type":"post","link":"https:\/\/angirash.in\/blog\/2025\/05\/08\/when-not-to-use-rag\/","title":{"rendered":"When RAG Makes Your AI Worse: Hard Rules From Production"},"content":{"rendered":"<h2>The trap<\/h2>\n<p>Half the RAG projects I\u2019m asked to review would be simpler, cheaper, and more reliable without a vector index. Teams add retrieval because every diagram on the internet shows one. Then they fight stale chunks, latency blowups, ACL leaks, and \u201cwhy did it cite that PDF from 2021.\u201d<\/p>\n<p>If your goal is a system that ships and stays up, you need a clear bar for when RAG is the right tool and when it is a liability.<\/p>\n<h2>Where this goes wrong<\/h2>\n<ul>\n<li>Internal policy bots that must output the canonical answer. They wire up RAG, then spend months chasing determinism that retrieval cannot deliver.<\/li>\n<li>Analytics Q&amp;A on top of warehouses. They try to retrieve docs about metrics instead of generating and validating SQL. Users get confident lies.<\/li>\n<li>High-churn knowledge bases. Docs change hourly. Indexing pipelines lag. The bot answers with yesterday\u2019s policy and legal gets involved.<\/li>\n<li>Multi-tenant apps with strict ACLs. One tenant\u2019s doc bleeds into another via a shared index. Now you are doing IRM with embeddings.<\/li>\n<li>Latency-sensitive flows. RAG adds 2 to 4 network hops. P95 jumps from 300 ms to 1.2 s and your conversion drops.<\/li>\n<\/ul>\n<p>What most teams misunderstand: retrieval is not a truth machine. It is fuzzy recall glued to a probabilistic reasoner. 
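<\/p>
<p>To make the contrast concrete, here is a minimal sketch of fuzzy recall next to a deterministic lookup. The names (POLICIES, vector_search, get_policy) are illustrative, not from any real library:<\/p>

```python
# Hypothetical sketch: fuzzy recall vs. a deterministic lookup.
# POLICIES, vector_search, and get_policy are illustrative names.

POLICIES = {
    'refund-policy-v3': 'Refunds are issued within 14 days of purchase.',
}

def vector_search(query, k=3):
    # Stand-in for a vector index: returns ranked guesses with scores.
    # There is no guarantee the right chunk is in the top k.
    candidates = [('refund-policy-v3', 0.71), ('shipping-faq-v1', 0.63)]
    return candidates[:k]

def get_policy(policy_id):
    # Deterministic knowledge service: one key, one canonical record.
    # A missing key raises KeyError instead of returning a near miss.
    return {'id': policy_id, 'text': POLICIES[policy_id]}

record = get_policy('refund-policy-v3')
print(record['id'])  # the id you cite in the final answer
```

<p>Only the second call is something legal can sign off on. 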
If you need guaranteed correctness, you need something else in the loop.<\/p>\n<h2>Technical deep dive: when not to use RAG<\/h2>\n<h3>1) You need deterministic, auditable answers<\/h3>\n<p>Pricing, SLAs, refund rules, regulated disclosures. If legal or finance must sign off on the exact wording, do not let retrieval pick sources. Use a knowledge service that returns the canonical record and let the LLM only shape the language around it.<\/p>\n<p>Pattern: tool call to policy-service -> structured record -> templated generation with citation to the record\u2019s id. No vector DB.<\/p>\n<h3>2) The task is over structured data, not documents<\/h3>\n<p>\u201cRevenue by region last quarter,\u201d \u201cTop 5 churn reasons,\u201d \u201cCompare plan features.\u201d RAG over docs about the data is the wrong abstraction. You want SQL generation with schema-aware planning, guardrails, and result validation.<\/p>\n<p>Pattern: intent router -> SQL agent -> execution sandbox -> post-hoc checks -> final natural language summary. Retrieval is optional for examples, not the backbone.<\/p>\n<h3>3) You must aggregate or join at scale<\/h3>\n<p>Anything that requires combining many rows or sources will blow past retrieval recall. Pulling 3 to 10 snippets from a vector index will not cover a join across millions of rows. Use computation, not recall.<\/p>\n<h3>4) Small, stable corpus with high-precision needs<\/h3>\n<p>If you have a few thousand FAQs or policies and they rarely change, lexical search with a strong re-ranker is faster and more precise. For exact strings, acronyms, and codes, BM25 plus a cross-encoder re-ranker beats embeddings more often than people admit.<\/p>\n<h3>5) Freshness SLA under a few minutes<\/h3>\n<p>If your content churns and updates must reflect within minutes, your embedding pipeline becomes the bottleneck. You will either return stale answers or eat higher costs to over-index. 
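<\/p>
<p>A sketch of the staleness predicate your pipeline is implicitly racing against (the field names and the 5-minute SLA are illustrative):<\/p>

```python
# Hypothetical staleness check for an indexed chunk. The field names
# (doc_updated_at, chunk_embedded_at) and the SLA value are illustrative.
from datetime import datetime, timedelta

FRESHNESS_SLA = timedelta(minutes=5)

def is_stale(doc_updated_at, chunk_embedded_at, now):
    # Stale = the source doc changed after the chunk was embedded,
    # and that change has been outstanding longer than the SLA allows.
    changed_since_embed = doc_updated_at > chunk_embedded_at
    return changed_since_embed and (now - doc_updated_at) > FRESHNESS_SLA

now = datetime(2025, 5, 8, 12, 0)
# Doc changed 30 minutes ago, embedded 2 hours ago: stale.
assert is_stale(now - timedelta(minutes=30), now - timedelta(hours=2), now)
# Doc has not changed since it was embedded: fresh.
assert not is_stale(now - timedelta(hours=3), now - timedelta(hours=2), now)
```

<p>If that predicate fires for a meaningful share of queries, tuning chunking will not save you. 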
Prefer direct API reads from the system of record.<\/p>\n<h3>6) Complex access control and multi-tenancy<\/h3>\n<p>Per-document ACLs, row-level security, data residency, or regulated tenants. A shared vector index with filter queries looks fine in a deck and fails in audits. Even if filters work logically, snippet leakage through paraphrase is real. Build per-tenant indices or skip retrieval.<\/p>\n<h3>7) Jargon-heavy or multilingual domains without adaptation<\/h3>\n<p>Generic embeddings miss domain synonyms and code words. You will spend weeks tuning chunking and still get off-target context. Either adapt retrieval to the domain with sparse methods or avoid retrieval entirely.<\/p>\n<h3>8) Tight latency budgets<\/h3>\n<p>Under 300 ms P95 end to end, RAG usually does not fit. You pay for: routing, retrieval, re-ranking, LLM. With network hops, you are done before the model even starts. Use cached answers, deterministic tools, or on-box models with local stores.<\/p>\n<h3>9) Strong audit and citation guarantees<\/h3>\n<p>If you must prove why the model answered a certain way, relying on fuzzy semantic matches is risky. Hard-link to record ids or versioned policy objects. Put those ids in the final answer.<\/p>\n<h3>10) Privacy constraints on embeddings<\/h3>\n<p>If you cannot embed PII or sensitive text, and redaction is non-trivial, retrieval becomes brittle or non-compliant. Do not push raw secrets into a vector DB to \u201cdeal with later.\u201d<\/p>\n<h2>Architecture-level alternatives that actually work<\/h2>\n<h3>A. Deterministic knowledge services<\/h3>\n<ul>\n<li>What: A microservice that returns canonical answers or records by key, with versioning.<\/li>\n<li>How it looks: LLM -> tool call get_policy(policy_id) -> record -> template render -> minimal generation for tone.<\/li>\n<li>Pros: Deterministic, auditable, fast.<\/li>\n<li>Cons: You need to define keys and keep the service current. Not a doc dump.<\/li>\n<\/ul>\n<h3>B. 
SQL or DSL agents for analytics<\/h3>\n<ul>\n<li>What: LLM generates queries against a governed schema, with static analysis and execution guardrails.<\/li>\n<li>Pros: Correct by construction if you validate and type-check. Scales with data size.<\/li>\n<li>Cons: More upfront work on schema curation and sandboxing.<\/li>\n<\/ul>\n<h3>C. Lexical + re-ranker search<\/h3>\n<ul>\n<li>What: BM25 or SPLADE for recall, cross-encoder for top-k re-ranking, no generation or minimal generation.<\/li>\n<li>Pros: Strong on exact terms, acronyms, and short answers. Cheap and fast.<\/li>\n<li>Cons: Weaker on paraphrase-heavy queries.<\/li>\n<\/ul>\n<h3>D. Precomputed answer library with TTL<\/h3>\n<ul>\n<li>What: Normalize common questions, generate vetted answers offline, key them by normalized query and tenant, and serve from cache.<\/li>\n<li>Pros: Sub-100 ms responses, consistent language.<\/li>\n<li>Cons: Coverage limited to the head queries. Needs curation.<\/li>\n<\/ul>\n<h3>E. Knowledge graph or keyed entities<\/h3>\n<ul>\n<li>What: Entities, relations, and attributes in a graph or document store. The model queries by id, not by fuzz.<\/li>\n<li>Pros: Traceable, great for regulated domains.<\/li>\n<li>Cons: You must model the world. Worth it when stakes are high.<\/li>\n<\/ul>\n<h3>F. Light fine-tuning for style, not facts<\/h3>\n<ul>\n<li>What: Fine-tune to reduce prompt bloat and enforce format. Keep facts out of the model, keep them in systems of record or tools.<\/li>\n<li>Pros: Lower inference tokens, more consistent outputs.<\/li>\n<li>Cons: Needs a data pipeline. Does not replace live data.<\/li>\n<\/ul>\n<h2>Failure modes I see most often with RAG<\/h2>\n<ul>\n<li>Chunking mania: teams chase chunk size and overlap instead of fixing document structure. Garbage in, garbage out.<\/li>\n<li>Embedding drift: upgrade the embedding model and recall quietly drops. 
Now your evaluation suite is lying to you.<\/li>\n<li>ACL filters are leaky: tenant filters fail at edges or during backfills. One bad join, one leaked doc.<\/li>\n<li>Index sprawl: per-tenant indices explode infrastructure and ops. Re-embedding becomes a monthly tax.<\/li>\n<li>Latency cliffs: adding a re-ranker and citations doubles P95. Product teams hide the spinner and call it done.<\/li>\n<\/ul>\n<h2>Practical decision rules<\/h2>\n<p>Use this as a rough router before you start building.<\/p>\n<ul>\n<li>Is there a single source of truth with a stable key? Use a deterministic tool or service. No RAG.<\/li>\n<li>Does the task involve numeric computation, aggregation, or joins? Use SQL or a DSL agent. No RAG.<\/li>\n<li>Is the corpus under 10k docs with low churn, and does precision matter? Use lexical + re-ranker. No RAG.<\/li>\n<li>Do you need sub-300 ms P95? Prefer cached answers, tools, or on-box models. Avoid multi-hop RAG.<\/li>\n<li>Are there strict ACLs or regulated tenants? Per-tenant indices or deterministic sources. Default to no shared RAG.<\/li>\n<li>Do you need fresh answers under 5 minutes and documents change often? 
Hit systems of record directly.<\/li>\n<li>Everything else, especially long-form synthesis across messy content: consider RAG, but prove it with evals before rollout.<\/li>\n<\/ul>\n<p>A concrete router pattern I use:<\/p>\n<ul>\n<li>Step 1: Intent classifier with a tiny model: policy_lookup, data_analytics, faq_search, synthesis.<\/li>\n<li>Step 2: Route\n<ul>\n<li>policy_lookup -> policy-service tool -> templated answer<\/li>\n<li>data_analytics -> SQL agent -> validated query -> summary<\/li>\n<li>faq_search -> lexical + re-rank -> optional small LLM polish<\/li>\n<li>synthesis -> RAG with tight retrieval budget, per-tenant isolation, and offline evals<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>Business impact you can forecast<\/h2>\n<ul>\n<li>Cost: For many teams, embedding and index maintenance ends up 30 to 60 percent of total LLM spend once you scale beyond a few million tokens per day of new content. Add storage, re-index jobs, and re-embeddings when models change.<\/li>\n<li>Latency: Each retrieval hop adds a round trip and token overhead. Expect 300 to 800 ms extra at P95 for naive RAG. Re-rankers and citation extraction add more.<\/li>\n<li>Risk: ACL mistakes with RAG are near the top of my incident list. One leaked snippet can burn months of trust. Deterministic services are boring, and boring keeps customers.<\/li>\n<li>Team velocity: RAG pushes you into IR, data modeling, and ops all at once. If what you needed was a policy bot, you just took on a search platform problem.<\/li>\n<\/ul>\n<h2>Key takeaways<\/h2>\n<ul>\n<li>If you can get the answer from a key, an API, or a query, do that. Skip retrieval.<\/li>\n<li>Retrieval is for messy synthesis across unstructured text, not for truth or aggregates.<\/li>\n<li>Lexical + re-rank is underrated. 
Try it before spinning up a vector stack.<\/li>\n<li>Multi-tenant and regulated environments should default to deterministic sources.<\/li>\n<li>Freshness and latency SLAs often rule out RAG before you write code.<\/li>\n<li>Build a router. One size does not fit all queries.<\/li>\n<\/ul>\n<h2>If this sounds familiar<\/h2>\n<p>If you\u2019re staring at a RAG diagram and your requirements look like determinism, ACLs, or sub-300 ms, you probably do not want RAG at the core. I help teams replace fragile retrieval with simpler, testable architecture, or put RAG only where it pulls its weight. If your system is wobbling at scale, this is exactly the kind of thing I fix.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The trap Half the RAG projects I\u2019m asked to review would be simpler, cheaper, and more reliable without a vector index. Teams add retrieval because every diagram on the internet&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[3],"tags":[17,15,13],"class_list":["post-69","post","type-post","status-publish","format-standard","hentry","category-ai-architecture","tag-ai-cost","tag-ai-system-design","tag-rag"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/69","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/comments?post=69"}],"version-history":[{"count":1,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/69\/revisions"}],"predecess
or-version":[{"id":86,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/69\/revisions\/86"}],"wp:attachment":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/media?parent=69"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/categories?post=69"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/tags?post=69"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}