{"id":111,"date":"2025-07-14T10:42:31","date_gmt":"2025-07-14T10:42:31","guid":{"rendered":"https:\/\/angirash.in\/blog\/2025\/07\/14\/more-data-wont-fix-your-ai-system\/"},"modified":"2025-07-14T10:42:31","modified_gmt":"2025-07-14T10:42:31","slug":"more-data-wont-fix-your-ai-system","status":"publish","type":"post","link":"https:\/\/angirash.in\/blog\/2025\/07\/14\/more-data-wont-fix-your-ai-system\/","title":{"rendered":"More Data Won\u2019t Fix Your AI System"},"content":{"rendered":"<h2>The common failure mode: \u201clet\u2019s just add more data\u201d<\/h2>\n<p>I see this play out every quarter. Metrics flatten, users complain about wrong answers, latency creeps up. Someone proposes a fix that feels obvious: add more data. More docs in the vector store. More synthetic fine-tune samples. More logs in the training set. Then things get slower, costlier, and only marginally better. Sometimes the system gets worse.<\/p>\n<p>If your retrieval or model is mis-specified, more data increases entropy. You raise the ceiling on noise, not signal.<\/p>\n<h2>Where this shows up and why it happens<\/h2>\n<ul>\n<li>RAG that keeps getting dumber: You add 10x more documents to \u201ccover edge cases.\u201d Top-k precision drops, context windows fill with near-duplicates, and the model hallucinates by stitching together the wrong snippets.<\/li>\n<li>Fine-tunes that regress: Teams pile synthetic Q&amp;A on top of mixed-quality real data. Loss goes down. Real user tasks degrade because the model learned to win benchmarks you don\u2019t care about.<\/li>\n<li>Agents that thrash: Tool-use logs grow, but you never label failure modes. The planner gets indecisive, tries more tools, and blows past latency SLOs.<\/li>\n<\/ul>\n<p>Why it happens in real systems:<\/p>\n<ul>\n<li>Misaligned objective: Your system is judged on grounded task success, but you optimize recall@k or training loss.<\/li>\n<li>Uncurated heterogeneity: Conflicting versions of the same policy doc, regional variations, stale content. 
More data increases contradictions.<\/li>\n<li>Retrieval saturation: ANN indices and naive chunking turn your relevant set into a soup. More tokens in, lower attention density on the right facts.<\/li>\n<li>No eval slices: You measure one global metric. The long tail fails silently and grows with data volume.<\/li>\n<\/ul>\n<p>What teams misunderstand:<\/p>\n<ul>\n<li>Volume is not coverage. Without routing and slicing, you increase collision rates between similar-but-not-equivalent knowledge.<\/li>\n<li>Context is a scarce resource. Every extra token carries an opportunity cost.<\/li>\n<li>Indexes are alive. Cardinality changes alter recall\/precision trade-offs and require re-tuning, not just re-indexing.<\/li>\n<\/ul>\n<h2>Technical deep dive: where the architecture bites you<\/h2>\n<h3>RAG pipelines<\/h3>\n<ul>\n<li>Chunking and overlap\n<ul>\n<li>Symptom: High recall, low answer consistency.<\/li>\n<li>Cause: Token-based chunking ignores document structure. Overlap bleeds context and creates near-duplicate embeddings.<\/li>\n<li>Effect: ANN brings back multiple almost-identical chunks. The reranker has little discriminative signal.<\/li>\n<\/ul>\n<\/li>\n<li>ANN index behavior under growth\n<ul>\n<li>HNSW and IVF parameters (in FAISS, Milvus, or similar) tuned for 5M vectors break at 50M. You get lower recall, higher query time, and a different recall-latency curve. Teams rarely retune efSearch, nprobe, or PQ config when cardinality changes.<\/li>\n<\/ul>\n<\/li>\n<li>Cross-encoder rerankers\n<ul>\n<li>Missing or misconfigured. When k grows from 5 to 50, reranking is no longer optional. Without it, precision@5 craters.<\/li>\n<\/ul>\n<\/li>\n<li>Context window dilution\n<ul>\n<li>Stuffing the top 20 chunks looks helpful. 
In practice, the LLM spreads attention across redundant snippets and misses the key constraint in chunk 3.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Fine-tuning and instruction data<\/h3>\n<ul>\n<li>Synthetic feedback loops\n<ul>\n<li>You generate synthetic Q&amp;A from your own model or a sibling. Entropy collapses. The model learns your templated style and gets worse on messy real queries.<\/li>\n<\/ul>\n<\/li>\n<li>Conflicting supervision\n<ul>\n<li>Multiple labelers or policy versions. The model learns to hedge or produce averaged answers. Users hate averaged answers.<\/li>\n<\/ul>\n<\/li>\n<li>Domain routing vs monolith\n<ul>\n<li>One big model trained on everything underperforms two smaller models with clear domain boundaries and routers.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Tool-use and agents<\/h3>\n<ul>\n<li>More logs, same blind spots\n<ul>\n<li>Without labeled failure traces (wrong tool, wrong param, wrong stop condition), adding logs only teaches the planner to try more things.<\/li>\n<\/ul>\n<\/li>\n<li>Latency cascade\n<ul>\n<li>More knowledge sources mean more tool calls. If you don\u2019t add a budgeted planner with early-exit criteria, P95 latency doubles.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>Practical fixes that usually beat adding data<\/h2>\n<h3>1) Build a ruthless eval harness before touching the corpus<\/h3>\n<ul>\n<li>Define 30 to 200 gold tasks that mirror real tickets. Include hard negatives and ambiguous queries.<\/li>\n<li>Slice by scenario: product area, region, policy version, query type. Track per-slice precision, groundedness, and latency.<\/li>\n<li>Add regression gates in CI: if precision@5 on \u201cpricing-EMEA\u201d drops 3 points, the deploy blocks.<\/li>\n<\/ul>\n<h3>2) Clean and structure your knowledge, not just expand it<\/h3>\n<ul>\n<li>Canonicalization\n<ul>\n<li>Pick a source of truth per topic. 
Mark older versions as deprecated, exclude them from retrieval, or downweight them via metadata.<\/li>\n<\/ul>\n<\/li>\n<li>Dedup and near-dedup\n<ul>\n<li>Use MinHash or SimHash to collapse near-duplicates (80 to 95 percent similar) into a single canonical chunk with pointers.<\/li>\n<\/ul>\n<\/li>\n<li>Structure-aware chunking\n<ul>\n<li>Split by headings, sections, and list boundaries instead of naive token windows. Keep overlap minimal and purposeful. Aim for 300 to 600 tokens per chunk for most LLMs.<\/li>\n<\/ul>\n<\/li>\n<li>Metadata-first retrieval\n<ul>\n<li>Add hard filters (version, region, product) before vector search. Hybrid search (BM25 + vectors) plus a cross-encoder reranker beats 10x more vectors every time.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>3) Optimize the retrieval stack<\/h3>\n<ul>\n<li>Re-tune the index when cardinality changes\n<ul>\n<li>For FAISS IVF: revisit nlist, nprobe. For HNSW: efConstruction and efSearch should scale with corpus size.<\/li>\n<\/ul>\n<\/li>\n<li>Rerank aggressively\n<ul>\n<li>Use a small cross-encoder reranker on the top 50 candidates. It often gives a 5 to 15 point precision@5 lift at a tiny fraction of LLM cost.<\/li>\n<\/ul>\n<\/li>\n<li>Query rewriting\n<ul>\n<li>Normalize user queries with term expansion and entity linking. Better queries are cheaper than bigger corpora.<\/li>\n<\/ul>\n<\/li>\n<li>Budget the context\n<ul>\n<li>Cap total retrieved tokens. Promote diversity over redundancy. If 3 chunks are near-identical, keep 1.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>4) For fine-tuning, fix supervision before adding examples<\/h3>\n<ul>\n<li>Gold set hygiene\n<ul>\n<li>One policy version. One style guide. Annotate with rationales. Throw out ambiguous items.<\/li>\n<\/ul>\n<\/li>\n<li>Curriculum and weighting\n<ul>\n<li>Upweight rare but critical scenarios. Don\u2019t let common easy tasks dominate.<\/li>\n<\/ul>\n<\/li>\n<li>Cap synthetic data ratio\n<ul>\n<li>Keep synthetic data under 20 to 30 percent unless you can prove gains on real slices. 
Use a different model family to generate synthetic data, to avoid copying your own biases.<\/li>\n<\/ul>\n<\/li>\n<li>Route, don\u2019t bloat\n<ul>\n<li>Use a lightweight router with small, domain-specialized models. This often beats a single bigger fine-tune on both latency and accuracy.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>5) Introduce budgeted planning for agents<\/h3>\n<ul>\n<li>Set max tool calls and an early-exit policy based on confidence and coverage.<\/li>\n<li>Penalize redundant tools in the planner reward. Cache intermediate results.<\/li>\n<\/ul>\n<h3>6) Observability and feedback<\/h3>\n<ul>\n<li>Log retrieval hits with document IDs, versions, and confidence. Tie every answer to the evidence used.<\/li>\n<li>Collect user corrections with minimal friction. Feed only high-quality corrections back, with review.<\/li>\n<\/ul>\n<h2>Business impact you can actually feel<\/h2>\n<ul>\n<li>Cost\n<ul>\n<li>Dumping 10x more documents into a vector DB grows storage and ANN memory linearly, and often increases query cost 1.5 to 3x if you chase lost recall with higher efSearch or nprobe. On the LLM side, context bloat adds direct token costs. I\u2019ve seen teams pay 40 percent more per request with no accuracy gain.<\/li>\n<\/ul>\n<\/li>\n<li>Latency\n<ul>\n<li>More candidates and tools stretch P95. Add 200 to 400 ms for reranking, plus 300 to 800 ms for extra retrieval passes if you\u2019re not careful. Users feel that.<\/li>\n<\/ul>\n<\/li>\n<li>Reliability\n<ul>\n<li>Contradictory docs make answers unstable across days. Ops teams get paged for \u201cmodel regressions\u201d that are really data drift.<\/li>\n<\/ul>\n<\/li>\n<li>Scale risk\n<ul>\n<li>The bigger the corpus, the larger the compliance surface. Without canonicalization and versioning, you risk outdated or non-compliant guidance surfacing to customers.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>Key takeaways<\/h2>\n<ul>\n<li>If precision is bad, more data makes it noisier. 
Fix retrieval and structure first.<\/li>\n<li>Re-tune ANN and rerank once corpus size changes. Don\u2019t just re-index and hope.<\/li>\n<li>Treat context as a budget. Diversity beats redundancy.<\/li>\n<li>Supervision quality beats volume for fine-tuning. Route domains instead of building one giant model.<\/li>\n<li>Add eval slices and hard regression gates. Global metrics hide your real failures.<\/li>\n<li>Clean data saves more money than bigger clusters. Canonicalize, dedup, version.<\/li>\n<\/ul>\n<h2>If you\u2019re stuck<\/h2>\n<p>If this sounds familiar, you\u2019re not alone. I help teams cut through the data sprawl, get precision back, and reduce cost without endless re-architecting. If your metrics flattened after a data dump, or your vector DB bill is climbing faster than accuracy, this is the kind of problem I fix when systems start breaking at scale.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The common failure mode: \u201clet\u2019s just add more data\u201d I see this play out every quarter. Metrics flatten, users complain about wrong answers, latency creeps up. 
Someone proposes a fix&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[12],"tags":[17,26,15],"class_list":["post-111","post","type-post","status-publish","format-standard","hentry","category-ai-failures","tag-ai-cost","tag-ai-eval","tag-ai-system-design"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/111","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/comments?post=111"}],"version-history":[{"count":0,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/111\/revisions"}],"wp:attachment":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/media?parent=111"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/categories?post=111"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/tags?post=111"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}