{"id":67,"date":"2025-05-14T10:32:21","date_gmt":"2025-05-14T10:32:21","guid":{"rendered":"https:\/\/angirash.in\/blog\/2025\/07\/14\/embedding-quality-over-model-choice\/"},"modified":"2026-04-09T23:26:46","modified_gmt":"2026-04-09T23:26:46","slug":"embedding-quality-over-model-choice","status":"publish","type":"post","link":"https:\/\/angirash.in\/blog\/2025\/05\/14\/embedding-quality-over-model-choice\/","title":{"rendered":"Stop blaming the LLM: embedding quality beats model choice in RAG"},"content":{"rendered":"<h2>The uncomfortable pattern<\/h2>\n<p>I keep seeing teams swap GPT-X for GPT-Y, layer on prompt hacks, then wonder why answers are still off. The chat UI is polished. The model is expensive. The retrieval still surfaces a PDF cover page and a Jira footer. Users say it hallucinates. What\u2019s actually broken is the embedding layer and everything around it.<\/p>\n<p>If your retrieval is noisy or shallow, no model on top will save you. The LLM only reasons over what you feed it. Feed it the wrong context and you\u2019ll get confident nonsense.<\/p>\n<h2>Where this shows up and why<\/h2>\n<ul>\n<li>Enterprise RAG chat: queries return marketing pages instead of the policy appendix the auditor cares about.<\/li>\n<li>Search: close-enough synonyms drown out exact matches with critical numbers and IDs.<\/li>\n<li>Agents and tools: the planner picks the wrong function because semantically-similar names collapse.<\/li>\n<li>Clustering\/dedup: near-duplicates stay in, subtle variants drop out.<\/li>\n<\/ul>\n<p>Why it happens in real systems:<\/p>\n<ul>\n<li>Teams treat embeddings as a checkbox. Pick whatever the SDK defaults to, call it done.<\/li>\n<li>Chunking is arbitrary. 200 tokens everywhere, ignoring structure, tables, or code.<\/li>\n<li>Index settings are copy-paste. Cosine vs dot, normalization, HNSW params, quantization all left at defaults.<\/li>\n<li>No retrieval eval set. 
Everyone optimizes for vibe, not recall or ranking quality.<\/li>\n<li>Domain drift. Embeddings trained on web English get dropped into legal, biotech, or multilingual corpora without adaptation.<\/li>\n<\/ul>\n<p>The misunderstanding: people think \u201ca better LLM\u201d fixes retrieval errors. In practice, retrieval quality controls your ceiling; the LLM just decides how close you get to it.<\/p>\n<h2>Technical deep dive: retrieval is the gatekeeper<\/h2>\n<p>Think about the pipeline, not just the model:<\/p>\n<ul>\n<li>Representation: how you turn text, tables, code, and metadata into vectors and sparse signals<\/li>\n<li>Retrieval: how you search (dense, sparse, hybrid), ANN index, filters<\/li>\n<li>Ranking: rerankers, cross-encoders, thresholding<\/li>\n<li>Assembly: how you stitch context for the LLM with minimal noise<\/li>\n<\/ul>\n<h3>Representation details that matter<\/h3>\n<ul>\n<li>Model fit to domain: general embeddings often collapse on acronyms, SKUs, part numbers, and code tokens. If your queries contain IDs and numbers, you need hybrid sparse+dense or a model that was trained to respect exact terms.<\/li>\n<li>Dimension and precision: 1536-d float32 sounds safe until your index hits millions of chunks. Quantization (int8, PQ) can work, but test whether it hurts nearest neighbors on your eval set. Sometimes a well-tuned 768-d model beats a sloppy 1536-d one.<\/li>\n<li>Normalization and similarity: cosine assumes unit vectors. If your vendor returns non-normalized vectors and you use dot product, your recall shifts. I\u2019ve seen 10+ point recall@5 swings from this alone.<\/li>\n<li>Multi-vector representations: late-interaction models (e.g., ColBERT-like) can help with long docs and phrase-level matching, at the cost of index size and latency. Worth it when paragraphs matter more than whole-doc semantics.<\/li>\n<li>Structure-aware chunking: split by headings, list items, code blocks, or table rows. 
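<p>A heading-aware splitter is small enough to sketch inline. This toy version (markdown-style headings assumed) splits at heading boundaries first and only falls back to fixed windows inside an oversized section; a production splitter would also respect code fences and tables:<\/p>

```python
import re

def split_by_headings(text, max_chars=800):
    """Split markdown-ish text at headings, then window oversized sections.

    Illustrative sketch: splits on lines starting with '#' so each chunk
    stays inside one section instead of straddling a boundary.
    """
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for sec in sections:
        sec = sec.strip()
        if not sec:
            continue
        if len(sec) <= max_chars:
            chunks.append(sec)
        else:
            # Fall back to fixed windows with roughly 15% overlap,
            # but only within a single section.
            step = max_chars - max_chars // 7
            for i in range(0, len(sec), step):
                chunks.append(sec[i:i + max_chars])
    return chunks

doc = "# Policy\nRefunds within 30 days.\n\n# Footer\nCopyright."
assert split_by_headings(doc) == ["# Policy\nRefunds within 30 days.", "# Footer\nCopyright."]
```

<p>The payoff in the example above: the policy text and the footer land in separate chunks, so the footer never rides along into the context window.<\/p>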
Chunk windows with overlap are fine, but preserving boundaries avoids mixing policies with footers.<\/li>\n<\/ul>\n<h3>Retrieval and ranking trade-offs<\/h3>\n<ul>\n<li>Dense only: great for semantics, weak on exact numbers and rare tokens. Fast and simple. Often overused.<\/li>\n<li>Sparse (BM25): great for exact IDs, dates, and rare terms. Poor on paraphrase.<\/li>\n<li>Hybrid: weighted sum or cascaded search. Usually the default for enterprise. Costs a bit more infra and tuning, saves you end-user pain.<\/li>\n<li>ANN index params: HNSW M\/efConstruction\/efSearch change the recall\/latency curve. Cranking efSearch from 64 to 200 may buy you 3\u20135 recall points with a 1.5\u20132x latency hit.<\/li>\n<li>Rerankers: cross-encoders on top-50 or top-100 can clean noise. Calibrate k; 10 is often too low if your index has duplicates and near-duplicates.<\/li>\n<\/ul>\n<h3>Failure modes that look like model bugs but aren\u2019t<\/h3>\n<ul>\n<li>Context stuffed with near-duplicates. The LLM votes with volume.<\/li>\n<li>Numeric mismatch. The right doc is at rank 22 because the embedding ignored \u201cSection 4.3.7\u201d while a fluffy overview sat at rank 2.<\/li>\n<li>Multilingual bleed. English query pulls Spanish docs because your embedding space is not language-aware or your top-k is too low.<\/li>\n<li>ACL leaks. Filter-first retrieval was not enforced; embeddings pulled private data into shared answers.<\/li>\n<\/ul>\n<h2>Practical fixes that work<\/h2>\n<h3>1) Build a retrieval eval set before you tune prompts<\/h3>\n<ul>\n<li>Collect 200\u2013500 real user queries, each with 1\u20133 gold passages and hard negatives.<\/li>\n<li>Track recall@k, nDCG@k, and MRR. 
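<p>Once you have gold passages per query, these metrics are a few lines of code. A minimal sketch (query strings, rankings, and ids below are made up):<\/p>

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of gold passages that appear in the top-k results."""
    return sum(1 for g in gold_ids if g in ranked_ids[:k]) / len(gold_ids)

def mrr(ranked_ids, gold_ids):
    """Reciprocal rank of the first gold passage; 0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0

# Toy eval set: query -> (system ranking, gold passage ids).
eval_set = {
    "refund policy": (["d7", "d2", "d9", "d1"], {"d2", "d1"}),
    "sla uptime": (["d4", "d8", "d3"], {"d3"}),
}
scores = [recall_at_k(r, g, k=3) for r, g in eval_set.values()]
print(sum(scores) / len(scores))  # prints 0.75 (mean recall@3)
```

<p>Wire this into CI so any change to chunking, models, or index params reports the same numbers on the same queries.<\/p>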
If recall@20 is under 80% for your core workflows, your LLM won\u2019t save you.<\/li>\n<li>Re-run this suite anytime you change embedding models, chunking, index settings, or reranking.<\/li>\n<\/ul>\n<h3>2) Test multiple embedding models on your data, not theirs<\/h3>\n<ul>\n<li>Compare at fixed k, same chunking, same ANN params. Vendors\u2019 leaderboards rarely mirror your corpus.<\/li>\n<li>If two models are within 1\u20132 points, take the smaller, cheaper one. Bank the savings for reranking or hybrid search.<\/li>\n<li>Check failure slices: numerics, code, acronyms, multilingual, long-tail entities.<\/li>\n<\/ul>\n<h3>3) Use hybrid retrieval as the default<\/h3>\n<ul>\n<li>Dense + BM25 with calibrated weights or a two-stage cascade.<\/li>\n<li>Pin exact terms: if the query has an ID, enforce a sparse top result unless the dense score is extremely confident.<\/li>\n<li>Add time decay when recency matters. Store and filter by timestamps before vector search when you can.<\/li>\n<\/ul>\n<h3>4) Chunk with intent, not with a ruler<\/h3>\n<ul>\n<li>Respect document structure. Use heading-aware or semantic splitters. For code, split by functions or AST nodes. For tables, row-level or section-level chunks.<\/li>\n<li>Keep overlap small and meaningful. 10\u201320% is usually enough.<\/li>\n<li>Store parent-child links so rerankers and the LLM can pull context from siblings.<\/li>\n<\/ul>\n<h3>5) Rerank aggressively but cheaply<\/h3>\n<ul>\n<li>Cross-encode top-50 or top-100. It is often the highest ROI spend after hybrid.<\/li>\n<li>Calibrate a minimum relevance threshold. If nothing clears it, say \u201cno strong match\u201d and ask a follow-up. Hallucinations drop.<\/li>\n<\/ul>\n<h3>6) Light domain adaptation beats heavy fine-tunes<\/h3>\n<ul>\n<li>Contrastive tuning with 5\u201320k positive\/negative pairs can give you 5\u201315 recall points. 
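<p>Pair construction is most of the work. A minimal sketch of mining hard negatives by lexical overlap with the query (a cheap stand-in for the BM25- or retriever-based mining a real pipeline would use; the corpus and ids are made up):<\/p>

```python
def mine_hard_negatives(query, gold_id, corpus, n=2):
    """Pick non-gold passages that share the most terms with the query.

    Toy heuristic: real pipelines usually take the production
    retriever's own top non-gold results as hard negatives.
    """
    q_terms = set(query.lower().split())
    scored = []
    for doc_id, text in corpus.items():
        if doc_id == gold_id:
            continue
        overlap = len(q_terms & set(text.lower().split()))
        scored.append((overlap, doc_id))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:n]]

corpus = {
    "d1": "refund policy: refunds within 30 days of purchase",
    "d2": "refund requests require the original receipt",
    "d3": "shipping times vary by region",
}
# (query, positive, hard negatives) triples feed a contrastive loss.
triple = ("refund policy", "d1", mine_hard_negatives("refund policy", "d1", corpus))
```

<p>The negatives that look almost right, like d2 here, are the ones that teach the embedding to separate near-misses from gold.<\/p>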
<\/li>\n<li>Keep a generalist fallback index for OOD queries to avoid overfitting.<\/li>\n<li>Retrain on a schedule using click and citation feedback, but gate it behind your eval.<\/li>\n<\/ul>\n<h3>7) Get the index hygiene right<\/h3>\n<ul>\n<li>Dedupe with MinHash or cosine thresholding before indexing. Reduce clutter.<\/li>\n<li>Normalize units and terms upfront. \u201cMbps\u201d vs \u201cMb\/s\u201d should not split.<\/li>\n<li>Watch vector norms and distribution shifts. Sudden drift signals upstream parsing or model changes.<\/li>\n<\/ul>\n<h3>8) Engineer for latency and cost without killing recall<\/h3>\n<ul>\n<li>Tune HNSW efSearch to the knee of your curve. Start with 64\u2013128 and measure.<\/li>\n<li>Quantize only after you know your recall hit. Try int8 product quantization and compare nDCG.<\/li>\n<li>Cache top queries and their top-k ids. Warm frequently-hit tenants separately.<\/li>\n<li>If you must cut tokens, improve retrieval so you can feed fewer but higher-quality chunks.<\/li>\n<\/ul>\n<h2>Business impact you can actually feel<\/h2>\n<ul>\n<li>One enterprise knowledge bot: moving from a generic embedding + dense-only to a tuned hybrid + cross-encoder improved top-3 recall from 58% to 86%. Answer accuracy went up 12 points. We then downsized the LLM tier and cut latency by 35% while saving ~40% on monthly inference.<\/li>\n<li>A developer support assistant: structure-aware chunking for code and API docs plus numeric-aware reranking halved hallucination tickets without touching the chat model.<\/li>\n<\/ul>\n<p>When retrieval improves, you unlock cheaper models, smaller contexts, and fewer retries. The inverse is not true. A stronger LLM on top of weak retrieval just burns cash more confidently.<\/p>\n<h2>Key takeaways<\/h2>\n<ul>\n<li>Retrieval sets the ceiling. The LLM just operates within it.<\/li>\n<li>Measure retrieval with a real eval set. 
Optimize recall@k and nDCG before prompt tinkering.<\/li>\n<li>Hybrid search is the default for enterprise data. Dense-only is a gamble.<\/li>\n<li>Chunk by structure and intent, not fixed length.<\/li>\n<li>Rerank. It is the cheapest accuracy gain after fixing chunking.<\/li>\n<li>Light domain adaptation of embeddings often beats buying a bigger chat model.<\/li>\n<li>Tune your index. Similarity choice, normalization, and ANN params are not defaults you can ignore.<\/li>\n<\/ul>\n<h2>If this resonates<\/h2>\n<p>If you\u2019re seeing stubborn errors despite fancy prompts and bigger models, it is probably your embeddings and retrieval stack. This is the kind of thing I help teams fix when systems start failing under real usage. Happy to take a look at your eval set and point out the first three changes that will move the needle.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The uncomfortable pattern I keep seeing teams swap GPT-X for GPT-Y, layer on prompt hacks, then wonder why answers are still off. The chat UI is polished. 
The model is&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[3],"tags":[26,15,29],"class_list":["post-67","post","type-post","status-publish","format-standard","hentry","category-ai-architecture","tag-ai-eval","tag-ai-system-design","tag-embeddings"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/67","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/comments?post=67"}],"version-history":[{"count":1,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/67\/revisions"}],"predecessor-version":[{"id":85,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/67\/revisions\/85"}],"wp:attachment":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/media?parent=67"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/categories?post=67"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/tags?post=67"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}