{"id":35,"date":"2025-10-11T10:27:03","date_gmt":"2025-10-11T10:27:03","guid":{"rendered":"https:\/\/angirash.in\/blog\/2025\/07\/14\/ai-build-vs-buy-decision-framework\/"},"modified":"2026-04-09T23:27:15","modified_gmt":"2026-04-09T23:27:15","slug":"ai-build-vs-buy-decision-framework","status":"publish","type":"post","link":"https:\/\/angirash.in\/blog\/2025\/10\/11\/ai-build-vs-buy-decision-framework\/","title":{"rendered":"Build vs Buy in AI: A Real Decision Framework That Holds Up in Production"},"content":{"rendered":"<h2>The honest problem<\/h2>\n<p>Most AI teams waste quarters arguing about build vs buy, then end up doing both in the worst way: they buy a black-box API and still build a half-baked orchestration layer around it. Costs creep. Latency creeps. Then a product leader asks for an on-prem install and the whole thing catches fire.<\/p>\n<p>I have walked into too many reviews where the system diagram looks clean, but the risks sit in the white space. The decision to build or buy is not one big call. It is a series of calls across the stack, each with different economics and failure modes.<\/p>\n<p>This post is the framework I actually use with teams, with the trade-offs you only learn after running something at scale.<\/p>\n<h2>Where teams get stuck and why<\/h2>\n<ul>\n<li>Where it shows up\n<ul>\n<li>Internal Q&amp;A over company docs<\/li>\n<li>AI copilots inside core workflows<\/li>\n<li>Domain-heavy reasoning or compliance-sensitive summarization<\/li>\n<li>Multi-tenant RAG platforms offered to customers<\/li>\n<li>Agents and tool-use with tight latency budgets<\/li>\n<\/ul>\n<\/li>\n<li>Why it happens in real systems\n<ul>\n<li>Model pricing and throughput are moving targets. What was cheap last quarter becomes a bottleneck after a product hit.<\/li>\n<li>Data gravity. The moment you ingest, clean, chunk, and index, you are locked into shapes that are expensive to undo.<\/li>\n<li>SLOs, not accuracy, drive architecture. 
A 95th percentile latency of 800 ms drives very different choices than 2.5 s.<\/li>\n<li>Vendors deprecate models, change safety defaults, or throttle per tenant. Your roadmap does not care.<\/li>\n<\/ul>\n<\/li>\n<li>What most teams misunderstand\n<ul>\n<li>Models are not the only lever. Routing, caching, and retrieval quality often beat switching the base model.<\/li>\n<li>You cannot buy your way out of eval and observability. If you do not own evaluation, you do not own quality.<\/li>\n<li>Procurement time and data agreements can dwarf engineering time. Plan for it or you will miss the window.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>The stack, the choices, the failure modes<\/h2>\n<p>Think in layers. Decide per layer, not per project.<\/p>\n<ul>\n<li>Application UX and product logic\n<ul>\n<li>Build. It is your differentiation.<\/li>\n<\/ul>\n<\/li>\n<li>Orchestration and tool routing\n<ul>\n<li>Early: Buy a thin framework or use a lightweight open library.<\/li>\n<li>Scale: Build minimal, vendor-agnostic adapters. Failure mode: coupling prompts and tool schemas to a single provider.<\/li>\n<\/ul>\n<\/li>\n<li>Reasoning model (LLM or mixture)\n<ul>\n<li>Start: Buy via API. Swap freely. Failure mode: pinning to one vendor-specific feature that saves 100 ms but traps you later.<\/li>\n<li>Scale or tight control: Host one mid-size model for the 60 to 80 percent traffic band, fall back to a premium API for hard cases.<\/li>\n<\/ul>\n<\/li>\n<li>Retrieval and embeddings\n<ul>\n<li>Buy the vector DB if you do not have ops maturity. Build your ingestion, chunking, and metadata policies.<\/li>\n<li>Failure mode: indexing everything with generic chunking, then paying permanent latency and storage tax.<\/li>\n<\/ul>\n<\/li>\n<li>Fine-tuning and adapters\n<ul>\n<li>Do not start here. Use RAG and prompt shaping until gains flatten. 
Failure mode: fine-tuning to patch data cleanliness issues.<\/li>\n<\/ul>\n<\/li>\n<li>Safety, red teaming, and policy enforcement\n<ul>\n<li>Buy policy libraries and classifiers. Build policy config and overrides. Failure mode: delegating policy to a vendor and learning about it in production incidents.<\/li>\n<\/ul>\n<\/li>\n<li>Observability and evaluation\n<ul>\n<li>Own the test sets, metrics, and approval gates. You can buy tooling, but you must own the data and thresholds.<\/li>\n<\/ul>\n<\/li>\n<li>Serving infrastructure\n<ul>\n<li>Under 5 rps and flexible SLOs: buy.<\/li>\n<li>Over 20 rps or sub-second targets: consider hosting at least one model tier. Failure mode: paying a per-token premium for predictable workloads.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Latency and cost are architectural, not configuration<\/h3>\n<ul>\n<li>Latency budget example for a RAG answer\n<ul>\n<li>Query rewrite: 80 to 150 ms<\/li>\n<li>Retrieval: 60 to 200 ms (network + ANN)<\/li>\n<li>Synthesis LLM: 400 to 1200 ms<\/li>\n<li>Tool calls or secondary prompts: 200 to 600 ms<\/li>\n<li>You will not hit sub-second P95 without parallelism, caching, or hosting at least one local model.<\/li>\n<\/ul>\n<\/li>\n<li>Cost shape you should compute on day one\n<ul>\n<li>Vendor API: price per 1M tokens is predictable but high at scale. Great for bursty or uncertain workloads.<\/li>\n<li>Self-hosted effective cost per 1M tokens<\/li>\n<li>Formula: (GPU_hourly_cost) \/ (throughput_tokens_per_sec \u00d7 3600) \u00d7 1,000,000<\/li>\n<li>Example 8 to 13B model on a mid-range GPU: ~150 to 300 tok\/s. At $0.60 to $1.20 per GPU hour, you land near $0.55 to $1.50 per 1M tokens at good utilization.<\/li>\n<li>Example 70B class on high-end GPU: ~20 to 40 tok\/s. At $2.00 to $3.50 per hour, often $14 to $35 per 1M tokens by the formula above, plus engineering time. 
Worth it only if you need control or volume is very high.<\/li>\n<li>The break-even is usually at steady, predictable traffic with caching in place.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>A decision framework that works<\/h2>\n<p>Score each layer on five dimensions from 1 to 5. Sum the row. Build if 15+, buy if under 10, hybrid in between. Keep it blunt.<\/p>\n<ul>\n<li>Differentiation potential. Will this layer create moat or just keep lights on?<\/li>\n<li>Data advantage. Do you have proprietary data that needs custom handling here?<\/li>\n<li>Control requirement. Latency SLO, privacy, model behavior guarantees.<\/li>\n<li>Scale predictability. Spiky traffic favors buy. Steady favors build.<\/li>\n<li>Compliance and risk. Regulated data, on-prem, data residency.<\/li>\n<\/ul>\n<p>Run the matrix per layer: model, retrieval, orchestration, eval, safety, serving. You will end up with a mixed answer. That is correct.<\/p>\n<h2>Practical patterns I recommend<\/h2>\n<h3>Pattern 1: Internal Q&amp;A over company docs<\/h3>\n<ul>\n<li>Buy: vector DB, base LLM via API, evaluation tooling.<\/li>\n<li>Build: ingestion pipeline, chunking policy, metadata schema, retrieval prompts, golden test set.<\/li>\n<li>Add now: response caching and prompt versioning.<\/li>\n<li>Watch for: permission leaks. Force retrieval to be tenant-scoped at index time, not request time only.<\/li>\n<\/ul>\n<h3>Pattern 2: Domain-heavy reasoning for core product<\/h3>\n<ul>\n<li>Start: single strong API model + strict retrieval + aggressive caching.<\/li>\n<li>When quality plateaus: try structured prompts and tool-use before fine-tune.<\/li>\n<li>At scale: host a mid-size model for easy queries, route 15 to 30 percent to premium API.<\/li>\n<li>Watch for: eval blindness. 
Maintain scenario-based test suites, not just aggregate scores.<\/li>\n<\/ul>\n<h3>Pattern 3: Low-latency copilot in an interactive UI<\/h3>\n<ul>\n<li>Target: P95 under 900 ms.<\/li>\n<li>Build: local embedding + retrieval, lightweight router, model cache.<\/li>\n<li>Buy: a fast small model or host an 8 to 13B yourself. Keep a premium fallback for hard prompts.<\/li>\n<li>Techniques: prefetch next-turn contexts, stream tokens, parallel tools, cache summaries.<\/li>\n<\/ul>\n<h2>Failure modes to avoid<\/h2>\n<ul>\n<li>Vendor glue becomes your core. If you cannot run your prompts and eval sets against two providers in 48 hours, you are locked in.<\/li>\n<li>Unbounded context. Tossing more context into prompts hides defects until cost and latency spike.<\/li>\n<li>No exit plan. Check data retention, egress fees, throttling policies, and model deprecation timelines before you sign.<\/li>\n<li>DIY eval too late. By the time customers complain, you have no baseline to compare fixes.<\/li>\n<\/ul>\n<h2>Business impact: numbers that matter<\/h2>\n<ul>\n<li>Cost\n<ul>\n<li>A steady 50 rps workload averaging 1,500 output tokens per request is about 270M tokens per hour. If your self-hosted mid-size model does 250 tok\/s on a $1\/hour GPU, your effective compute is about $1.11 per 1M tokens. With a premium API at higher per-million pricing, routing even half the traffic locally can cut monthly cost by 30 to 60 percent.<\/li>\n<li>For low volume or bursty workloads, the opposite is true. The moment utilization drops, self-hosting loses its edge.<\/li>\n<\/ul>\n<\/li>\n<li>Performance\n<ul>\n<li>Cross-region vendor calls add 100 to 250 ms. Your RAG and tool calls add another 200 to 600 ms. Hosting retrieval and at least one model in-region is often the difference between a snappy copilot and a spinner.<\/li>\n<\/ul>\n<\/li>\n<li>Scaling risk\n<ul>\n<li>Rate limits and model deprecations are product risks. 
Design for hot-swapping models and backfilling evals on each swap.<\/li>\n<li>Multi-tenant indexes without hard isolation become compliance incidents. Put tenancy in the index, not only in the query filter.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>What to actually build vs buy on day one<\/h2>\n<ul>\n<li>Build now\n<ul>\n<li>Data ingestion, normalization, and chunking policy<\/li>\n<li>Prompt templates, tool schemas, routing logic<\/li>\n<li>Evaluation harness, test sets, regression gates<\/li>\n<li>Response caching and prompt versioning<\/li>\n<\/ul>\n<\/li>\n<li>Buy now\n<ul>\n<li>Base LLM access via API<\/li>\n<li>Vector database or a managed ANN service<\/li>\n<li>Safety classifiers and policy packs, if you lack in-house expertise<\/li>\n<\/ul>\n<\/li>\n<li>Revisit at scale\n<ul>\n<li>Host a mid-size model if traffic is steady and latency matters<\/li>\n<li>Consider fine-tuning only after retrieval and prompts plateau<\/li>\n<li>Move from single to multi-model routing as use cases split<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>Key takeaways<\/h2>\n<ul>\n<li>Decide per layer. You will likely build some, buy some. That is healthy.<\/li>\n<li>Own the policies, prompts, evals, and data shapes. Rent the rest until the math flips.<\/li>\n<li>Your SLO and traffic shape pick your architecture more than accuracy does.<\/li>\n<li>Caching, routing, and retrieval quality beat model swapping in most cases.<\/li>\n<li>Have a 48-hour vendor exit drill. If you cannot switch, you do not own the system.<\/li>\n<\/ul>\n<h2>If you need a sounding board<\/h2>\n<p>If you are staring at a messy matrix and uncertain about the break-even, I have helped teams design for portability, hit latency targets, and cut monthly spend without losing quality. 
If you are running into similar issues, this is exactly the kind of thing I help teams fix when systems start breaking at scale.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The honest problem Most AI teams waste quarters arguing about build vs buy, then end up doing both in the worst way: they buy a black-box API and still build&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[10],"tags":[17,20,15],"class_list":["post-35","post","type-post","status-publish","format-standard","hentry","category-ai-strategy","tag-ai-cost","tag-ai-infra","tag-ai-system-design"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/35","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/comments?post=35"}],"version-history":[{"count":1,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/35\/revisions"}],"predecessor-version":[{"id":87,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/posts\/35\/revisions\/87"}],"wp:attachment":[{"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/media?parent=35"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/categories?post=35"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/angirash.in\/blog\/wp-json\/wp\/v2\/tags?post=35"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}