When AI Is The Wrong Solution (And What To Do Instead)

The uncomfortable truth: a lot of AI is busywork in disguise

If you can write the spec, you probably do not need an LLM. I keep seeing teams ship chatbots for problems that should have been solved with a form, a rules engine, or a better search bar. The project looks exciting for a quarter, then quietly accrues latency, cost, support tickets, and a graveyard of prompts.

This is not anti‑AI. I build and run AI systems. I also turn them off when they are the wrong tool.

Where this goes wrong and why

You will see the pattern in a few places:

  • Customer support deflection that needs 99.5% precision, but the data is messy and the answers must be exact
  • Data extraction on well-structured PDFs where regex plus a parser would be cheaper and more reliable
  • Internal search across a few hundred docs where BM25 with filters beats a full RAG stack
  • Workflow automation where a state machine is the core value and the LLM becomes a flaky decision-maker in the control plane
  • Personalization that is basically templating and targeting, not a generative problem

Why it happens in real systems:

  • Novelty bias. The team wants to do AI work, not “boring” plumbing
  • Vendor gravity. A platform demo looked magical, then reality hits your data and SLAs
  • Vague objectives. “Improve CX with AI” is not a requirement
  • No ground truth. You cannot evaluate, so “seems good” passes for a while
  • Misplaced abstraction. Treating open-ended generation like a deterministic decision engine

What most teams misunderstand:

  • Accuracy is not the only constraint. Latency budgets, tail reliability, and unit economics matter more in production
  • RAG does not fix unclear sources of truth or bad content hygiene
  • Fine-tuning does not remove the need for evaluation, guardrails, and deterministic backstops
  • “Agentic” does not mean “safe to let it click buttons in prod”

Technical deep dive: architecture, trade-offs, failures

Think at the system boundary:

  • Deterministic control planes vs probabilistic decision-makers
    • If the decision is binary, high-stakes, and testable, use code or a rules engine in the control path. Keep generative models on the side-path for suggestions, summaries, or UI polish
  • Latency and tail risk
    • LLMs introduce network calls, token streaming, and sometimes multi-hop tool use. P99 drifts upward. If your SLO is 200 ms end-to-end, you are likely out of budget
  • Context and retrieval complexity
    • RAG adds ingestion pipelines, chunking, embeddings, vector store ops, ACL-aware retrieval, and continuous reindexing. Each is another place to break and another dashboard to own
  • Evaluation blind spots
    • Many teams run subjective evals in staging, then ship. Without a steady flow of labeled outcomes and task-specific metrics, drift and regressions are invisible until customers complain
  • Failure modes I see most
    • Hallucinated actions or citations
    • Truncation because of hidden token growth and prompt bloat
    • Retrieval mismatch due to bad chunking or embedding updates lagging the source of truth
    • Tool-call loops and retries that explode latency and cost
    • Vendor model changes that shift behavior overnight

What to do instead: concrete alternatives that work

Use this list as a mental shortcut before you spin up a new AI service.

1) Deterministic decisions with clear specs

  • Instead of: an LLM that decides refunds, limits, routing, or compliance checks
  • Use: a rules engine or DSL with versioned decision tables, unit tests, and feature flags
  • Why: you can prove correctness, ship fast, and roll back safely
  • Add AI: for operator suggestions or explanations, not the final decision
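A decision table in plain code makes those properties concrete. A minimal sketch, with refund rules and thresholds invented purely for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RefundRequest:
    amount: float
    days_since_purchase: int
    is_fraud_flagged: bool

# Hypothetical versioned decision table: each row is (predicate, outcome).
# First matching row wins; every row is unit-testable and diffable.
DECISION_TABLE_V2 = [
    (lambda r: r.is_fraud_flagged, "escalate"),
    (lambda r: r.amount <= 50 and r.days_since_purchase <= 30, "auto_approve"),
    (lambda r: r.days_since_purchase > 90, "deny"),
]
DEFAULT_OUTCOME = "manual_review"

def decide_refund(req: RefundRequest) -> str:
    """Deterministic decision: provable, versioned, trivially rolled back."""
    for predicate, outcome in DECISION_TABLE_V2:
        if predicate(req):
            return outcome
    return DEFAULT_OUTCOME
```

Because the table is just data, you can ship v3 behind a feature flag, replay historical requests against both versions, and roll back in one deploy.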

2) Structured extraction from predictable layouts

  • Instead of: few-shot extraction from consistent invoices or forms
  • Use: template classifiers + layout parser + regex or a lightweight grammar
  • Why: cheap, explainable, and testable with golden files
  • Add AI: only for long-tail variants, gated by confidence
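A sketch of the template-plus-regex approach, with a coverage score that can gate the AI fallback. The field patterns are hypothetical; a real pipeline would have one pattern set per classified template:

```python
import re

# Hypothetical invoice template with a known layout, so plain regex suffices.
# Documents that do not fully parse fall through to the gated AI path.
INVOICE_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*#\s*(\w+)"),
    "total": re.compile(r"Total:\s*\$?([\d,]+\.\d{2})"),
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
}

def extract_invoice(text: str) -> tuple[dict, float]:
    """Return extracted fields plus a coverage score in [0, 1]."""
    fields = {}
    for name, pattern in INVOICE_PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1)
    coverage = len(fields) / len(INVOICE_PATTERNS)
    return fields, coverage

doc = "Invoice # A1234\nDate: 2024-03-01\nTotal: $1,299.00"
fields, coverage = extract_invoice(doc)
# coverage == 1.0 here; anything below a strict threshold would be routed
# to the long-tail AI fallback with a human in the loop.
```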

3) Small-to-medium corpus search

  • Instead of: full RAG for a few hundred docs
  • Use: BM25 or hybrid lexical-semantic search with strict filters and curated snippets
  • Why: better precision, lower ops overhead, easier ACL handling
  • Add AI: to rewrite the user query or summarize the top k results
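BM25 is simple enough to sketch in stdlib Python. This is a toy index to show how little machinery the lexical path needs, not a production implementation (in practice you would reach for Elasticsearch, OpenSearch, or similar):

```python
import math
from collections import Counter

K1, B = 1.5, 0.75  # standard BM25 free parameters

def bm25_index(docs: list[list[str]]):
    """Build a tiny in-memory BM25 index over tokenized docs."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))
    idf = {t: math.log((n - f + 0.5) / (f + 0.5) + 1) for t, f in df.items()}
    return docs, avgdl, idf

def bm25_search(index, query: list[str], top_k: int = 3):
    """Score every doc against the query; return (score, doc_id) pairs."""
    docs, avgdl, idf = index
    scores = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        score = sum(
            idf.get(q, 0.0) * tf[q] * (K1 + 1)
            / (tf[q] + K1 * (1 - B + B * len(d) / avgdl))
            for q in query if q in tf
        )
        scores.append((score, i))
    return sorted(scores, reverse=True)[:top_k]

corpus = [
    "reset your vpn password from the self service portal".split(),
    "expense reports are due the first friday of each month".split(),
    "vpn outage troubleshooting steps for remote employees".split(),
]
index = bm25_index(corpus)
results = bm25_search(index, "vpn password reset".split())
```

ACL filtering is then just a pre-filter on the candidate doc set before scoring, which is far easier to reason about than ACL-aware vector retrieval.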

4) Workflow automation and integrations

  • Instead of: an agent to orchestrate APIs
  • Use: a typed state machine with retries, timeouts, and idempotency
  • Why: predictable failure handling and observability
  • Add AI: to map unstructured input to typed intents before entering the state machine
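A typed state machine can be as small as a transition table. A sketch, with states and events that are purely illustrative (production code would wrap each transition in retries, timeouts, and idempotency keys):

```python
from enum import Enum, auto

class BillingState(Enum):
    RECEIVED = auto()
    VALIDATED = auto()
    CHARGED = auto()
    FAILED = auto()
    COMPLETED = auto()

# Explicit transition table: the only legal moves are the ones listed,
# so failures are enumerable instead of stochastic.
TRANSITIONS = {
    (BillingState.RECEIVED, "validate_ok"): BillingState.VALIDATED,
    (BillingState.RECEIVED, "validate_fail"): BillingState.FAILED,
    (BillingState.VALIDATED, "charge_ok"): BillingState.CHARGED,
    (BillingState.VALIDATED, "charge_fail"): BillingState.FAILED,
    (BillingState.CHARGED, "receipt_sent"): BillingState.COMPLETED,
}

def step(state: BillingState, event: str) -> BillingState:
    """Reject undefined transitions loudly instead of guessing."""
    key = (state, event)
    if key not in TRANSITIONS:
        raise ValueError(f"illegal transition: {state.name} + {event!r}")
    return TRANSITIONS[key]
```

The LLM's only job in this design is upstream: turning a messy user message into one of the typed events before the machine ever sees it.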

5) Customer support flows

  • Instead of: a free-form chat that must be right
  • Use: guided flows, better content hygiene, and a strong search index
  • Why: high precision and lower deflection risk
  • Add AI: tone rewrite, summarization, or draft answers that require a human click to send

6) Personalization and copy generation

  • Instead of: bespoke LLM per segment
  • Use: templated copy + feature flags + bandits for exploration/exploitation
  • Why: faster iteration and clean metrics
  • Add AI: to synthesize variants offline, not during every request
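The exploration piece can be a few lines. An epsilon-greedy sketch over pre-generated copy variants, with variant names and reward handling simplified for illustration:

```python
import random

class EpsilonGreedy:
    """Serve the best-known variant most of the time; explore occasionally."""

    def __init__(self, variants: list[str], epsilon: float = 0.1):
        self.epsilon = epsilon
        self.counts = {v: 0 for v in variants}
        self.rewards = {v: 0.0 for v in variants}

    def choose(self) -> str:
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))  # explore
        # exploit: highest observed mean reward (unseen variants score 0)
        return max(self.counts, key=lambda v:
                   self.rewards[v] / self.counts[v] if self.counts[v] else 0.0)

    def update(self, variant: str, reward: float) -> None:
        self.counts[variant] += 1
        self.rewards[variant] += reward
```

The variants themselves can be LLM-drafted offline and human-approved; the serving path stays deterministic, cheap, and measurable.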

7) Analytics and forecasting with limited signal

  • Instead of: complex models when data is sparse
  • Use: simple baselines and sanity checks, measure lift honestly
  • Why: many “AI wins” vanish when you factor seasonality and selection bias
  • Add AI: later, if you can prove incremental value over the baseline
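A seasonal-naive baseline is the honest yardstick here. A sketch with made-up, perfectly periodic signup numbers (so the baseline scores zero error, which is exactly the bar a fancier model would need to clear on real, noisy data):

```python
def seasonal_naive(history: list[float], season: int = 7) -> float:
    """Forecast: next value equals the same point one season ago."""
    return history[-season]

def mae(forecasts: list[float], actuals: list[float]) -> float:
    """Mean absolute error across the holdout window."""
    return sum(abs(f - a) for f, a in zip(forecasts, actuals)) / len(actuals)

# Hypothetical daily signups with weekly seasonality, repeated four weeks
history = [100, 120, 130, 90, 80, 60, 50] * 4
forecasts = [seasonal_naive(history[:i]) for i in range(21, 28)]
baseline_error = mae(forecasts, history[21:28])
```

Any model that cannot beat this one-liner on held-out data, after accounting for seasonality, has not earned its complexity.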

A practical decision checklist

Use this before wiring an LLM into the hot path:

  • Can you write a deterministic spec and unit test the outcome? If yes, do that first
  • Is the required precision above 99%, with expensive errors? Keep generative models out of the control path
  • Is the latency budget under 300 ms P99? Be skeptical
  • Do you have ground truth to evaluate against on an ongoing basis? If not, prioritize data and eval pipelines before models
  • Is the corpus small and well-structured? Try lexical or hybrid search before RAG
  • Will the prompt or context size grow over time? Expect your costs to drift upward nonlinearly
  • Can you separate suggest from decide? Keep AI on the suggest path with human or rules gating

Hybrid pattern I recommend

  • Low road: deterministic path with rules, search, and state machines
  • High road: AI sidecar that produces suggestions, summaries, or candidates with a confidence score
  • Gating: only promote AI output to users or systems if score and heuristics meet thresholds; otherwise fall back
  • Rollout: canary, measure, then widen. Keep a kill switch
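The gating step is the whole trick. A sketch, assuming the AI sidecar returns a (text, confidence) pair and a deterministic fallback always exists; the threshold and heuristics are illustrative placeholders:

```python
CONFIDENCE_THRESHOLD = 0.85  # tuned from offline evals, not guessed
MAX_LENGTH = 500             # cheap sanity heuristic

def gated_answer(query: str, ai_suggest, deterministic_answer) -> str:
    """Promote AI output only when score and heuristics pass; else fall back."""
    text, confidence = ai_suggest(query)
    passes_heuristics = 0 < len(text) <= MAX_LENGTH and "http" not in text
    if confidence >= CONFIDENCE_THRESHOLD and passes_heuristics:
        return text                      # high road: AI suggestion promoted
    return deterministic_answer(query)   # low road: rules/search fallback
```

The deterministic path doubles as the kill switch: setting the threshold above 1.0 routes 100% of traffic to the low road with no deploy.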

Evaluation that actually constrains risk

  • Define task-specific metrics that correlate with business outcomes, not just BLEU-like proxies
  • Build labeled datasets from day one, even if small. Add hard negatives from real failures
  • Track per-slice metrics. The average hides tail pain
  • Run offline batch evals on every change to prompts, retrieval, or model versions
  • Post-deploy, sample production traffic for human review and close the loop to your training or rule updates
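Per-slice metrics are cheap to compute once you have labels. A sketch, where the record shape and slice names are assumptions for illustration:

```python
from collections import defaultdict

def per_slice_accuracy(records: list[dict]) -> dict[str, float]:
    """Accuracy per slice; the overall average can hide a broken segment."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:  # r = {"slice": ..., "predicted": ..., "expected": ...}
        totals[r["slice"]] += 1
        hits[r["slice"]] += r["predicted"] == r["expected"]
    return {s: hits[s] / totals[s] for s in totals}

records = [
    {"slice": "enterprise", "predicted": "refund", "expected": "refund"},
    {"slice": "enterprise", "predicted": "refund", "expected": "refund"},
    {"slice": "free_tier", "predicted": "refund", "expected": "escalate"},
    {"slice": "free_tier", "predicted": "deny", "expected": "deny"},
]
# Overall accuracy is 75%, but the free_tier slice is only at 50%.
```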

Cost and performance impact in real terms

A simple way to estimate unit economics before you commit:

  • Cost per request ≈ input_tokens × price_in + output_tokens × price_out + retrieval_cost + orchestration_overhead
  • Example back-of-napkin: 2k input tokens and 300 output tokens on a mid-tier model at $2 per 1M input tokens and $8 per 1M output tokens is roughly 2,000 × $0.000002 + 300 × $0.000008 ≈ $0.0064 per call, before retrieval and retries. At 10 RPS sustained, that is about 26M calls and roughly $166k per month just for the model, which can double with RAG and retries
  • Tail latency compounds with tool calls. Each hop adds P99 spread. Your SLO must reflect the real chain, not the median of one call
  • Scaling risk: context growth and prompt creep raise cost over time. Vector indexes grow and reindexing becomes a line item. Vendor model changes can shift your quality overnight
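The formula above as runnable code, with illustrative prices rather than any specific vendor's rates:

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     price_in_per_m: float = 2.0, price_out_per_m: float = 8.0,
                     retrieval_cost: float = 0.0, overhead: float = 0.0) -> float:
    """Dollars per call: token costs plus retrieval and orchestration."""
    model = (input_tokens * price_in_per_m / 1e6
             + output_tokens * price_out_per_m / 1e6)
    return model + retrieval_cost + overhead

def monthly_cost(rps: float, per_request: float) -> float:
    """Sustained requests-per-second extrapolated to a 30-day month."""
    return rps * 60 * 60 * 24 * 30 * per_request

per_call = cost_per_request(2_000, 300)   # ≈ $0.0064 per call
monthly = monthly_cost(10, per_call)      # ≈ $166k/month at 10 RPS sustained
```

Run this with your own prices and a pessimistic retry multiplier before committing; most teams are surprised by the monthly number, not the per-call one.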

If a deterministic alternative solves 90% of the problem at 10% of the cost and you can ship it in 2 weeks, ship it. You can still layer AI later where it moves the needle.

Quick examples from the field

  • Replaced a RAG chatbot for internal IT with a curated KB, BM25, and a form. The auto-resolve rate rose from 18% to 41%. P95 latency dropped from 2.1 s to 140 ms. Costs became negligible
  • Swapped an agent-based billing workflow for a typed state machine with a small intent classifier. Incident rate dropped 70%. Mean time to recover improved because failures were explicit, not stochastic
  • For invoice extraction, regex plus layout parsing handled 92% of docs. An AI fallback covered the remaining 8% with human-in-the-loop. Overall accuracy hit target without putting a model in the control path

Key takeaways

  • If you can test it, code it. Keep AI out of the control plane unless you can tolerate and detect errors
  • Start with search, rules, and state machines. Add AI as a sidecar, not the engine
  • Invest in evaluation and data before you scale models. Without it, you are flying blind
  • Model unit economics drift over time. Watch context growth and retries, not just vendor list prices
  • RAG is not a shortcut for bad content or unclear sources of truth
  • Confidence gating and fallbacks are not optional in production

If you are stuck

If you are staring at a flaky AI feature that works in demos but not at scale, you likely have a product and architecture problem, not a prompt problem. I help teams untangle this: cut scope to what is deterministic, then reintroduce AI where it truly pays. This is exactly the kind of thing I fix when systems start breaking at scale.