When AI Is The Wrong Solution (And What To Do Instead)

The uncomfortable truth: a lot of AI is busywork in disguise

If you can write the spec, you probably do not need an LLM. I keep seeing teams ship chatbots for problems that should have been solved with a form, a rules engine, or a better search bar. The project looks exciting for a quarter, then quietly accrues latency, cost, support tickets, and a graveyard of prompts.

This is not anti‑AI. I build and run AI systems. I also turn them off when they are the wrong tool.

Where this goes wrong and why

You will see the pattern in a few places:

  • Customer support deflection that needs 99.5% precision, but the data is messy and the answers must be exact
  • Data extraction on well-structured PDFs where regex plus a parser would be cheaper and more reliable
  • Internal search across a few hundred docs where BM25 with filters beats a full RAG stack
  • Workflow automation where a state machine is the core value and the LLM becomes a flaky decision-maker in the control plane
  • Personalization that is basically templating and targeting, not a generative problem

Why it happens in real systems:

  • Novelty bias. The team wants to do AI work, not “boring” plumbing
  • Vendor gravity. A platform demo looked magical, then reality hits your data and SLAs
  • Vague objectives. “Improve CX with AI” is not a requirement
  • No ground truth. You cannot evaluate, so “seems good” passes for a while
  • Misplaced abstraction. Treating open-ended generation like a deterministic decision engine

What most teams misunderstand:

  • Accuracy is not the only constraint. Latency budgets, tail reliability, and unit economics matter more in production
  • RAG does not fix unclear sources of truth or bad content hygiene
  • Fine-tuning does not remove the need for evaluation, guardrails, and deterministic backstops
  • “Agentic” does not mean “safe to let it click buttons in prod”

Technical deep dive: architecture, trade-offs, failures

Think at the system boundary:

  • Deterministic control planes vs probabilistic decision-makers
    • If the decision is binary, high-stakes, and testable, use code or a rules engine in the control path. Keep generative models on the side-path for suggestions, summaries, or UI polish
  • Latency and tail risk
    • LLMs introduce network calls, token streaming, and sometimes multi-hop tool use. P99 drifts upward. If your SLO is 200 ms end-to-end, you are likely out of budget
  • Context and retrieval complexity
    • RAG adds ingestion pipelines, chunking, embeddings, vector store ops, ACL-aware retrieval, and continuous reindexing. Each is another place to break and another dashboard to own
  • Evaluation blind spots
    • Many teams run subjective evals in staging, then ship. Without a steady flow of labeled outcomes and task-specific metrics, drift and regressions are invisible until customers complain
  • Failure modes I see most
    • Hallucinated actions or citations
    • Truncation because of hidden token growth and prompt bloat
    • Retrieval mismatch due to bad chunking or embedding updates lagging the source of truth
    • Tool-call loops and retries that explode latency and cost
    • Vendor model changes that shift behavior overnight

What to do instead: concrete alternatives that work

Use this list as a mental shortcut before you spin up a new AI service.

1) Deterministic decisions with clear specs

  • Instead of: an LLM that decides refunds, limits, routing, or compliance checks
  • Use: a rules engine or DSL with versioned decision tables, unit tests, and feature flags
  • Why: you can prove correctness, ship fast, and roll back safely
  • Add AI: for operator suggestions or explanations, not the final decision
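A decision table in plain code makes those properties concrete. A minimal sketch, with refund rules and thresholds invented purely for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RefundRequest:
    amount: float
    days_since_purchase: int
    is_fraud_flagged: bool

# Hypothetical versioned decision table: each row is (predicate, outcome).
# First matching row wins; every row is unit-testable and diffable.
DECISION_TABLE_V2 = [
    (lambda r: r.is_fraud_flagged, "escalate"),
    (lambda r: r.amount <= 50 and r.days_since_purchase <= 30, "auto_approve"),
    (lambda r: r.days_since_purchase > 90, "deny"),
]
DEFAULT_OUTCOME = "manual_review"

def decide_refund(req: RefundRequest) -> str:
    """Deterministic decision: provable, versioned, trivially rolled back."""
    for predicate, outcome in DECISION_TABLE_V2:
        if predicate(req):
            return outcome
    return DEFAULT_OUTCOME
```

Because the table is just data, you can ship v3 behind a feature flag, replay historical requests against both versions, and roll back in one deploy.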

2) Structured extraction from predictable layouts

  • Instead of: few-shot extraction from consistent invoices or forms
  • Use: template classifiers + layout parser + regex or a lightweight grammar
  • Why: cheap, explainable, and testable with golden files
  • Add AI: only for long-tail variants, gated by confidence
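A sketch of the template-plus-regex approach, with a coverage score that can gate the AI fallback. The field patterns are hypothetical; a real pipeline would have one pattern set per classified template:

```python
import re

# Hypothetical invoice template with a known layout, so plain regex suffices.
# Documents that do not fully parse fall through to the gated AI path.
INVOICE_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*#\s*(\w+)"),
    "total": re.compile(r"Total:\s*\$?([\d,]+\.\d{2})"),
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
}

def extract_invoice(text: str) -> tuple[dict, float]:
    """Return extracted fields plus a coverage score in [0, 1]."""
    fields = {}
    for name, pattern in INVOICE_PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1)
    coverage = len(fields) / len(INVOICE_PATTERNS)
    return fields, coverage

doc = "Invoice # A1234\nDate: 2024-03-01\nTotal: $1,299.00"
fields, coverage = extract_invoice(doc)
# coverage == 1.0 here; anything below a strict threshold would be routed
# to the long-tail AI fallback with a human in the loop.
```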

3) Small-to-medium corpus search

  • Instead of: full RAG for a few hundred docs
  • Use: BM25 or hybrid lexical-semantic search with strict filters and curated snippets
  • Why: better precision, lower ops overhead, easier ACL handling
  • Add AI: to rewrite the user query or summarize the top k results
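BM25 is simple enough to sketch in stdlib Python. This is a toy index to show how little machinery the lexical path needs, not a production implementation (in practice you would reach for Elasticsearch, OpenSearch, or similar):

```python
import math
from collections import Counter

K1, B = 1.5, 0.75  # standard BM25 free parameters

def bm25_index(docs: list[list[str]]):
    """Build a tiny in-memory BM25 index over tokenized docs."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))
    idf = {t: math.log((n - f + 0.5) / (f + 0.5) + 1) for t, f in df.items()}
    return docs, avgdl, idf

def bm25_search(index, query: list[str], top_k: int = 3):
    """Score every doc against the query; return (score, doc_id) pairs."""
    docs, avgdl, idf = index
    scores = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        score = sum(
            idf.get(q, 0.0) * tf[q] * (K1 + 1)
            / (tf[q] + K1 * (1 - B + B * len(d) / avgdl))
            for q in query if q in tf
        )
        scores.append((score, i))
    return sorted(scores, reverse=True)[:top_k]

corpus = [
    "reset your vpn password from the self service portal".split(),
    "expense reports are due the first friday of each month".split(),
    "vpn outage troubleshooting steps for remote employees".split(),
]
index = bm25_index(corpus)
results = bm25_search(index, "vpn password reset".split())
```

ACL filtering is then just a pre-filter on the candidate doc set before scoring, which is far easier to reason about than ACL-aware vector retrieval.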

4) Workflow automation and integrations

  • Instead of: an agent to orchestrate APIs
  • Use: a typed state machine with retries, timeouts, and idempotency
  • Why: predictable failure handling and observability
  • Add AI: to map unstructured input to typed intents before entering the state machine
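A typed state machine can be as small as a transition table. A sketch, with states and events that are purely illustrative (production code would wrap each transition in retries, timeouts, and idempotency keys):

```python
from enum import Enum, auto

class BillingState(Enum):
    RECEIVED = auto()
    VALIDATED = auto()
    CHARGED = auto()
    FAILED = auto()
    COMPLETED = auto()

# Explicit transition table: the only legal moves are the ones listed,
# so failures are enumerable instead of stochastic.
TRANSITIONS = {
    (BillingState.RECEIVED, "validate_ok"): BillingState.VALIDATED,
    (BillingState.RECEIVED, "validate_fail"): BillingState.FAILED,
    (BillingState.VALIDATED, "charge_ok"): BillingState.CHARGED,
    (BillingState.VALIDATED, "charge_fail"): BillingState.FAILED,
    (BillingState.CHARGED, "receipt_sent"): BillingState.COMPLETED,
}

def step(state: BillingState, event: str) -> BillingState:
    """Reject undefined transitions loudly instead of guessing."""
    key = (state, event)
    if key not in TRANSITIONS:
        raise ValueError(f"illegal transition: {state.name} + {event!r}")
    return TRANSITIONS[key]
```

The LLM's only job in this design is upstream: turning a messy user message into one of the typed events before the machine ever sees it.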

5) Customer support flows

  • Instead of: a free-form chat that must be right
  • Use: guided flows, better content hygiene, and a strong search index
  • Why: high precision and lower deflection risk
  • Add AI: tone rewrite, summarization, or draft answers that require a human click to send

6) Personalization and copy generation

  • Instead of: bespoke LLM per segment
  • Use: templated copy + feature flags + bandits for exploration/exploitation
  • Why: faster iteration and clean metrics
  • Add AI: to synthesize variants offline, not during every request
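The exploration piece can be a few lines. An epsilon-greedy sketch over pre-generated copy variants, with variant names and reward handling simplified for illustration:

```python
import random

class EpsilonGreedy:
    """Serve the best-known variant most of the time; explore occasionally."""

    def __init__(self, variants: list[str], epsilon: float = 0.1):
        self.epsilon = epsilon
        self.counts = {v: 0 for v in variants}
        self.rewards = {v: 0.0 for v in variants}

    def choose(self) -> str:
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))  # explore
        # exploit: highest observed mean reward (unseen variants score 0)
        return max(self.counts, key=lambda v:
                   self.rewards[v] / self.counts[v] if self.counts[v] else 0.0)

    def update(self, variant: str, reward: float) -> None:
        self.counts[variant] += 1
        self.rewards[variant] += reward
```

The variants themselves can be LLM-drafted offline and human-approved; the serving path stays deterministic, cheap, and measurable.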

7) Analytics and forecasting with limited signal

  • Instead of: complex models when data is sparse
  • Use: simple baselines and sanity checks, measure lift honestly
  • Why: many “AI wins” vanish when you factor seasonality and selection bias
  • Add AI: later, if you can prove incremental value over the baseline
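A seasonal-naive baseline is the honest yardstick here. A sketch with made-up, perfectly periodic signup numbers (so the baseline scores zero error, which is exactly the bar a fancier model would need to clear on real, noisy data):

```python
def seasonal_naive(history: list[float], season: int = 7) -> float:
    """Forecast: next value equals the same point one season ago."""
    return history[-season]

def mae(forecasts: list[float], actuals: list[float]) -> float:
    """Mean absolute error across the holdout window."""
    return sum(abs(f - a) for f, a in zip(forecasts, actuals)) / len(actuals)

# Hypothetical daily signups with weekly seasonality, repeated four weeks
history = [100, 120, 130, 90, 80, 60, 50] * 4
forecasts = [seasonal_naive(history[:i]) for i in range(21, 28)]
baseline_error = mae(forecasts, history[21:28])
```

Any model that cannot beat this one-liner on held-out data, after accounting for seasonality, has not earned its complexity.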

A practical decision checklist

Use this before wiring an LLM into the hot path:

  • Can you write a deterministic spec and unit test the outcome? If yes, do that first
  • Is the required precision above 99%, with expensive errors? Keep generative models out of the control path
  • Is the latency budget under 300 ms P99? Be skeptical
  • Do you have ground truth to evaluate against on an ongoing basis? If not, prioritize data and eval pipelines before models
  • Is the corpus small and well-structured? Try lexical or hybrid search before RAG
  • Will the prompt or context size grow over time? Expect your costs to drift upward nonlinearly
  • Can you separate suggest from decide? Keep AI on the suggest path with human or rules gating

Hybrid pattern I recommend

  • Low road: deterministic path with rules, search, and state machines
  • High road: AI sidecar that produces suggestions, summaries, or candidates with a confidence score
  • Gating: only promote AI output to users or systems if score and heuristics meet thresholds; otherwise fall back
  • Rollout: canary, measure, then widen. Keep a kill switch
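The gating step is the whole trick. A sketch, assuming the AI sidecar returns a (text, confidence) pair and a deterministic fallback always exists; the threshold and heuristics are illustrative placeholders:

```python
CONFIDENCE_THRESHOLD = 0.85  # tuned from offline evals, not guessed
MAX_LENGTH = 500             # cheap sanity heuristic

def gated_answer(query: str, ai_suggest, deterministic_answer) -> str:
    """Promote AI output only when score and heuristics pass; else fall back."""
    text, confidence = ai_suggest(query)
    passes_heuristics = 0 < len(text) <= MAX_LENGTH and "http" not in text
    if confidence >= CONFIDENCE_THRESHOLD and passes_heuristics:
        return text                      # high road: AI suggestion promoted
    return deterministic_answer(query)   # low road: rules/search fallback
```

The deterministic path doubles as the kill switch: setting the threshold above 1.0 routes 100% of traffic to the low road with no deploy.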

Evaluation that actually constrains risk

  • Define task-specific metrics that correlate with business outcomes, not just BLEU-like proxies
  • Build labeled datasets from day one, even if small. Add hard negatives from real failures
  • Track per-slice metrics. The average hides tail pain
  • Run offline batch evals on every change to prompts, retrieval, or model versions
  • Post-deploy, sample production traffic for human review and close the loop to your training or rule updates
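Per-slice metrics are cheap to compute once you have labels. A sketch, where the record shape and slice names are assumptions for illustration:

```python
from collections import defaultdict

def per_slice_accuracy(records: list[dict]) -> dict[str, float]:
    """Accuracy per slice; the overall average can hide a broken segment."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:  # r = {"slice": ..., "predicted": ..., "expected": ...}
        totals[r["slice"]] += 1
        hits[r["slice"]] += r["predicted"] == r["expected"]
    return {s: hits[s] / totals[s] for s in totals}

records = [
    {"slice": "enterprise", "predicted": "refund", "expected": "refund"},
    {"slice": "enterprise", "predicted": "refund", "expected": "refund"},
    {"slice": "free_tier", "predicted": "refund", "expected": "escalate"},
    {"slice": "free_tier", "predicted": "deny", "expected": "deny"},
]
# Overall accuracy is 75%, but the free_tier slice is only at 50%.
```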

Cost and performance impact in real terms

A simple way to estimate unit economics before you commit:

  • Cost per request ≈ input_tokens × price_in + output_tokens × price_out + retrieval_cost + orchestration_overhead
  • Example back-of-napkin: 2k input tokens and 300 output tokens on a mid-tier model at $2 per 1M input tokens and $8 per 1M output tokens is roughly 2,000 × $0.000002 + 300 × $0.000008 ≈ $0.0064 per call, before retrieval and retries. At 10 RPS sustained, that is about 26M calls and roughly $166k per month just for the model, which can double with RAG and retries
  • Tail latency compounds with tool calls. Each hop adds P99 spread. Your SLO must reflect the real chain, not the median of one call
  • Scaling risk: context growth and prompt creep raise cost over time. Vector indexes grow and reindexing becomes a line item. Vendor model changes can shift your quality overnight
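The formula above as runnable code, with illustrative prices rather than any specific vendor's rates:

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     price_in_per_m: float = 2.0, price_out_per_m: float = 8.0,
                     retrieval_cost: float = 0.0, overhead: float = 0.0) -> float:
    """Dollars per call: token costs plus retrieval and orchestration."""
    model = (input_tokens * price_in_per_m / 1e6
             + output_tokens * price_out_per_m / 1e6)
    return model + retrieval_cost + overhead

def monthly_cost(rps: float, per_request: float) -> float:
    """Sustained requests-per-second extrapolated to a 30-day month."""
    return rps * 60 * 60 * 24 * 30 * per_request

per_call = cost_per_request(2_000, 300)   # ≈ $0.0064 per call
monthly = monthly_cost(10, per_call)      # ≈ $166k/month at 10 RPS sustained
```

Run this with your own prices and a pessimistic retry multiplier before committing; most teams are surprised by the monthly number, not the per-call one.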

If a deterministic alternative solves 90% of the problem at 10% of the cost and you can ship it in 2 weeks, ship it. You can still layer AI later where it moves the needle.

Quick examples from the field

  • Replaced a RAG chatbot for internal IT with a curated KB, BM25, and a form. The auto-resolve rate rose from 18% to 41%. P95 latency dropped from 2.1 s to 140 ms. Costs became negligible
  • Swapped an agent-based billing workflow for a typed state machine with a small intent classifier. Incident rate dropped 70%. Mean time to recover improved because failures were explicit, not stochastic
  • For invoice extraction, regex plus layout parsing handled 92% of docs. An AI fallback covered the remaining 8% with human-in-the-loop. Overall accuracy hit target without putting a model in the control path

Key takeaways

  • If you can test it, code it. Keep AI out of the control plane unless you can tolerate and detect errors
  • Start with search, rules, and state machines. Add AI as a sidecar, not the engine
  • Invest in evaluation and data before you scale models. Without it, you are flying blind
  • Model unit economics drift over time. Watch context growth and retries, not just vendor list prices
  • RAG is not a shortcut for bad content or unclear sources of truth
  • Confidence gating and fallbacks are not optional in production

If you are stuck

If you are staring at a flaky AI feature that works in demos but not at scale, you likely have a product and architecture problem, not a prompt problem. I help teams untangle this: cut scope to what is deterministic, then reintroduce AI where it truly pays. This is exactly the kind of thing I fix when systems start breaking at scale.