Versioning in LLM Systems: What Actually Matters in Production

The quiet failure that burns teams

Most LLM incidents I get called into are not caused by GPUs catching fire or models forgetting how to English. They come from teams shipping a “minor update” and losing the ability to reproduce behavior. Support tickets spike, on-call scrambles, someone says “roll it back” and then discovers there is nothing concrete to roll back to except a vague idea of which prompt or model alias changed.

If you only version the model ID, you do not have a production system. You have a demo running on hope.

Where versioning breaks in real systems

  • RAG drift. Re-embedding a corpus with a new embedding model silently changes recall. A week later, finance data answers are off by just enough to be dangerous.
  • Prompt churn. A well-meaning PM edits a prompt in the CMS. Hit rate on tool calls drops. Nobody knows which users got which prompt.
  • Vendor alias roulette. Provider updates a model behind an alias. Your refusal rate jumps and your analytics blames “seasonality.”
  • Function schema mismatch. A new tool field is optional in your local schema but required by the model's function-calling contract. Calls fail only on long-tail inputs, and you cannot reproduce them because traces don't include the schema version.
  • Cache poisoning. Cache keys ignore top_p or system prompt. Old responses leak into new experiments and your offline eval looks great while prod drifts.

Why it happens:
– LLM stacks are graphs, not single models. You have prompts, tools, retrievers, indexes, safety, sampling, post-processors, caches, and data snapshots. Any piece moving changes behavior.
– Non-determinism hides causality. Small changes only show up statistically, which is easy to misread if you lack clean version boundaries.
– Vendors optimize for throughput and cost, not your reproducibility. Aliases change. Safety layers shift. You will not get a heads-up every time.

What many teams misunderstand:
– “Pinning model=foo-2024-11” is not versioning. Contract stability, data snapshots, and orchestration logic matter more than the base model name.
– Prompts and few-shots are code. If they live in Notion or a CMS without a manifest, you are shipping unversioned logic.
– RAG indexes are compiled artifacts. Treat them like build outputs with immutable IDs, not a “service” that is always the latest.

The versioning surface of an LLM system

Here is the mental model I use. Version the system as bundles, not as a single number. Each bundle is immutable, addressable, and shows up in logs on every request.

Capability contract version

  • What it is: Input and output schema for a capability. Example: answer_question takes user_text, tenant_id, and returns answer, citations[].
  • Why it matters: This is what you promise to other teams and to analytics. Break this and you get cascading failures.
  • Versioning: Semantic if you must, timestamped if you prefer. Every deployed route should include the contract version in its path or header.

Orchestration bundle

  • Components: Prompt template + few-shot exemplars + routing logic + tool selection policy + post-processing rules.
  • Include: Template text with exact variables, exemplar IDs, routing thresholds, stop sequences, JSON mode flags, logit biases.
  • Gotcha: People forget to version exemplars. A single edited exemplar can flip behavior.

Model bundle

  • Components: Provider, model ID, temperature, top_p, max_tokens, frequency/presence penalties, system prompt, seed if supported, safety settings, JSON schema constraints, stop tokens.
  • Include: Provider SDK version, region, retry policy, timeout settings.
  • Opinion: Never use provider aliases in prod. Always pin to a dated model ID, then shadow-test the alias separately.

Retrieval bundle

  • Components: Corpus snapshot ID, document set filters, chunking strategy, tokenizer version, chunk size and overlap, embedding model and parameters, index type and build params, re-ranker model version.
  • Include: Build job ID, timestamp, code commit that produced chunks, HNSW or PQ params, normalize options, dedupe rules.
  • Treat the corpus snapshot as immutable, content-addressed storage in an object store, with a manifest listing the doc→chunk mapping.

Tool bundle

  • Components: Tool JSON Schema, function signatures, required/optional fields, timeouts, retries, client lib version, API version of the downstream system.
  • Include: Guard conditions for tool eligibility, rate limits, and backoff policies.

Guardrails and policy bundle

  • Components: Moderation model version, regex or parser rules, allowlists and denylists, PII redaction rules.
  • Include: Escalation policy for blocked outputs.

Evaluation suite

  • Components: Canonical test sets, adversarial cases, business-critical cohorts, metrics and thresholds.
  • Include: Random seed, sampling count, span-level definitions for hallucination scoring, cost accounting method.

Caching policy version

  • Components: Cache key fields, TTL, scope (per-tenant, global), serialization version.
  • Include: Whether retrieval snapshot ID and model bundle hash are in the key. If not, they will bite you.

Telemetry schema version

  • Components: What you log for each request. Must include all bundle IDs above and per-stage latencies.
  • Include: Privacy scrubber version, since PII handling often changes.

Failure modes and trade-offs

  • Granularity vs bureaucracy. If you version everything, the team will feel slowed down. If you do not, you will not be able to debug. My rule: version the boundaries where behavior changes the most and make everything else derivable from code commit IDs.
  • Determinism illusions. Setting temperature to zero does not make closed models deterministic. Treat evaluation as distributional. Gate on medians and quantiles, not single-run outputs.
  • Storage vs rebuild time. Keeping multiple index snapshots costs storage. Rebuilding on demand costs time and compute. If your corpus changes daily or the system is regulated, pay the storage bill.
  • Shadow vs canary cost. Shadowing doubles requests. Canarying risks some user traffic. Pick based on blast radius and your rollback confidence.

I have seen teams spend two days chasing a regression that turned out to be a change in chunk overlap from 64 to 32 in a helper library. Nobody recorded it. A 20-byte field in a manifest would have saved 16 engineer-hours and a weekend.

What to actually implement

You do not need a platform rewrite. You need a manifest, some hashing, and discipline.

1. Ship a request-scoped manifest

  • For every user request, attach a manifest that lists the IDs of the bundles used: contract, orchestration, model, retrieval, tool, guardrail, cache policy, telemetry schema.
  • Compute a content hash over the normalized manifest. Store both the hash and the manifest JSON with the trace.
  • Add the manifest hash to cache keys. This prevents cross-talk between experiments and prod.

A manifest field list that consistently works:
– contract_id
– orchestration_id
– model_bundle_id
– retrieval_bundle_id
– tool_bundle_ids[]
– guardrail_bundle_id
– cache_policy_id
– telemetry_schema_id
– code_commit
– infra_region
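The field list above can be captured as a frozen dataclass with a normalized content hash. A minimal Python sketch, where all field values and the 16-character hash truncation are illustrative choices:

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class RequestManifest:
    # Fields mirror the list above; all IDs are illustrative.
    contract_id: str
    orchestration_id: str
    model_bundle_id: str
    retrieval_bundle_id: str
    tool_bundle_ids: tuple
    guardrail_bundle_id: str
    cache_policy_id: str
    telemetry_schema_id: str
    code_commit: str
    infra_region: str

    def content_hash(self) -> str:
        # Normalize: sorted keys, fixed separators, so the same manifest
        # always produces the same hash regardless of field order or whitespace.
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode()).hexdigest()[:16]
```

Store the hash and the full manifest JSON with every trace, and fold the hash into cache keys so experiments cannot poison production entries.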

2. Treat RAG as compiled artifacts

  • Write your chunker output and embeddings to an immutable path like s3://corpus/finance/2025-02-12T08-30Z/.
  • Persist a retrieval manifest with document list, chunking params, embedding model ID, index build params.
  • Store the exact embedding job logs with counts and any dropped docs. Add the retrieval bundle ID to every query trace.
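One way to make the "compiled artifact" idea concrete: derive the retrieval bundle ID from a hash of the build inputs, so two identical builds get the same ID no matter when they ran. A sketch with a hypothetical helper; the parameter names and ID format are assumptions, not a standard:

```python
import hashlib
import json

def build_retrieval_manifest(snapshot_path, doc_ids, chunking,
                             embedding_model, index_params, code_commit):
    """Describe one immutable index build; the bundle ID is content-addressed."""
    build_inputs = {
        "documents": sorted(doc_ids),
        "chunking": chunking,                # e.g. {"size": 512, "overlap": 64}
        "embedding_model": embedding_model,  # pinned, dated model ID
        "index": index_params,               # e.g. HNSW M and ef_construction
        "code_commit": code_commit,
    }
    # Hash only the build inputs, not the snapshot path or timestamp,
    # so identical builds collide on purpose.
    blob = json.dumps(build_inputs, sort_keys=True, separators=(",", ":")).encode()
    bundle_id = "rb_" + hashlib.sha256(blob).hexdigest()[:12]
    return {"retrieval_bundle_id": bundle_id,
            "snapshot_path": snapshot_path,
            **build_inputs}
```

Write this manifest into the snapshot path itself, and stamp the `retrieval_bundle_id` onto every query trace.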

3. Pin providers and schema

  • Use explicit model IDs. No aliases. Track provider SDK version.
  • For function calling, keep the tool schema in your repo and stamp it with a version in the manifest. Validate tool calls against the schema at runtime and log rejects.
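Runtime validation can be lightweight. A sketch using a simplified schema shape (the `get_invoice` tool, its fields, and the required/optional layout are all illustrative stand-ins for a full JSON Schema):

```python
import hashlib
import json

TOOL_SCHEMA = {  # lives in the repo; this example schema is hypothetical
    "name": "get_invoice",
    "version": "3",
    "parameters": {
        "required": ["tenant_id", "invoice_id"],
        "optional": ["currency"],
    },
}

def tool_schema_id(schema):
    """Stamp a content-hashed schema ID into the manifest and every trace."""
    blob = json.dumps(schema, sort_keys=True, separators=(",", ":")).encode()
    return f"{schema['name']}@{schema['version']}:" + hashlib.sha256(blob).hexdigest()[:8]

def validate_tool_call(schema, args):
    """Return a list of problems; empty means the call passes. Log rejects."""
    params = schema["parameters"]
    allowed = set(params["required"]) | set(params["optional"])
    problems = [f"missing required field: {f}" for f in params["required"] if f not in args]
    problems += [f"unknown field: {f}" for f in args if f not in allowed]
    return problems
```

Rejections logged with the schema ID are exactly what turns "calls fail only on long-tail inputs" from a mystery into a one-line diff.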

4. Build evaluation around bundles

  • Your eval runner should take a manifest or a set of bundle IDs, not a loose set of flags.
  • Run distributional evals. At least N=20 samples per prompt for stochastic models when measuring semantic correctness. Track cost per test.
  • Gate merges on a small set of operational thresholds: retrieval recall, function-call success rate, refusal rate on safe inputs, P50 and P95 latency, and cost per task.
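A distributional gate can be a few lines of stdlib. The thresholds below are illustrative; tune them per capability:

```python
import statistics

def passes_gate(scores, latencies_ms, min_median_score=0.85, max_p95_ms=2500, min_n=20):
    """Gate on medians and quantiles, never on a single stochastic run."""
    if len(scores) < min_n:
        return False  # too few samples to trust the distribution
    median_score = statistics.median(scores)
    # statistics.quantiles(n=20) returns the 5%..95% cut points; index 18 is p95.
    p95_latency = statistics.quantiles(latencies_ms, n=20)[18]
    return median_score >= min_median_score and p95_latency <= max_p95_ms
```

The eval runner feeds this from samples produced against one manifest, so a failed gate points at a bundle ID, not at a vibe.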

5. Roll out with real controls

  • Canary by capability contract, not by model. Roll 1 percent, then 10, then 50.
  • Watch these real-time signals per bundle hash: 4xx and 5xx tool errors, moderation block rate, tokens out vs in, top-k retrieval overlap, user override rate.
  • Keep a hard revert that flips traffic back to the last good manifest hash. Do not rely on redeploys for rollbacks.
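The hard revert is just a traffic switch keyed by manifest hash. An in-memory sketch; in production this state would live in a config store or feature-flag service, and the class and method names here are assumptions:

```python
import hashlib

class CapabilityRouter:
    """Per-capability traffic split between a candidate and the last-good manifest."""

    def __init__(self, stable_manifest_hash):
        self.stable = stable_manifest_hash  # last known-good manifest
        self.candidate = None
        self.candidate_pct = 0

    def start_canary(self, manifest_hash, pct):
        self.candidate, self.candidate_pct = manifest_hash, pct

    def hard_revert(self):
        # Flip all traffic back to the last good manifest. No redeploy involved.
        self.candidate, self.candidate_pct = None, 0

    def route(self, request_id):
        # Stable per-request bucketing so retries land on the same manifest.
        bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
        if self.candidate is not None and bucket < self.candidate_pct:
            return self.candidate
        return self.stable
```

The 1-10-50 percent ramp is then just three `start_canary` calls, each gated on the real-time signals above.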

6. Keep prompt work as data

  • Store prompts and exemplars in a versioned store with diff history. Git is fine. Even better is a small API that returns prompt id plus hash.
  • Each exemplar references a dataset row ID. Do not paste raw example text into the template without provenance.

7. Cache hygiene

  • Include model bundle ID, orchestration ID, and retrieval snapshot ID in cache keys. If you use post-processing, include that version too.
  • Different capabilities get different caches. Mixed caches are where subtle bugs hide.
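A cache key builder that follows both rules, as a sketch; the field names are illustrative:

```python
import hashlib

def cache_key(capability, model_bundle_id, orchestration_id,
              retrieval_bundle_id, postproc_version, tenant_id, normalized_input):
    """Every behavior-changing field goes into the key."""
    parts = [model_bundle_id, orchestration_id, retrieval_bundle_id,
             postproc_version, tenant_id, normalized_input]
    digest = hashlib.sha256("|".join(parts).encode()).hexdigest()[:24]
    # Prefixing with the capability keeps each capability in its own key space.
    return f"{capability}:{digest}"
```

Rotate any bundle and the old entries become unreachable instead of stale, which is exactly the behavior you want.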

8. Observability you will actually use

Log per request:
– manifest_hash and all bundle IDs
– user capability and tenant
– token counts in and out
– retrieval doc IDs and ranks
– tool calls with tool schema ID and success flag
– provider model ID actually served
– p50, p95 latencies per stage
– cost estimate breakdown
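One structured record per request covers the whole list. A sketch, assuming a manifest dict whose bundle fields end in `_id` and a sink with a `write` method; every field name here is an assumption:

```python
import json
import time

def log_request(trace_sink, manifest, stage_latencies_ms, tokens_in, tokens_out,
                retrieval_hits, tool_calls, served_model_id, cost_usd):
    """Emit one structured record per request; fields follow the list above."""
    record = {
        "ts": time.time(),
        "manifest_hash": manifest["manifest_hash"],
        "bundles": {k: v for k, v in manifest.items() if k.endswith("_id")},
        "tokens": {"in": tokens_in, "out": tokens_out},
        "retrieval": [{"doc_id": d, "rank": r} for r, d in enumerate(retrieval_hits, 1)],
        "tool_calls": tool_calls,             # each with tool schema ID and success flag
        "served_model_id": served_model_id,   # what the provider actually served
        "stage_latency_ms": stage_latencies_ms,
        "cost_usd": cost_usd,
    }
    trace_sink.write(json.dumps(record, sort_keys=True) + "\n")
    return record
```

Per-stage p50 and p95 come from aggregating these records downstream; the per-request job is only to never drop a field.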

You will not look at half of this daily. You will need it at 2 AM.

Business impact you can explain to a CFO

  • Fewer outages and faster MTTR. With bundle-aware rollbacks, you stop the bleeding in minutes instead of burning a sprint.
  • Predictable costs. Offline evals tied to manifests catch accidental cost jumps from longer outputs or worse retrieval before they hit real traffic.
  • Lower support load. Silent regressions are expensive. Getting back to a known good state quickly avoids retesting entire user journeys.
  • Auditability. If you touch finance, healthcare, or legal, you will be asked to reproduce a historical answer. You either have the manifest and snapshots, or you do not have a story.
  • Velocity with safety. Teams move faster when experiments are isolated by design and cannot corrupt prod caches or indexes.

Rough numbers I have seen after putting the above in place:
– 30 to 60 percent reduction in incident time-to-root-cause on LLM regressions
– 10 to 20 percent reduction in inference spend due to cache isolation and early eval catches
– Two to three times faster model upgrades since shadow and canary flows are repeatable

Key takeaways

  • Version capabilities, not just models. Make the contract the top-level unit.
  • Use immutable, addressable bundles for orchestration, model, retrieval, tools, and guardrails. Hash them and log them.
  • Treat RAG indexes like compiled artifacts with manifests and snapshots.
  • Build evals that target a manifest, not ad hoc flags. Measure distributions, not single runs.
  • Include the manifest hash in cache keys and traces. This single step prevents a class of heisenbugs.
  • Roll out by bundle with hard revert switches. Do not trust provider aliases.

If this is painful, you are not alone

Most teams bolt this in after their first serious incident. If you want a second set of eyes on your manifest design, rollout gates, or how to snapshot RAG without ballooning storage, this is exactly the kind of thing I help teams fix when systems start breaking at scale.