You Don’t Have an ML Model Problem, You Have a Feedback Loop Problem

Why this matters this week
Most “production ML” incidents I’ve heard about in the past month weren’t about model quality at all. They were about:
- Silent regressions after a retrain
- Feature pipelines drifting away from their “as-trained” definitions
- Retrievers or ranking models getting more expensive than the value they create
- Monitoring dashboards that look healthy while business metrics crater
Three themes keep repeating:
- Evaluation happens offline, reality happens online. Teams can recite their ROC-AUC but can’t answer “What’s the P95 error on revenue per session for users touched by the model in the last 72 hours?” (a minimal sketch of that kind of query follows this list).
- Data contracts are assumed, not enforced. Feature pipeline changes (schema, distribution, semantics) roll out like any other data change, except your model silently overfits to last quarter’s behavior.
- Cost and latency don’t get modeled as first-class metrics. Many orgs bolt on vector search, larger models, and more features without a clear SLO or unit-economics model, then get surprised when infra spend doubles.
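To make the first theme concrete, here is a minimal pandas sketch of the kind of online question a team should be able to answer on demand. The table and column names (predictions, sessions, request_id, predicted_revenue, revenue, ts) are hypothetical placeholders for whatever your logging actually produces.

```python
# Minimal sketch: P95 absolute error on revenue per session for users the model
# touched in the last 72 hours. All table and column names are hypothetical.
import pandas as pd

def p95_revenue_error_last_72h(predictions: pd.DataFrame, sessions: pd.DataFrame) -> float:
    cutoff = pd.Timestamp.now(tz="UTC") - pd.Timedelta(hours=72)
    recent = predictions[predictions["ts"] >= cutoff]              # ts assumed tz-aware UTC
    joined = recent.merge(sessions, on=["request_id", "user_id"], how="inner")
    abs_error = (joined["predicted_revenue"] - joined["revenue"]).abs()
    return float(abs_error.quantile(0.95))
```

If you can’t write this query today, that gap is the first thing to fix.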
If you run ML in production, your real job is designing closed-loop systems: evaluation, monitoring, drift detection, and control of cost/latency. Models are just modules inside that loop.
What’s actually changed (not the press release)
Three concrete shifts over the last 12–18 months (especially with applied ML and LLM-based systems):
- Live evaluation is now tractable, if you design for it.
  - Logging complete inputs, model outputs, and outcomes is cheaper and more common.
  - Feature stores and event streams make it possible to retroactively build labels and metrics.
  - The bottleneck isn’t tech, it’s schema discipline: deciding what to log and how to join it.
- Traffic patterns and user behavior are less stationary.
  - LLM-augmented search, recommendations, and personalization change how users explore products.
  - Your old “once-a-quarter retrain is fine” assumption is less valid because your ML system affects the user distribution, which affects future data.
  - Covariate shift and feedback loops are no longer edge cases; they’re the default.
- Inference cost / latency trade-offs actually matter at scale.
  - Embedding and retrieval for every request adds up.
  - More complex feature pipelines move compute earlier in the request path.
  - For many teams, retrieval + features + glue is more expensive than the model itself.
In other words, we’re not in a world where better offline benchmarks are the bottleneck; we’re in a world where end-to-end ML system design is.
How it works (simple mental model)
Here’s a minimal mental model I use when reviewing ML systems in production. Think of four loops:
1. Prediction loop (per-request path)
   - Input → feature pipeline → model → post-processing → action
   - Constraints: latency, availability, and cost per call
   - Observable: prediction logs, feature values, model version, timing
2. Outcome loop (delayed labels)
   - Action → user/environment response → business event (label)
   - Examples:
     - Credit model: prediction today, default or repayment over months
     - Recommendation: click/engagement within hours or days
   - This loop defines the ground truth that determines whether your model is useful.
3. Evaluation loop (offline + online)
   - Join prediction logs with labels → compute metrics over cohorts and time windows (see the sketch after this list)
   - Two key classes of metrics:
     - Model-centric: accuracy, ranking metrics, calibration, error by segment
     - System-centric: revenue, risk, throughput, tail latency, infra spend
   - You need both. Offline model metrics alone are easy to game.
4. Adaptation loop (retraining + config changes)
   - Decide when to retrain, when to roll back, and when to ship new features
   - Gate new models with:
     - Offline regression tests
     - Shadow/traffic-split tests
     - Guardrails on safety, cost, and key business metrics
   - This loop ensures you don’t “optimize yourself off a cliff.”
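Here is a minimal sketch of the evaluation loop’s core join: prediction logs matched with delayed labels, then one model-centric and two system-centric metrics per cohort per day. The column names (request_id, ts, score, label, revenue, latency_ms, customer_tier, model_version) are hypothetical and stand in for whatever your logging schema defines.

```python
# Minimal sketch: join prediction logs with delayed labels, then compute daily
# metrics per cohort. Column names are illustrative, not a standard schema.
import pandas as pd

def cohort_metrics(preds: pd.DataFrame, labels: pd.DataFrame,
                   segment_col: str = "customer_tier") -> pd.DataFrame:
    joined = preds.merge(labels, on="request_id", how="inner")
    joined["day"] = joined["ts"].dt.date
    joined["abs_error"] = (joined["score"] - joined["label"]).abs()
    return (
        joined.groupby(["day", segment_col, "model_version"])
        .agg(
            n=("request_id", "count"),
            mae=("abs_error", "mean"),                                # model-centric
            revenue=("revenue", "sum"),                               # system-centric
            p95_latency_ms=("latency_ms", lambda s: s.quantile(0.95)),
        )
        .reset_index()
    )
```

The adaptation loop then consumes the output of this join: retrain, roll back, or hold based on how these numbers move per cohort.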
Each loop has its own failure modes, and most outages come from poor interfaces between loops:
- Predictions logged but not joinable with labels → no trustworthy evaluation
- Feature pipelines evolving faster than models → prediction loop misaligned with evaluation data
- Retrain jobs triggered on a cron instead of on data conditions → overfitting or stale models
If you can diagram these four loops for your system, you can reason about monitoring and drift.
Where teams get burned (failure modes + anti-patterns)
1. Offline heroics, online blindness
Anti-pattern:
- Huge energy on model selection, benchmarking, and hyperparameters
- Thin or missing online metrics and cohort breakdowns
- Incidents discovered via finance or support, not monitoring
Real-world example pattern:
- A marketplace deploys a new ranking model with +4% NDCG offline.
- Overall revenue goes up 1%, but a high-value seller segment sees -10% orders.
- The team has no per-segment online evaluation; they only notice after the quarterly business review.
What would have prevented it:
- A small set of guardrail metrics by segment (e.g., revenue and exposure by seller class) tied to the deployment pipeline, as in the sketch below.
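A minimal sketch of such a per-segment guardrail gate, assuming the deployment pipeline can fetch baseline and candidate metrics per segment; the metric values and the 2% threshold are hypothetical.

```python
# Minimal sketch: block a rollout if any segment's metric drops too far vs. baseline.
def segment_guardrails_pass(candidate: dict[str, float],
                            baseline: dict[str, float],
                            max_relative_drop: float = 0.02) -> bool:
    for segment, base_value in baseline.items():
        cand_value = candidate.get(segment, 0.0)
        if base_value > 0 and (base_value - cand_value) / base_value > max_relative_drop:
            print(f"Guardrail violated for segment '{segment}': "
                  f"{base_value:.2f} -> {cand_value:.2f}")
            return False
    return True

# Example: revenue per session by seller class, from the online evaluation job.
baseline = {"high_value_sellers": 12.4, "long_tail_sellers": 3.1}
candidate = {"high_value_sellers": 11.1, "long_tail_sellers": 3.2}
assert not segment_guardrails_pass(candidate, baseline)  # the -10% segment blocks the rollout
```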
2. “Feature store” without contracts
Anti-pattern:
- Central feature store, but schemas are soft suggestions, not contracts.
- Upstream analytics teams freely change semantics (e.g., the “active_user” definition) without versioning.
- Training data is built from historical snapshots that don’t reflect current online transformations.
Real-world example pattern:
- A fraud model uses total_spend_30d.
- Finance changes the definition to exclude refunds and chargebacks for reporting reasons.
- No one retrains the model; fraud scores shift, and manual review queues spike.
What would have prevented it:
- Versioned feature definitions with CI checks that block changes unless:
  - The owner acknowledges downstream model usage, or
  - A new feature version is created and rolled out intentionally (a minimal CI-check sketch follows).
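One way to implement that gate is a small CI check against a committed feature registry. This is a sketch under assumptions: the registry structure, the hashing scheme, and the names (FEATURE_REGISTRY, fraud_v7) are all hypothetical.

```python
# Minimal sketch of a CI check for feature contracts: fail the build if a feature's
# definition changes without a version bump. Registry structure is hypothetical.
import hashlib
import json
import sys

FEATURE_REGISTRY = {
    "total_spend_30d": {"version": 3, "definition_hash": "a1b2c3", "models": ["fraud_v7"]},
}

def check_feature_change(name: str, new_definition: dict, new_version: int) -> None:
    entry = FEATURE_REGISTRY[name]
    new_hash = hashlib.sha256(
        json.dumps(new_definition, sort_keys=True).encode()
    ).hexdigest()[:6]
    if new_hash != entry["definition_hash"] and new_version <= entry["version"]:
        sys.exit(
            f"Feature '{name}' semantics changed without a version bump. "
            f"Downstream models: {entry['models']}. Create a new feature version "
            f"or get sign-off from the model owners."
        )
```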
3. Silent data drift and feedback loops
Anti-pattern:
- Covariate drift detection is either missing or global only.
- No notion of policy-induced drift (the model changes user behavior).
- Retrain schedules are calendar-based (“every X weeks”) rather than data-based.
Real-world example pattern:
- A recommendations model optimizes for short-term clicks.
- Over time, it surfaces more sensational content; user behavior shifts.
- Training data increasingly reflects the model’s own biases, so retrains make it worse.
What would have prevented it:
- Monitoring both:
  - Input feature distributions vs. the training baseline
  - The response distribution and delayed outcomes over time
- Constraints that penalize short-term optimizations that destroy long-term metrics.
4. Ignoring cost and latency until the bill arrives
Anti-pattern:
- New retrieval layer, embeddings, and re-rankers introduced incrementally.
- No explicit latency budget or cost-per-1k-requests target.
- Scaling decisions made reactively, only after infra costs cross an informal sense of “too much.”
Real-world example pattern:
- A search team adds semantic search plus personalization features.
- P99 latency creeps from 150ms to 550ms; infra spend doubles.
- They only notice when SREs push back and leadership asks “Why is infra up 2x?”
What would have prevented it:
- For each major ML component: defined SLOs and unit costs, and a simple spreadsheet model of “dollars per incremental metric gain.”
Practical playbook (what to do in the next 7 days)
You can’t fix everything in a week, but you can put the bones of a reliable ML system in place.
Day 1–2: Make the loops observable
- Log the right things for every prediction (a minimal log-record sketch follows this list):
  - Request ID, timestamp
  - Model/version ID
  - Raw input keys (hashed/PII-safe where needed)
  - Feature vector snapshot (or at least a stable hash and key stats)
  - Model output(s) and decision threshold(s)
  - Latency breakdown (feature fetch, model inference, post-processing)
- Define join keys for outcomes:
  - For each prediction type, define “How will we know if this was good or bad?”
  - Example: recommendation → downstream event clicked(item_id, user_id, request_id) within a time window
  - Make sure the keys required to join are actually logged on both sides.
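As a starting point, here is a minimal sketch of such a log record as a dataclass. Every field name and default is a hypothetical placeholder, not a required schema; the outcome side then only needs to log the same request_id next to the business event to make the evaluation join possible.

```python
# Minimal sketch of a per-prediction log record for a recommendation-style system.
# Field names and defaults are illustrative placeholders.
from dataclasses import dataclass, field
from typing import Optional
import time
import uuid

@dataclass
class PredictionLogRecord:
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)
    model_version: str = "ranker-2024-05-01"             # hypothetical version tag
    user_id_hash: str = ""                                # hashed / PII-safe key
    feature_hash: str = ""                                # stable hash of the feature vector
    feature_stats: dict = field(default_factory=dict)     # e.g., {"total_spend_30d": 412.0}
    outputs: list = field(default_factory=list)           # ranked item ids or raw scores
    decision_threshold: Optional[float] = None
    latency_ms: dict = field(default_factory=dict)        # {"features": 12, "inference": 30, "post": 4}
```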
Day 3–4: Establish minimal evaluation and drift monitoring
- Compute basic metrics by cohort and over time:
  - Start with:
    - Global accuracy / ranking metric
    - Metric by at least 2–3 key segments (e.g., geography, customer tier, device)
  - Plot daily values for the last N days to see volatility and trends.
- Add 2–3 simple drift indicators (a minimal sketch follows this list):
  - For key features, compare the current distribution vs. the training baseline using:
    - KS statistic or population stability index, or even just mean/std and quantiles
  - Alert on large deviations or monotonic trends.
  - Track prediction distribution drift as well (e.g., average score, fraction above threshold).
- Wire basic alerts:
  - Don’t overcomplicate this early. Examples:
    - “If metric X drops >Y% from the 30-day median for Z hours, alert.”
    - “If P95 latency exceeds the SLO for N minutes, alert.”
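Here is a minimal sketch of one such drift indicator, the population stability index (PSI) between a training baseline and a live window, with a naive alert. The samples are synthetic stand-ins, and the 0.2 cutoff is a common rule of thumb rather than a universal constant.

```python
# Minimal sketch: PSI between a training baseline and current traffic for one feature.
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    # Quantile bin edges come from the training baseline; clip outliers into the range.
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    base_counts = np.histogram(np.clip(baseline, edges[0], edges[-1]), bins=edges)[0]
    curr_counts = np.histogram(np.clip(current, edges[0], edges[-1]), bins=edges)[0]
    base_frac = np.clip(base_counts / len(baseline), 1e-6, None)
    curr_frac = np.clip(curr_counts / len(current), 1e-6, None)
    return float(np.sum((curr_frac - base_frac) * np.log(curr_frac / base_frac)))

# Synthetic example: baseline vs. a shifted live window for one key feature.
rng = np.random.default_rng(0)
train_sample = rng.normal(100, 20, 10_000)   # stand-in for the as-trained distribution
live_sample = rng.normal(110, 25, 2_000)     # stand-in for the last day of live traffic
drift = psi(train_sample, live_sample)
if drift > 0.2:                               # common "investigate" threshold
    print(f"drift alert: PSI={drift:.2f} vs. training baseline")
```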
Day 5–6: Put guardrails around cost and latency
- Quantify inference economics (a minimal sketch follows this list):
  - For each model / retrieval layer:
    - Estimated cost per 1k predictions (infra + licenses)
    - P50 / P95 / P99 latency
  - Attach it to business metrics:
    - e.g., “This recommender adds ~3% revenue per session at $0.15 per 1k predictions.”
- Define a latency and cost budget:
  - Example:
    - Max additional latency from the ML stack per request: 100ms at P95
    - Max ML infra spend as a % of revenue from the affected surface
  - Use this budget to evaluate future “just add another model” ideas.
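The “dollars per incremental metric gain” arithmetic fits in a few lines. All numbers below are hypothetical, and one request is treated as one session purely to keep the sketch simple; plug in your own measurements.

```python
# Minimal sketch of unit economics for one ML surface, with made-up numbers.
monthly_requests = 50_000_000
cost_per_1k_predictions = 0.15           # USD, infra + licenses
revenue_per_session_baseline = 2.00      # USD per session without the model
relative_revenue_lift = 0.03             # +3% revenue per session, from online evaluation

monthly_ml_cost = monthly_requests / 1_000 * cost_per_1k_predictions
monthly_incremental_revenue = (
    monthly_requests * revenue_per_session_baseline * relative_revenue_lift
)
print(f"ML cost per month:             ${monthly_ml_cost:,.0f}")
print(f"Incremental revenue per month: ${monthly_incremental_revenue:,.0f}")
print(f"Infra dollars per incremental revenue dollar: "
      f"{monthly_ml_cost / monthly_incremental_revenue:.3f}")
```

The same arithmetic, run before adding the next retrieval layer or bigger model, is what keeps “just add another model” ideas honest.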
Day 7: Document contracts and decisions
- Write a 1-page “ML system contract” per major model:
- Purpose and key business metric(s)
- Features used (with owners)
- Label definition and source
- Retrain policy (trigger conditions, frequency)
- SLOs: metrics, latency, cost, and what triggers rollback
- Monitoring: where the dashboards and alerts live, and who gets paged (a minimal contract sketch follows this list)
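A sketch of what such a contract might look like when captured as structured data, so it can be versioned and checked in CI. Every field value here is a hypothetical example, not a prescribed schema.

```python
# Minimal sketch of a 1-page "ML system contract" as structured data. All values
# (model names, thresholds, dashboard paths) are hypothetical examples.
FRAUD_MODEL_CONTRACT = {
    "model": "fraud_v7",
    "purpose": "Block high-risk transactions before fulfillment",
    "business_metrics": ["chargeback_rate", "manual_review_queue_size"],
    "features": {"total_spend_30d": {"owner": "payments-data", "version": 3}},
    "label": "confirmed fraud within 90 days, from the chargeback feed",
    "retrain_policy": {"trigger": "PSI > 0.2 on any key feature, or monthly", "approver": "ml-platform"},
    "slos": {
        "p95_latency_ms": 80,
        "cost_per_1k_predictions_usd": 0.05,
        "rollback_if": "precision drops >5% vs. 30-day median",
    },
    "monitoring": {"dashboard": "grafana:/fraud_v7", "alerts": ["drift", "latency", "guardrails"]},
}
```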
