You Don’t Have an ML Model Problem, You Have a Feedback Loop Problem

Why this matters this week
Most “production ML” incidents I’ve heard about in the past month weren’t about model quality at all. They were about:
- Silent regressions after a retrain
- Feature pipelines drifting away from their “as-trained” definitions
- Retrievers or ranking models getting more expensive than the value they create
- Monitoring dashboards that look healthy while business metrics crater
Three themes keep repeating:
- Evaluation happens offline, reality happens online. Teams can recite their ROC-AUC but can’t answer “What’s the P95 error on revenue per session for users touched by the model in the last 72 hours?” (a minimal sketch of that kind of query follows this list).
- Data contracts are assumed, not enforced. Feature pipeline changes (schema, distribution, semantics) roll out like any other data change, except your model silently overfits to last quarter’s behavior.
- Cost and latency don’t get modeled as first-class metrics. Many orgs bolt on vector search, larger models, and more features without a clear SLO or unit-economics model, then get surprised when infra spend doubles.
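To make the first theme concrete, here is a minimal pandas sketch of the kind of online question a team should be able to answer on demand. The table and column names (predictions, sessions, request_id, predicted_revenue, revenue, ts) are hypothetical placeholders for whatever your logging actually produces.

```python
# Minimal sketch: P95 absolute error on revenue per session for users the model
# touched in the last 72 hours. All table and column names are hypothetical.
import pandas as pd

def p95_revenue_error_last_72h(predictions: pd.DataFrame, sessions: pd.DataFrame) -> float:
    cutoff = pd.Timestamp.now(tz="UTC") - pd.Timedelta(hours=72)
    recent = predictions[predictions["ts"] >= cutoff]              # ts assumed tz-aware UTC
    joined = recent.merge(sessions, on=["request_id", "user_id"], how="inner")
    abs_error = (joined["predicted_revenue"] - joined["revenue"]).abs()
    return float(abs_error.quantile(0.95))
```

If you can’t write this query today, that gap is the first thing to fix.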
If you run ML in production, your real job is designing closed-loop systems: evaluation, monitoring, drift detection, and control of cost/latency. Models are just modules inside that loop.
What’s actually changed (not the press release)
Three concrete shifts over the last 12–18 months (especially with applied ML and LLM-based systems):
- Live evaluation is now tractable, if you design for it.
  - Logging complete inputs, model outputs, and outcomes is cheaper and more common.
  - Feature stores and event streams make it possible to retroactively build labels and metrics.
  - The bottleneck isn’t tech, it’s schema discipline: deciding what to log and how to join it.
- Traffic patterns and user behavior are less stationary.
  - LLM-augmented search, recommendations, and personalization change how users explore products.
  - Your old “once-a-quarter retrain is fine” assumption is less valid because your ML system affects the user distribution, which affects future data.
  - Covariate shift and feedback loops are no longer edge cases; they’re the default.
- Inference cost / latency trade-offs actually matter at scale.
  - Embedding and retrieval for every request adds up.
  - More complex feature pipelines move compute earlier in the request path.
  - For many teams, retrieval + features + glue is more expensive than the model itself.
In other words, we’re not in a world where better offline benchmarks are the bottleneck; we’re in a world where end-to-end ML system design is.
How it works (simple mental model)
Here’s a minimal mental model I use when reviewing ML systems in production. Think of four loops:
1. Prediction loop (per-request path)
   - Input → feature pipeline → model → post-processing → action
   - Constraints: latency, availability, and cost per call
   - Observable: prediction logs, feature values, model version, timing
2. Outcome loop (delayed labels)
   - Action → user/environment response → business event (label)
   - Examples:
     - Credit model: prediction today, default or repayment over months
     - Recommendation: click/engagement within hours or days
   - This loop defines the ground truth that determines whether your model is useful.
3. Evaluation loop (offline + online)
   - Join prediction logs with labels → compute metrics over cohorts and time windows (see the sketch after this list)
   - Two key classes of metrics:
     - Model-centric: accuracy, ranking metrics, calibration, error by segment
     - System-centric: revenue, risk, throughput, tail latency, infra spend
   - You need both. Offline model metrics alone are easy to game.
4. Adaptation loop (retraining + config changes)
   - Decide when to retrain, when to roll back, and when to ship new features
   - Gate new models with:
     - Offline regression tests
     - Shadow/traffic-split tests
     - Guardrails on safety, cost, and key business metrics
   - This loop ensures you don’t “optimize yourself off a cliff.”
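Here is a minimal sketch of the evaluation loop’s core join: prediction logs matched with delayed labels, then one model-centric and two system-centric metrics per cohort per day. The column names (request_id, ts, score, label, revenue, latency_ms, customer_tier, model_version) are hypothetical and stand in for whatever your logging schema defines.

```python
# Minimal sketch: join prediction logs with delayed labels, then compute daily
# metrics per cohort. Column names are illustrative, not a standard schema.
import pandas as pd

def cohort_metrics(preds: pd.DataFrame, labels: pd.DataFrame,
                   segment_col: str = "customer_tier") -> pd.DataFrame:
    joined = preds.merge(labels, on="request_id", how="inner")
    joined["day"] = joined["ts"].dt.date
    joined["abs_error"] = (joined["score"] - joined["label"]).abs()
    return (
        joined.groupby(["day", segment_col, "model_version"])
        .agg(
            n=("request_id", "count"),
            mae=("abs_error", "mean"),                                # model-centric
            revenue=("revenue", "sum"),                               # system-centric
            p95_latency_ms=("latency_ms", lambda s: s.quantile(0.95)),
        )
        .reset_index()
    )
```

The adaptation loop then consumes the output of this join: retrain, roll back, or hold based on how these numbers move per cohort.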
Each loop has its own failure modes, and most outages come from poor interfaces between loops:
- Predictions logged but not joinable with labels → no trustworthy evaluation
- Feature pipelines evolving faster than models → prediction loop misaligned with evaluation data
- Retrain jobs triggered on a cron instead of on data conditions → overfitting or stale models
If you can diagram these four loops for your system, you can reason about monitoring and drift.
Where teams get burned (failure modes + anti-patterns)
1. Offline heroics, online blindness
Anti-pattern:
- Huge energy on model selection, benchmarking, and hyperparameters
- Thin or missing online metrics and cohort breakdowns
- Incidents discovered via finance or support, not monitoring
Real-world example pattern:
- A marketplace deploys a new ranking model with +4% NDCG offline.
- Overall revenue goes up 1%, but a high-value seller segment sees -10% orders.
- The team has no per-segment online evaluation; they only notice after the quarterly business review.
What would have prevented it:
- A small set of guardrail metrics by segment (e.g., revenue and exposure by seller class) tied to the deployment pipeline, as in the sketch below.
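A minimal sketch of such a per-segment guardrail gate, assuming the deployment pipeline can fetch baseline and candidate metrics per segment; the metric values and the 2% threshold are hypothetical.

```python
# Minimal sketch: block a rollout if any segment's metric drops too far vs. baseline.
def segment_guardrails_pass(candidate: dict[str, float],
                            baseline: dict[str, float],
                            max_relative_drop: float = 0.02) -> bool:
    for segment, base_value in baseline.items():
        cand_value = candidate.get(segment, 0.0)
        if base_value > 0 and (base_value - cand_value) / base_value > max_relative_drop:
            print(f"Guardrail violated for segment '{segment}': "
                  f"{base_value:.2f} -> {cand_value:.2f}")
            return False
    return True

# Example: revenue per session by seller class, from the online evaluation job.
baseline = {"high_value_sellers": 12.4, "long_tail_sellers": 3.1}
candidate = {"high_value_sellers": 11.1, "long_tail_sellers": 3.2}
assert not segment_guardrails_pass(candidate, baseline)  # the -10% segment blocks the rollout
```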
2. “Feature store” without contracts
Anti-pattern:
- Central feature store, but schemas are soft suggestions, not contracts.
- Upstream analytics teams freely change semantics (e.g., the “active_user” definition) without versioning.
- Training data is built from historical snapshots that don’t reflect current online transformations.
Real-world example pattern:
- A fraud model uses total_spend_30d.
- Finance changes the definition to exclude refunds and chargebacks for reporting reasons.
- No one retrains the model; fraud scores shift, and manual review queues spike.
What would have prevented it:
- Versioned feature definitions with CI checks that block changes unless:
  - The owner acknowledges downstream model usage, or
  - A new feature version is created and rolled out intentionally (a minimal CI-check sketch follows).
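One way to implement that gate is a small CI check against a committed feature registry. This is a sketch under assumptions: the registry structure, the hashing scheme, and the names (FEATURE_REGISTRY, fraud_v7) are all hypothetical.

```python
# Minimal sketch of a CI check for feature contracts: fail the build if a feature's
# definition changes without a version bump. Registry structure is hypothetical.
import hashlib
import json
import sys

FEATURE_REGISTRY = {
    "total_spend_30d": {"version": 3, "definition_hash": "a1b2c3", "models": ["fraud_v7"]},
}

def check_feature_change(name: str, new_definition: dict, new_version: int) -> None:
    entry = FEATURE_REGISTRY[name]
    new_hash = hashlib.sha256(
        json.dumps(new_definition, sort_keys=True).encode()
    ).hexdigest()[:6]
    if new_hash != entry["definition_hash"] and new_version <= entry["version"]:
        sys.exit(
            f"Feature '{name}' semantics changed without a version bump. "
            f"Downstream models: {entry['models']}. Create a new feature version "
            f"or get sign-off from the model owners."
        )
```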
3. Silent data drift and feedback loops
Anti-pattern:
- Covariate drift detection is either missing or global only.
- No notion of policy-induced drift (the model changes user behavior).
- Retrain schedules are calendar-based (“every X weeks”) rather than data-based.
Real-world example pattern:
- A recommendations model optimizes for short-term clicks.
- Over time, it surfaces more sensational content; user behavior shifts.
- Training data increasingly reflects the model’s own biases, so retrains make it worse.
What would have prevented it:
- Monitoring both:
  - Input feature distributions vs. the training baseline
  - The response distribution and delayed outcomes over time
- Constraints that penalize short-term optimizations that destroy long-term metrics.
4. Ignoring cost and latency until the bill arrives
Anti-pattern:
- New retrieval layer, embeddings, and re-rankers introduced incrementally.
- No explicit latency budget or cost-per-1k-requests target.
- Scaling decisions made reactively, only after infra costs cross an informal sense of “too much.”
Real-world example pattern:
- A search team adds semantic search plus personalization features.
- P99 latency creeps from 150ms to 550ms; infra spend doubles.
- They only notice when SREs push back and leadership asks “Why is infra up 2x?”
What would have prevented it:
- For each major ML component: defined SLOs and unit costs, and a simple spreadsheet model of “dollars per incremental metric gain.”
Practical playbook (what to do in the next 7 days)
You can’t fix everything in a week, but you can put the bones of a reliable ML system in place.
Day 1–2: Make the loops observable
- Log the right things for every prediction (a minimal log-record sketch follows this list):
  - Request ID, timestamp
  - Model/version ID
  - Raw input keys (hashed/PII-safe where needed)
  - Feature vector snapshot (or at least a stable hash and key stats)
  - Model output(s) and decision threshold(s)
  - Latency breakdown (feature fetch, model inference, post-processing)
- Define join keys for outcomes:
  - For each prediction type, define “How will we know if this was good or bad?”
  - Example: recommendation → downstream event clicked(item_id, user_id, request_id) within a time window
  - Make sure the keys required to join are actually logged on both sides.
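As a starting point, here is a minimal sketch of such a log record as a dataclass. Every field name and default is a hypothetical placeholder, not a required schema; the outcome side then only needs to log the same request_id next to the business event to make the evaluation join possible.

```python
# Minimal sketch of a per-prediction log record for a recommendation-style system.
# Field names and defaults are illustrative placeholders.
from dataclasses import dataclass, field
from typing import Optional
import time
import uuid

@dataclass
class PredictionLogRecord:
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)
    model_version: str = "ranker-2024-05-01"             # hypothetical version tag
    user_id_hash: str = ""                                # hashed / PII-safe key
    feature_hash: str = ""                                # stable hash of the feature vector
    feature_stats: dict = field(default_factory=dict)     # e.g., {"total_spend_30d": 412.0}
    outputs: list = field(default_factory=list)           # ranked item ids or raw scores
    decision_threshold: Optional[float] = None
    latency_ms: dict = field(default_factory=dict)        # {"features": 12, "inference": 30, "post": 4}
```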
Day 3–4: Establish minimal evaluation and drift monitoring
- Compute basic metrics by cohort and over time:
  - Start with:
    - Global accuracy / ranking metric
    - Metric by at least 2–3 key segments (e.g., geography, customer tier, device)
  - Plot daily values for the last N days to see volatility and trends.
- Add 2–3 simple drift indicators (a minimal sketch follows this list):
  - For key features, compare the current distribution vs. the training baseline using:
    - KS statistic or population stability index, or even just mean/std and quantiles
  - Alert on large deviations or monotonic trends.
  - Track prediction distribution drift as well (e.g., average score, fraction above threshold).
- Wire basic alerts:
  - Don’t overcomplicate this early. Examples:
    - “If metric X drops >Y% from the 30-day median for Z hours, alert.”
    - “If P95 latency exceeds the SLO for N minutes, alert.”
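Here is a minimal sketch of one such drift indicator, the population stability index (PSI) between a training baseline and a live window, with a naive alert. The samples are synthetic stand-ins, and the 0.2 cutoff is a common rule of thumb rather than a universal constant.

```python
# Minimal sketch: PSI between a training baseline and current traffic for one feature.
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    # Quantile bin edges come from the training baseline; clip outliers into the range.
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    base_counts = np.histogram(np.clip(baseline, edges[0], edges[-1]), bins=edges)[0]
    curr_counts = np.histogram(np.clip(current, edges[0], edges[-1]), bins=edges)[0]
    base_frac = np.clip(base_counts / len(baseline), 1e-6, None)
    curr_frac = np.clip(curr_counts / len(current), 1e-6, None)
    return float(np.sum((curr_frac - base_frac) * np.log(curr_frac / base_frac)))

# Synthetic example: baseline vs. a shifted live window for one key feature.
rng = np.random.default_rng(0)
train_sample = rng.normal(100, 20, 10_000)   # stand-in for the as-trained distribution
live_sample = rng.normal(110, 25, 2_000)     # stand-in for the last day of live traffic
drift = psi(train_sample, live_sample)
if drift > 0.2:                               # common "investigate" threshold
    print(f"drift alert: PSI={drift:.2f} vs. training baseline")
```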
Day 5–6: Put guardrails around cost and latency
- Quantify inference economics (a minimal sketch follows this list):
  - For each model / retrieval layer:
    - Estimated cost per 1k predictions (infra + licenses)
    - P50 / P95 / P99 latency
  - Attach it to business metrics:
    - e.g., “This recommender adds ~3% revenue per session at $0.15 per 1k predictions.”
- Define a latency and cost budget:
  - Example:
    - Max additional latency from the ML stack per request: 100ms at P95
    - Max ML infra spend as a % of revenue from the affected surface
  - Use this budget to evaluate future “just add another model” ideas.
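The “dollars per incremental metric gain” arithmetic fits in a few lines. All numbers below are hypothetical, and one request is treated as one session purely to keep the sketch simple; plug in your own measurements.

```python
# Minimal sketch of unit economics for one ML surface, with made-up numbers.
monthly_requests = 50_000_000
cost_per_1k_predictions = 0.15           # USD, infra + licenses
revenue_per_session_baseline = 2.00      # USD per session without the model
relative_revenue_lift = 0.03             # +3% revenue per session, from online evaluation

monthly_ml_cost = monthly_requests / 1_000 * cost_per_1k_predictions
monthly_incremental_revenue = (
    monthly_requests * revenue_per_session_baseline * relative_revenue_lift
)
print(f"ML cost per month:             ${monthly_ml_cost:,.0f}")
print(f"Incremental revenue per month: ${monthly_incremental_revenue:,.0f}")
print(f"Infra dollars per incremental revenue dollar: "
      f"{monthly_ml_cost / monthly_incremental_revenue:.3f}")
```

The same arithmetic, run before adding the next retrieval layer or bigger model, is what keeps “just add another model” ideas honest.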
Day 7: Document contracts and decisions
- Write a 1-page “ML system contract” per major model:
- Purpose and key business metric(s)
- Features used (with owners)
- Label definition and source
- Retrain policy (trigger conditions, frequency)
- SLOs: metrics, latency, cost, and what triggers rollback
- Monitoring: where the dashboards and alerts live, and who gets paged (a minimal contract sketch follows this list)
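A sketch of what such a contract might look like when captured as structured data, so it can be versioned and checked in CI. Every field value here is a hypothetical example, not a prescribed schema.

```python
# Minimal sketch of a 1-page "ML system contract" as structured data. All values
# (model names, thresholds, dashboard paths) are hypothetical examples.
FRAUD_MODEL_CONTRACT = {
    "model": "fraud_v7",
    "purpose": "Block high-risk transactions before fulfillment",
    "business_metrics": ["chargeback_rate", "manual_review_queue_size"],
    "features": {"total_spend_30d": {"owner": "payments-data", "version": 3}},
    "label": "confirmed fraud within 90 days, from the chargeback feed",
    "retrain_policy": {"trigger": "PSI > 0.2 on any key feature, or monthly", "approver": "ml-platform"},
    "slos": {
        "p95_latency_ms": 80,
        "cost_per_1k_predictions_usd": 0.05,
        "rollback_if": "precision drops >5% vs. 30-day median",
    },
    "monitoring": {"dashboard": "grafana:/fraud_v7", "alerts": ["drift", "latency", "guardrails"]},
}
```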
