Your Monitoring Stack Is Not Ready for Machine Learning
Why this matters right now
Most teams now have at least one machine learning system in production: recommendations, fraud scoring, lead scoring, ranking, anomaly detection, forecasting. The initial POC usually “works” and gets shipped. Six months later:
- Performance silently decays and no one notices until a business KPI falls off a cliff.
- Retraining jobs break on a schema change and default to months-old models.
- Cloud bills spike due to unnoticed inference hotspots.
- Everyone blames “data drift” without being able to quantify it.
Traditional monitoring, QA, and release practices assume relatively static logic. Machine learning systems are different:
- They change behavior without code changes (data drift, retraining).
- They fail gracefully but dangerously (slightly worse predictions that don’t page anyone).
- Their correctness is probabilistic and contextual, not binary.
If you treat ML like another microservice, you get systems that look healthy in Grafana while quietly losing you money.
This post is about the applied side: evaluation, monitoring, drift, feature pipelines, and cost/perf trade-offs — what you actually need to ship reliable models, not a Kaggle leaderboard.
What’s actually changed (not the press release)
Three concrete shifts in the last ~3 years have changed how applied ML needs to be run:
1. Model capacity is cheap; everything else got expensive.
- Off-the-shelf models (tabular, vision, NLP, LLMs) are now very strong out of the box.
- The bottleneck isn’t “can we train a good model?” but:
  - Can we feed it consistent, fresh, high-quality features?
  - Can we run it at acceptable latency and cost?
  - Can we tell when it’s degrading?
- Cost/performance trade-offs have moved from GPUs in training to serving and data plumbing.
2. Data is more volatile and entangled with product changes.
- Frequent UI/UX experiments, marketing campaigns, pricing changes, and new markets all change:
  - The distribution of inputs (features).
  - The meaning of labels (e.g., “conversion” after a funnel redesign).
- This means drift is often product-driven, not “the model got old.”
- Your model monitoring needs to understand deployment context, not just raw features.
3. ML infra looks more like traditional software infra, but expectations are higher.
- Feature stores, model registries, and inference services are mainstream.
- Teams expect:
  - Blue/green model rollouts.
  - Canary testing and staged rollbacks.
  - SLAs/SLOs for latency and quality.
- But ML adds extra dimensions: feature freshness, label delay, and implicit feedback loops.
The net: the hardest part is now operational excellence around ML, not the modeling itself.
How it works (simple mental model)
A simple, useful mental model for production ML: four loops that must stay in sync.
1. Feature loop (data → features → serving)
- Data sources (events, DBs, APIs).
- Transformations (ETL, feature pipelines).
- Online/real-time feature computation.
- Serving features to models.
Key properties:
- Consistency between training and serving code/logic.
- Freshness guarantees (max lag).
- Schema contracts and change management.
2. Model loop (training → selection → deployment)
- Offline training and evaluation.
- Model registry + metadata (version, timestamps, training data snapshot).
- Deployment strategies (shadow, A/B, canary).
- Rollback paths.
Key properties:
- Reproducible training.
- Comparable evaluation across versions.
- Clear decision gates: when is a model “good enough to ship”?
3. Feedback loop (predictions → outcomes → labels)
- Predictions logged with features and model version.
- Ground truth labels collected (immediate or delayed).
- Labeling noise and delay modeled explicitly.
Key properties:
- Feedback delay addressed explicitly (e.g., 30-day conversion windows).
- Outcomes correctly attributed to specific predictions.
- Clear label definitions that survive product changes.
4. Monitoring loop (signals → alerts → iteration)
- Technical metrics: latency, errors, throughput, resource utilization.
- Data/ML metrics:
  - Input drift / feature distribution shifts.
  - Prediction distribution monitoring.
  - Business KPIs tied to model outputs.
- Alerting and playbooks.
Key properties:
- Time-bucketed metrics with baselines.
- Distinction between “weird but ok” and “break glass now”.
- Coverage of both infra and model performance.
Production ML goes wrong when these loops desynchronize. Example:
- Product changes checkout flow → label semantics change → model trained on old behavior → feature loop still running happily → monitoring only watches 500s/latency → silent business degradation.
Where teams get burned (failure modes + anti-patterns)
1. Treating evaluation as a one-time offline exercise
Pattern:
- Big offline AUC/loss/accuracy numbers.
- Model shipped based purely on held-out test set.
- No continuous measurement in production.
Consequences:
- Performance erodes due to population shift.
- Offline metrics don’t match live business impact.
What to do instead:
- Define online evaluation metrics tied to your objective (e.g., uplift in approval rate at constant risk; revenue per session).
- Implement model version tagging in logs so you can compute live metrics per model.
- Accept that offline metrics are screening tools, not go/no-go gates.
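As a sketch of per-version live metrics: once prediction logs are tagged with a model version and joined to labels, a pure-Python aggregator is enough to start. The record fields (`model_version`, `y_true`, `score`) are illustrative names, not a prescribed schema.

```python
from collections import defaultdict

def brier_by_version(records: list[dict]) -> dict[str, float]:
    """Mean Brier score per model version.

    Each record is assumed to be a logged prediction joined with its
    eventual label: {"model_version": str, "y_true": 0 or 1, "score": float}.
    Lower is better; versions serving the same traffic are directly comparable.
    """
    sums: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for r in records:
        sums[r["model_version"]] += (r["score"] - r["y_true"]) ** 2
        counts[r["model_version"]] += 1
    return {version: sums[version] / counts[version] for version in sums}
```

Run daily, this gives a time series per version that you can plot next to the offline numbers that justified the ship decision.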
2. Feature pipelines that drift out of sync
Pattern:
- Separate code paths for feature engineering:
  - One in a notebook/ETL job for training.
  - Another in an online service for inference.
- They diverge over time.
Consequences:
- Training-serving skew: model sees different distribution in prod than in training.
- Bugs in one path are invisible until production metrics move.
Real-world example:
- A credit scoring team used a rolling 90-day average of transaction volume.
- Training used event timestamps; prod pipeline used processing time.
- Backfills changed the effective window; live traffic didn’t.
- Model appeared “biased” on new customers; root cause was feature skew.
What to do instead:
- Keep a single source of truth for feature logic:
  - Shared libraries, or
  - A feature store with reusable, versioned transformations.
- Add explicit tests that compare online vs offline feature values for the same entity and point in time.
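Such a parity test can be tiny. The sketch below uses two placeholder rolling-average implementations to stand in for the real offline and online code paths; the function names and tolerance are assumptions, not anyone's actual API.

```python
import math

# Hypothetical stand-ins: in a real test, `offline_rolling_avg` would call the
# training-time feature code and `online_rolling_avg` the serving-time
# implementation, for the same entity and the same as-of timestamp.
def offline_rolling_avg(amounts: list[float]) -> float:
    return sum(amounts) / len(amounts)

def online_rolling_avg(amounts: list[float]) -> float:
    return sum(amounts) / len(amounts)

def check_feature_parity(samples: list[list[float]], tol: float = 1e-6) -> None:
    """Fail loudly when the two code paths disagree on identical inputs."""
    for amounts in samples:
        off = offline_rolling_avg(amounts)
        on = online_rolling_avg(amounts)
        assert math.isclose(off, on, abs_tol=tol), (
            f"training-serving skew: offline={off}, online={on}, input={amounts}"
        )
```

Wired into CI, a check like this would have caught the event-time vs processing-time divergence in the credit scoring example above.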
3. Drift detection with no action path
Pattern:
- Implement fancy drift metrics (KL divergence, PSI, etc.).
- Alert channels get spammed about “drift detected.”
- No defined playbook for what to do.
Consequences:
- Alert fatigue; eventual disablement of checks.
- Drift is observed but not managed.
Real-world example:
- An e‑commerce search relevance system flagged drift every time there was a big sale.
- Engineers ignored it because “we know traffic changes on Black Friday.”
- Later, a broken logging pipeline looked like “yet another sale spike” and went unnoticed.
What to do instead:
- Tie drift tiers to specific actions:
  - Tier 1: expected drift → record, no action.
  - Tier 2: moderate unexpected drift → human investigation within 24h.
  - Tier 3: severe unexpected drift on critical features → freeze retraining / consider rollback.
- Encode known events (campaigns, launches) into your monitoring to avoid obvious false positives.
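As one possible implementation of the tiers, here is a PSI-based check using the common 0.1/0.25 rule-of-thumb cut-offs. Those thresholds are illustrative conventions, not a standard; tune them per feature.

```python
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index of `current` against `baseline`.

    Bin edges come from baseline quantiles; a small epsilon keeps
    empty bins from producing log(0).
    """
    edges = np.quantile(baseline, np.linspace(0.0, 1.0, bins + 1))
    current = np.clip(current, edges[0], edges[-1])  # out-of-range values land in edge bins
    b_frac = np.histogram(baseline, edges)[0] / len(baseline) + 1e-6
    c_frac = np.histogram(current, edges)[0] / len(current) + 1e-6
    return float(np.sum((c_frac - b_frac) * np.log(c_frac / b_frac)))

def drift_tier(psi_value: float, known_event: bool) -> int:
    """Map a PSI value to the action tiers above (thresholds are illustrative)."""
    if known_event or psi_value < 0.1:
        return 1  # expected or minor: record only
    if psi_value < 0.25:
        return 2  # moderate unexpected: investigate within 24h
    return 3      # severe unexpected: freeze retraining / consider rollback
```

The `known_event` flag is where campaign calendars plug in, turning Black Friday from a page into a log entry.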
4. Ignoring cost/latency until it’s too late
Pattern:
- Start with a “good” but heavy model.
- Traffic grows; infra scales.
- Cloud bill and tail latency quietly explode.
Consequences:
- Forced last-minute optimization/rewrite.
- Pressure to cut corners on monitoring or resilience.
Real-world example:
- A personalization team moved from a linear model to a deep neural network for ranking.
- P95 latency went from 20ms to 150ms; infra costs increased 5x.
- Later they discovered that 80% of the business gain came from 20% of the complexity.
What to do instead:
- Maintain multiple model tiers:
  - A cheap baseline (linear / gradient-boosted trees) for fallback and bulk traffic.
  - The expensive model only where incremental value is high.
- Track cost per 1,000 predictions and latency SLOs as first-class metrics.
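Both metrics are cheap to compute from data you likely already have; a minimal sketch (the nearest-rank P95 is good enough for a dashboard, if not for a statistics paper):

```python
import math

def cost_per_1k_predictions(total_cost_usd: float, n_predictions: int) -> float:
    """Unit-economics view of serving cost: dollars per 1,000 predictions."""
    return 1000.0 * total_cost_usd / n_predictions

def p95_ms(latencies_ms: list[float]) -> float:
    """P95 latency via the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return ordered[rank - 1]
```

Tracked per model tier, these two numbers make the "is the deep model worth 5x" conversation concrete instead of anecdotal.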
5. No ownership for “ML in prod”
Pattern:
- Data science “owns” the model.
- Platform/infra “owns” the runtime.
- Product “owns” the KPIs.
- No one owns the end-to-end reliability.
Consequences:
- Cross-team finger-pointing when things break.
- Model updates blocked by unclear approval paths.
What to do instead:
- Create an explicit ML service owner (could be a tech lead), responsible for model lifecycle, data quality, monitoring, and reliability.
- Document a RACI for:
  - Model updates.
  - Schema changes.
  - Retraining cadence.
  - Incident response.
Practical playbook (what to do in the next 7 days)
Assume you already have at least one ML system in production. Here’s a concrete, bounded plan.
Day 1–2: Make the invisible visible
- Inventory your ML systems. For each model/service, capture:
  - Owner (name, not team).
  - Purpose and main business KPI.
  - Where it runs (service name, endpoints).
  - Where training happens (jobs, notebooks, pipelines).
- Add minimal but critical logging. For each prediction, log:
  - Model version.
  - Key input features (a subset, not all).
  - Prediction output and confidence/score.
  - Request ID / user ID (if allowed) for joining with labels.
This doesn’t require a new platform; append to existing logs.
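A minimal sketch of such a log line, assuming the service already has a stdlib `logging` logger; the field names are illustrative, the point is one JSON line per prediction with enough keys to join against labels later.

```python
import json
import logging
import time
import uuid
from typing import Optional

def log_prediction(logger: logging.Logger, model_version: str,
                   features: dict, score: float,
                   request_id: Optional[str] = None) -> str:
    """Append one structured prediction record to an existing log stream."""
    record = {
        "ts": time.time(),
        "request_id": request_id or str(uuid.uuid4()),
        "model_version": model_version,
        "features": features,  # a curated subset, not the full vector
        "score": score,
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

Because it writes through whatever logger the service already uses, this lands in your existing log pipeline with no new infrastructure.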
Day 3–4: Establish basic evaluation in production
- Define 1–3 core quality metrics per model. Examples:
  - Conversion model: calibrated probability (Brier score), lift vs baseline.
  - Fraud model: precision at fixed recall, or vice versa.
  - Ranking model: CTR, revenue per impression, or NDCG.
- Compute these metrics by model version over time. Start with simple batch jobs:
  - Join prediction logs with labels.
  - Aggregate daily.
  - Visualize in your existing dashboard system.
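Assuming prediction logs with `request_id`, `ts`, `model_version`, and `score` columns plus a labels table, the daily batch job can be a short pandas script. The column names and the 0.5 decision threshold are illustrative.

```python
import pandas as pd

def daily_metric_by_version(preds: pd.DataFrame, labels: pd.DataFrame) -> pd.DataFrame:
    """Join prediction logs with labels and aggregate a daily accuracy.

    Assumed columns (rename to match your logs):
      preds:  request_id, ts, model_version, score
      labels: request_id, y_true (0/1)
    """
    df = preds.merge(labels, on="request_id", how="inner")
    df["day"] = pd.to_datetime(df["ts"]).dt.date
    df["hit"] = (df["score"] >= 0.5).astype(int) == df["y_true"]
    return (
        df.groupby(["day", "model_version"], as_index=False)["hit"]
          .mean()
          .rename(columns={"hit": "accuracy"})
    )
```

The output is already shaped for a dashboard: one row per day per model version.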
- Align on acceptable degradation thresholds, e.g., “If revenue per session drops by 3% for 3 consecutive days for this model version, trigger an incident.”
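That kind of consecutive-days rule encodes directly; this sketch assumes a list of daily metric values and a fixed baseline, both of which you'd pull from the Day 3–4 aggregates.

```python
def sustained_drop(daily_values: list[float], baseline: float,
                   drop_pct: float = 3.0, days: int = 3) -> bool:
    """True when the metric sits more than drop_pct below baseline
    for `days` consecutive days (the shape of the incident rule above)."""
    limit = baseline * (1.0 - drop_pct / 100.0)
    streak = 0
    for value in daily_values:
        streak = streak + 1 if value < limit else 0
        if streak >= days:
            return True
    return False
```

The streak reset matters: a single recovery day breaks the run, so noisy metrics don't trip the incident on scattered bad days.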
Day 5: Add the cheapest drift monitoring that works
- Monitor distributions for 5–10 critical features plus model outputs. Track:
  - Mean, stddev, min, max.
  - Simple population histograms.
  - Comparison of the last 24h vs a trailing 30d baseline.
- Alert only on large, unexpected shifts. Example rule: if a feature’s mean differs from baseline by more than 3 standard deviations AND it is not during a known campaign window, page the on-call.
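A sketch of that rule, assuming campaign windows are maintained as a simple list of date ranges (any calendar source would do):

```python
from datetime import date

def should_page(current_mean: float, baseline_mean: float, baseline_std: float,
                today: date, campaign_windows: list[tuple[date, date]]) -> bool:
    """Page only for shifts that are both large (>3 sigma on the mean)
    and not explained by a known campaign window."""
    in_campaign = any(start <= today <= end for start, end in campaign_windows)
    big_shift = abs(current_mean - baseline_mean) > 3.0 * baseline_std
    return big_shift and not in_campaign
```

Note that suppressed pages should still be recorded somewhere, so a real anomaly during a sale (the Black Friday failure mode above) leaves a trail.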
Day 6: Lock down feature/training consistency
- Identify the top 5 features by model importance.
- Manually review how they’re computed:
  - Is logic duplicated across codebases?
  - Is there a clear schema contract?
- For these features:
  - Document source tables/streams.
  - Add a quick unit test or data test that compares training vs serving computation on a sample and fails on schema changes.
Day 7: Set an explicit retraining and rollout policy
- Decide on a retraining cadence per model. Examples:
  - Fraud: weekly.
  - Recommendations: daily.
  - Credit scoring: monthly/quarterly.
- Document, per model:
  - Trigger: time-based, performance-based, or both.
  - Approval: who signs off.
- Define a simple rollout process. Start with:
  - Shadow mode or a 5–10% canary for new model versions.
  - Comparison of online metrics vs the current production version.
  - Automatic rollback if the latency SLO is broken or a business KPI is worse by X% or more.
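The rollback condition can start as one boolean check; the 2% default KPI tolerance below is illustrative, and in practice the inputs would come from the canary's live metrics.

```python
def should_rollback(canary_p95_ms: float, latency_slo_ms: float,
                    canary_kpi: float, prod_kpi: float,
                    max_kpi_drop_pct: float = 2.0) -> bool:
    """Automatic-rollback check for a canary model version: trip on a
    broken latency SLO or on a KPI drop vs the production version."""
    slo_broken = canary_p95_ms > latency_slo_ms
    kpi_drop_pct = (prod_kpi - canary_kpi) / prod_kpi * 100.0
    return slo_broken or kpi_drop_pct >= max_kpi_drop_pct
```

Evaluating this on a schedule during the canary window gives you an objective, pre-agreed rollback trigger instead of a debate mid-incident.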
This 7‑day plan doesn’t require adopting a new feature store or MLOps platform. It stitches ML into your existing engineering practices with minimal, high-leverage changes.
Bottom line
Applied machine learning in production is now mostly an operations and systems problem, not a modeling problem.
If you want reliable, cost-effective ML systems:
- Treat models as first-class services, with versioning, monitoring, and SLOs.
- Align the four loops: features, models, feedback, monitoring.
- Invest early in:
  - Training–serving consistency.
  - Simple, actionable drift detection and evaluation.
  - Ownership and clear rollout policies.
- Be explicit about cost vs performance trade-offs: model complexity is a knob, not a destiny.
Teams that ship stable ML systems aren’t the ones with the most sophisticated architectures. They’re the ones who treat ML like any other critical production dependency — with boring, disciplined engineering practices — and adapt those practices to the realities of probabilistic, data-driven behavior.
