Your ML System Is Not “Done” at Launch: A Pragmatic Guide to Evaluation, Monitoring, and Drift

Why this matters this week
A pattern is repeating across teams rolling out applied machine learning systems:
- Models ship that look great in offline benchmarks, then quietly decay in production.
- Infra cost for “AI features” creeps up 3–5x over a quarter with no corresponding business lift.
- Incidents are now “the model did something weird” instead of obvious bugs.
The common thread isn’t usually model architecture. It’s everything around the model:
- Evaluation that matches reality, not the Kaggle version of it
- Monitoring that treats models as probabilistic systems, not if-else code
- Data / concept drift detection that catches gradual failures, not only hard outages
- Feature pipelines that behave like production systems, not research notebooks
- Cost/perf trade-offs made with latency, recall, and cloud bills on the same graph
If you own any production ML system (recommendations, ranking, fraud detection, anomaly detection, LLM-based copilots, etc.), the grace period of “this is new, so it’s allowed to be flaky” has closed. Your stakeholders now expect SLO-grade reliability from probabilistic systems.
This post is about practical mechanisms you can implement in a week, not tooling fashion.
What’s actually changed (not the press release)
Three concrete shifts in applied ML over the last 12–18 months:
- The stack is “good enough”; integration is the hard part
- Model quality (including open-source) is often not the bottleneck.
- The hard bits:
- Integrating with legacy data stores and message buses
- Keeping feature definitions coherent across batch/stream/online
- Getting model decisions into user-facing paths with predictable latency
- You’re less constrained by “can we learn this?” and more by “can we operate this reliably?”
- Data drift and feedback loops are now “business as usual”
- User behavior, fraud patterns, supply/demand, and product UX all change faster than your retrain cadence.
- ML systems now shape the environment they learn from (ranking, dynamic pricing, spam detection).
- This creates:
- Non-stationary data distributions
- Second-order effects (e.g., model suppresses some content → less labeled data for that slice → poorer performance → more suppression)
- Inference cost actually shows up in your margin analysis
- At small scale, an extra 100ms or $0.001 per call is ignorable.
- At 10–100M calls/day, that’s:
- 100ms: tens of extra days of cumulative user wait time, every single day
- $0.001: $10–100k/month line items
- CFOs now ask: “Did the extra spend increase conversion / reduce fraud / improve retention?”
- You need end-to-end metrics that tie model variants to business KPIs and infra cost, not just ROC-AUC.
The net: production ML is less about clever architectures and more about treating models as services with SLOs, budgets, and change management.
How it works (simple mental model)
Use this mental model for any production ML system:
- Feature pipeline layer
- Responsible for:
- Ingesting raw data (events, logs, DB changes)
- Transforming into stable, versioned features
- Serving those features to training and inference consistently
- Key contracts:
- Schema (names, types, nullability)
- Statistical profile (ranges, distributions)
- Latency guarantees
- Model layer
- Given features → outputs (scores, embeddings, labels).
- You care about:
- Calibration (probabilities mean what they say)
- Robustness to missing / stale / noisy features
- Sensitivity to each feature (so you know what drift matters)
- Decision / policy layer
- Wraps model predictions with business logic:
- Thresholds, fallback rules
- Safety constraints (don’t auto-approve >$X loans)
- A/B routing, experimentation, rate limits
- Often where post-processing bugs cause bigger problems than the model itself.
- Evaluation & monitoring layer
- Three separate concerns:
- Pre-deploy: offline evaluation on realistic data + stress cases
- Post-deploy: online metrics (latency, error rates, traffic split), partial label feedback
- Drift detection: monitoring:
- Input drift (feature distributions change)
- Output drift (prediction distribution shifts)
- Performance drift (metrics degrade when labels arrive)
- Control loops
- How the system changes itself:
- Automated retraining pipelines
- Threshold tuning
- Policy updates
- Without guardrails, these loops create silent regressions.
If you can name each of these layers and say “this is where X lives,” you’re ahead of many production ML stacks.
Where teams get burned (failure modes + anti-patterns)
1. Offline eval doesn’t match online reality
Example 1 (consumer app recommendations):
- Offline, model improved NDCG@10 by 15%.
- In prod, click-through dropped.
- Root causes:
- Training data sampled from heavy users; new users underrepresented.
- Eval didn’t model “cold-start” distribution.
- Online, the system saw many more short sessions, so a model tuned to deep-session behavior underperformed.
Anti-patterns:
- Single global metric, no segment-level breakdown (e.g., new vs returning, country, device).
- No evaluation on temporal holdouts (e.g., last 2 weeks as a block).
2. Schema and feature drift silently poisoning models
Example 2 (fraud detection):
- Upstream service changed a field from cents → dollars.
- No schema enforcement; model kept seeing a 100x drop in transaction values.
- Fraud score distribution shifted, but no alerting on output drift.
- Fraud slipped through for 3 weeks until finance saw chargebacks spike.
Anti-patterns:
- No contracts between data producers and feature consumers.
- Treating feature pipelines as “just ETL” instead of production code with tests.
- No per-feature drift monitoring; only watching aggregate model metrics.
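A lightweight contract check at the feature-pipeline boundary would have caught the cents → dollars change on day one. Here is a minimal sketch in plain Python, assuming dict-shaped feature rows; the field names, types, and bounds are hypothetical placeholders for your own:

```python
# Illustrative feature contract. "max_null_rate" is enforced in aggregate
# monitoring, not per row; per-row checks cover type and range.
CONTRACT = {
    "amount_cents": {"type": int, "min": 1, "max": 10_000_000},
    "country": {"type": str, "max_null_rate": 0.01},
}

def validate_row(row: dict) -> list[str]:
    """Return a list of contract violations for one feature row."""
    errors = []
    for name, spec in CONTRACT.items():
        value = row.get(name)
        if value is None:
            continue  # null rates are checked in aggregate, not per row
        if not isinstance(value, spec["type"]):
            errors.append(f"{name}: expected {spec['type'].__name__}")
            continue  # range checks are meaningless on the wrong type
        if "min" in spec and value < spec["min"]:
            errors.append(f"{name}: {value} below min {spec['min']}")
        if "max" in spec and value > spec["max"]:
            errors.append(f"{name}: {value} above max {spec['max']}")
    return errors
```

Reject or quarantine rows with violations before they reach training or inference, and alert when the violation rate jumps.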
3. “One-size-fits-all” retraining
Example 3 (B2B SaaS churn model):
- Cron job retrains nightly on the last 12 months of data.
- A sudden pricing change in one region altered behavior in that segment.
- Global model averaged everything; performance in that region cratered.
- Labels (churn) arrive with months of lag → performance issues discovered far too late.
Anti-patterns:
- Single global model updated at a fixed cadence regardless of:
- Label delay
- Seasonality
- Data volume per segment
- No shadow evaluation comparing candidate models before promotion.
4. Cost explosions from “just ship the big model”
Example 4 (LLM-based support reply suggestions):
- MVP used a large model for every keystroke or ticket open.
- Latency okay; early adoption positive.
- Scale-up → inference bill > support team payroll, plus p95 latency blew past SLO.
- Response: frantic patchwork of caching, heuristic gating, and “weekend migrations.”
Anti-patterns:
- No per-request cost budget or latency SLO at design time.
- No graceful degradation (small model, heuristic fallback, lower frequency calls).
- Treating model selection as a one-time research choice instead of an ongoing cost/perf engineering decision.
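The fix for all three anti-patterns is a gating layer decided at design time, not a weekend migration. A sketch, where `small_model` and `large_model` are stand-ins for your actual inference clients, and the 200-character cutoff and budget values are invented heuristics to tune for your traffic:

```python
def suggest_reply(ticket_text: str, daily_spend: float, daily_budget: float,
                  small_model, large_model) -> str:
    """Route a request to the cheapest model that can plausibly handle it."""
    # 1. Cheap heuristic gate: short/simple tickets never need the big model.
    if len(ticket_text) < 200:
        return small_model(ticket_text)
    # 2. Budget gate: degrade gracefully instead of blowing the monthly bill.
    if daily_spend >= daily_budget:
        return small_model(ticket_text)
    # 3. Only hard cases, while under budget, reach the large model.
    return large_model(ticket_text)
```

The point isn’t this exact logic; it’s that degradation paths exist before scale-up forces them.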
Practical playbook (what to do in the next 7 days)
Assume you have at least one ML system in production already.
Day 1–2: Make the current system legible
- Draw the actual data + decision flow
- Whiteboard or doc:
- Where does training data come from?
- How do features get computed for training vs inference?
- Where does the model run? (service, batch job, embedded)
- What downstream systems depend on the outputs?
- Identify:
- Single points of failure
- Places where “someone just runs a script”
- Write down current metrics
- Model-level: AUC/precision/recall, NDCG, etc.
- Business-level: conversion, churn, fraud caught, average handle time.
- System-level: p50/p95 latency, error rates, QPS, infra cost.
- Ask: “Can we see these broken down by segment and by model version?”
Day 3: Add minimal but critical monitoring
Priority is cheap wins that catch common failure modes.
- Input monitoring
- For 5–10 top features:
- Track mean, std, missing rate, cardinality (for categoricals).
- Alert if:
- Missing rate jumps beyond historical range.
- Distribution divergence (e.g., KS test, PSI) crosses threshold.
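PSI is simple enough to implement from scratch on day 3. A minimal version over pre-binned counts; the thresholds in the docstring are the conventional rule of thumb, not something to adopt blindly:

```python
import math

def psi(expected_counts: list[int], actual_counts: list[int],
        eps: float = 1e-4) -> float:
    """Population Stability Index between two binned distributions.
    Rule of thumb (tune for your data): <0.1 stable, 0.1-0.25 investigate,
    >0.25 alert."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Clamp to eps so empty bins don't produce log(0).
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score
```

Run it daily per feature against a fixed baseline window (e.g., the training distribution), and alert on threshold crossings.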
- Output monitoring
- Track:
- Score distributions per model version.
- Decision rates (e.g., approvals per segment).
- Alert on:
- Sudden shifts (e.g., approvals drop 50% in a day for a region).
- Latency + cost
- For each model endpoint:
- p50/p95 latency
- QPS
- Rough cost per 1k requests (even if estimated)
- Set explicit SLOs:
- e.g., p95 < 300ms, $X per 1k requests.
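Even an estimated cost per 1k requests plus a breach check makes these SLOs actionable. A sketch; the default limits are made-up examples, not recommendations:

```python
def cost_per_1k(monthly_bill_usd: float, requests_per_day: float) -> float:
    """Back-of-envelope cost per 1k requests, assuming a 30-day month."""
    return monthly_bill_usd / (requests_per_day * 30) * 1000

def slo_breaches(p95_ms: float, per_1k_usd: float,
                 slo_p95_ms: float = 300.0,
                 slo_per_1k_usd: float = 0.50) -> list[str]:
    """Return which SLOs an endpoint is currently violating."""
    breaches = []
    if p95_ms > slo_p95_ms:
        breaches.append(f"p95 {p95_ms}ms > {slo_p95_ms}ms")
    if per_1k_usd > slo_per_1k_usd:
        breaches.append(f"${per_1k_usd:.2f}/1k > ${slo_per_1k_usd:.2f}/1k")
    return breaches
```

A rough estimate that is checked daily beats an exact number nobody computes.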
Day 4–5: Tighten evaluation to match reality
- Define 3–5 key slices
- Examples:
- New vs returning users
- High vs low transaction value
- Region or language
- Device type
- Ensure offline eval reports metrics by slice.
- Temporal validation
- Move to time-based splits:
- Train on older data, validate on the most recent (e.g., last 2–4 weeks).
- Compare:
- “Random split” metrics vs “temporal split” metrics.
- If there’s a big gap, your offline eval has been overstating performance.
- Create a simple “canary eval” harness
- Before promoting any new model:
- Run it on the last N days of production-like data.
- Compare:
- Overall metrics
- Key slice metrics
- Decision rate changes (e.g., approvals up 20%: intentional or not?)
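The decision-rate comparison can be a small report rather than a platform feature. A sketch that scores current and candidate models on the same recent traffic sample; the 5-point guardrail is an arbitrary example to tune per product:

```python
def canary_report(current_scores: list[float], candidate_scores: list[float],
                  threshold: float = 0.5, max_rate_delta: float = 0.05) -> dict:
    """Compare decision rates of two models on identical recent traffic.
    Blocks promotion if the approval rate moves more than `max_rate_delta`."""
    cur_rate = sum(s >= threshold for s in current_scores) / len(current_scores)
    cand_rate = sum(s >= threshold for s in candidate_scores) / len(candidate_scores)
    delta = cand_rate - cur_rate
    return {
        "current_rate": cur_rate,
        "candidate_rate": cand_rate,
        "delta": delta,
        "blocked": abs(delta) > max_rate_delta,
    }
```

A blocked promotion forces the “intentional or not?” conversation before users see the change, not after.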
Day 6: Add basic drift detection
Pick a small scope and implement end-to-end.
- Select 3–5 critical features
- High importance in the model
- Known to change over time (e.g., prices, counts, user activity)
- Compute drift daily
- For each
