Your ML System Is Not “Done” at Launch: A Pragmatic Guide to Evaluation, Monitoring & Drift

Why this matters this week
Most teams now have at least one machine learning system in production. Fewer have one they actually trust.
What’s changed recently isn’t that “AI is everywhere”; it’s that:
- Inference volumes (and costs) jumped, especially with large models.
- Regulatory and internal risk teams are asking for hard evidence: evaluation, monitoring, and controlled behavior under drift.
- Stakeholders want to treat ML services like any other production service: SLOs, incident playbooks, unit tests, and cost/performance dashboards.
If you’re a tech lead or CTO, the issue isn’t “can we build a model?” but:
- Can we measure whether it’s doing something useful in the wild?
- Can we detect when it goes off the rails before customers notice?
- Can we control infra and inference costs while keeping quality acceptable?
- Can we change it safely without multi-week re-approval cycles?
This post is a practical blueprint: how to structure evaluation, monitoring, drift handling, feature pipelines, and cost/perf trade-offs so your ML stack behaves like a production system, not a research prototype.
What’s actually changed (not the press release)
Three concrete shifts that matter operationally:
1. Feedback and evaluation moved “inside” the product
More teams are shipping:
- Embedded feedback widgets (thumbs up/down, flags).
- Automatic outcome logging (did the user click, reply, convert, churn?).
- Shadow traffic and A/B infra for models.
This creates continuous evaluation data, which changes how you:
- Decide when to retrain.
- Promote a candidate model.
- Tune costs (model size, context length, retrieval depth).
2. Monitoring tools caught up (somewhat)
You can now:
- Track input/output distributions in near real-time.
- Define alerts on business-level metrics (e.g., approval rates, refunds, fraud) correlated with model changes.
- Compare model variants under the same traffic.
Still immature, but far better than “log the predictions and pray.”
3. Inference costs became very real
With more complex models (and often LLMs), you now live with:
- Per-request inference costs that you can’t ignore.
- Latency spikes under load.
- GPU/TPU utilization as a first-class capacity planning topic.
This forces explicit cost/performance trade-offs: smaller model vs. higher error rate vs. caching vs. retrieval depth vs. context length.
How it works (simple mental model)
Treat your ML system as a closed loop rather than a one-time model deployment:
1. Data → Features → Model → Decision → Outcome → Feedback
   - Data: raw logs, user input, context.
   - Features: transformations (embeddings, aggregates, normalization).
   - Model: the actual predictor/generator/ranker.
   - Decision: what your product does with the output (show item, block transaction, suggest response).
   - Outcome: what actually happened (user clicked, fraud chargeback, support ticket reopened).
   - Feedback: structured signals you can use to evaluate and retrain.
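Concretely, each request can be captured as one record that travels the whole loop. A minimal sketch; the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PredictionRecord:
    """One traversal of the loop: data -> features -> model -> decision -> outcome."""
    request_id: str
    model_version: str
    features: dict            # feature values exactly as served
    model_output: float       # score / probability / ranking signal
    decision: str             # what the product did with the output
    outcome: Optional[str] = None   # joined in later from outcome events
    feedback: Optional[int] = None  # e.g., thumbs up (+1) / down (-1)

record = PredictionRecord(
    request_id="r-123",
    model_version="ranker-v7",
    features={"user_tenure_days": 42, "items_viewed": 5},
    model_output=0.83,
    decision="show_item",
)
record.outcome = "clicked"  # arrives minutes or days later via the feedback loop
```

Persisting records in this shape is what makes the medium and slow loops below possible: evaluation and retraining are just queries over it.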
2. Three loops running at different speeds
   - Fast loop (seconds–minutes): online monitoring
     - Detect anomalies: traffic drops, distribution shifts, error spikes.
     - Guardrails: hard constraints (never output X, never exceed threshold Y).
   - Medium loop (hours–days): evaluation and model comparison
     - Batch evaluation against ground-truth or proxy labels.
     - A/B tests on candidate vs. baseline models.
     - Cost and latency measurements.
   - Slow loop (weeks–months): retraining and pipeline evolution
     - Regenerate features.
     - Retrain models on new data.
     - Revisit architecture (e.g., feature store, retrieval strategy).
3. Four subsystems you should name and own
   - Evaluation: How good is the model now, on which tasks, under what conditions?
   - Monitoring: Is it behaving as expected in production?
   - Drift management: How has the world or data changed since training?
   - Infra + costs: How do we serve it reliably within budget?
If you don’t explicitly define each subsystem, you get an accidental version of it, usually implemented as a pile of ad-hoc scripts and dashboards no one fully trusts.
Where teams get burned (failure modes + anti-patterns)
1. “We’ll figure evaluation out later”
Pattern: Teams ship a model with only an offline accuracy metric and a vague plan to “watch metrics.”
How it bites you:
- Offline metrics don’t capture product-level impact (e.g., CTR up but revenue flat).
- You have no baseline for “acceptable degradation” when drift happens.
- Experiments are inconclusive; disagreements devolve into “it looks better to me.”
Fix:
- Define one primary metric per use case that ties to business impact (e.g., correctly auto-resolved tickets / total tickets).
- Lock in a baseline model with known performance.
- Require a predefined win condition before promoting a new model.
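A predefined win condition is easiest to enforce when it is executable. A minimal sketch, assuming an absolute-lift threshold plus a latency budget; every number here is a placeholder to tune:

```python
def should_promote(candidate_metric: float,
                   baseline_metric: float,
                   min_lift: float = 0.02,
                   candidate_latency_ms: float = 0.0,
                   max_latency_ms: float = 250.0) -> bool:
    """Predefined win condition: the candidate must beat the locked-in
    baseline by at least `min_lift` (absolute) without blowing the
    latency budget. All thresholds are placeholders to tune."""
    beats_baseline = candidate_metric >= baseline_metric + min_lift
    within_budget = candidate_latency_ms <= max_latency_ms
    return beats_baseline and within_budget

# Baseline locked at 0.61 (correctly auto-resolved tickets / total tickets).
promote = should_promote(0.66, 0.61, candidate_latency_ms=180)    # clear win
too_small = should_promote(0.62, 0.61, candidate_latency_ms=180)  # lift < 0.02
```

The point is not the arithmetic; it is that the promotion decision stops being a matter of opinion in a review meeting.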
2. Monitoring only infra, not behavior
Pattern: You monitor CPU, latency, and error rates but not the model outputs or input distributions.
Real example pattern:
- Recommendation API looks healthy (99.9% uptime, low latency).
- Users complain about irrelevant results for weeks.
- Root cause: a new partner data source changed product IDs; features became meaningless, but no infra metric noticed.
Fix:
- Add input feature distribution monitoring (mean, std, histograms, top-k categories).
- Add output distribution monitoring (e.g., score histograms, action frequencies).
- Set alerts on silent failures: e.g., percentage of empty recommendations, generic responses, or low-confidence outputs.
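A sketch of behavior-level monitoring for one window of traffic, assuming each logged output carries a recommendation list and a score; adapt the field names and the threshold to your own logs:

```python
def output_health(outputs: list[dict],
                  empty_rate_threshold: float = 0.05) -> dict:
    """Summarize one window of model outputs and flag silent failures.
    Each output dict is assumed to carry 'recommendations' (a list)
    and 'score' (a float)."""
    n = len(outputs)
    empty = sum(1 for o in outputs if not o["recommendations"])
    scores = sorted(o["score"] for o in outputs)
    return {
        "empty_rate": empty / n,
        "median_score": scores[n // 2],  # upper median for even windows
        "alert": empty / n > empty_rate_threshold,
    }

window = [
    {"recommendations": ["a", "b"], "score": 0.9},
    {"recommendations": [], "score": 0.1},
    {"recommendations": ["c"], "score": 0.7},
]
report = output_health(window)  # one-third empty -> alert fires
```

This catches exactly the failure above: the service is up, latency is fine, and the model is quietly returning nothing useful.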
3. Ignoring drift until things break
Pattern: “Our model worked fine for a year,” then suddenly performance craters due to:
- Seasonality (holidays, promotions).
- Policy or product changes.
- Platform shifts (search algorithm, app UI).
Real example pattern:
- A fraud model trained on pre-launch behavior.
- Post-launch, new users behave differently; legitimate patterns look “fraudulent.”
- Support gets flooded; the model is blamed, but the data distribution actually changed.
Fix:
- Define explicit drift indicators:
  - Population statistics (KL divergence, PSI, etc.) on key features.
  - Model confidence over time.
  - Performance by segment (new vs. existing users, markets).
- Decide in advance:
  - What triggers investigation?
  - When to roll back to an older model?
  - When to retrain vs. redesign features?
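PSI is simple enough to compute without a library. A self-contained sketch, with bin edges taken from the training sample and the common rule-of-thumb thresholds (below 0.1 stable, 0.1–0.25 investigate, above 0.25 significant shift); calibrate these for your own features:

```python
import math

def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between a training-time sample
    (`expected`) and a production window (`actual`)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def histogram(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            i = min(int((v - lo) / width), bins - 1)
            counts[max(i, 0)] += 1  # clamp values outside the training range
        # A small epsilon avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train_sample = [float(x % 50) for x in range(1000)]
shifted = [v + 30.0 for v in train_sample]
stable = psi(train_sample, train_sample)  # identical distributions -> 0.0
drifted = psi(train_sample, shifted)      # well above the 0.25 threshold
```

Run it per feature per day and store the result as a time series; the trend matters more than any single value.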
4. Feature pipelines as an afterthought
Pattern: Features are computed:
- Inline in application code.
- Differently between training and serving paths.
- Without versioning or schema checks.
Real example pattern:
- Normalization applied with training-time mean/std in offline training.
- In production, someone reimplements normalization with a different constant or forgets to handle nulls.
- The model “degrades” without any code changes in the model itself.
Fix:
- Treat features as a first-class artifact:
  - Schema (name, type, distribution).
  - Versioned transformations.
  - Single source of truth for train and serve.
- Add:
  - Training-serving skew checks (compare feature statistics between training data and live traffic).
  - Canary feature deployments: roll out new feature versions gradually.
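A training-serving skew check can start as small as comparing statistics stored at training time against a window of live traffic. The stats layout below is illustrative:

```python
def skew_report(train_stats: dict, live_values: dict,
                tolerance: float = 0.25) -> list[str]:
    """Flag features whose live mean moved more than `tolerance`
    training-time standard deviations, or that vanished from the
    serving path entirely. The `{"mean": ..., "std": ...}` layout
    is an assumption, not a standard."""
    flagged = []
    for name, stats in train_stats.items():
        values = live_values.get(name)
        if not values:
            flagged.append(f"{name}: missing in serving path")
            continue
        live_mean = sum(values) / len(values)
        std = stats["std"] or 1.0  # guard against a constant feature
        if abs(live_mean - stats["mean"]) / std > tolerance:
            flagged.append(f"{name}: mean shifted {live_mean:.2f} vs {stats['mean']:.2f}")
    return flagged

train_stats = {"basket_value": {"mean": 40.0, "std": 10.0}}
ok = skew_report(train_stats, {"basket_value": [39.0, 41.0, 40.5]})  # []
bad = skew_report(train_stats, {"basket_value": [70.0, 72.0]})       # flagged
```

The "missing in serving path" case is the one that catches the reimplemented-normalization bug above: the feature still exists in code, but its values no longer match what the model was trained on.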
5. Cost/perf trade-offs hand-waved
Pattern: “We’ll just use the best model” without:
- Tracking cost per 1k predictions.
- Measuring latency SLOs under production load.
- Considering cheaper alternatives (smaller model + caching + retrieval).
Real example pattern:
- LLM-based summarization used inline in user flows.
- Latency spikes and per-request costs exceed plan as volume grows.
- The team is forced into rushed optimization under fire.
Fix:
- For each ML service, define:
  - Target median and P95 latency.
  - Max cost per 1k calls.
  - Graceful degradation behavior (fallback models, shorter context, simpler retrieval).
- Measure and compare at least two model configurations:
  - “Fat” (best quality, higher cost).
  - “Lean” (acceptable quality, cheaper).
  - Sometimes a hybrid: lean by default, fat for high-value or ambiguous cases.
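The hybrid option can be sketched as confidence-based routing: serve the cheap model first and escalate only when the request is high-value or the lean model is unsure. The model interface (a dict with a confidence field) is an assumption:

```python
def route(request: dict, lean_model, fat_model,
          confidence_floor: float = 0.8,
          high_value: bool = False) -> dict:
    """Lean-by-default routing: escalate to the expensive model only
    for high-value requests or when the lean model is uncertain.
    `confidence_floor` is a placeholder threshold to tune."""
    if high_value:
        return fat_model(request)
    result = lean_model(request)
    if result["confidence"] < confidence_floor:
        return fat_model(request)  # lean model unsure: pay for quality
    return result

def lean_model(req: dict) -> dict:  # stand-in for the cheap model
    return {"label": "ok", "confidence": 0.95, "model": "lean"}

def fat_model(req: dict) -> dict:   # stand-in for the expensive model
    return {"label": "ok", "confidence": 0.99, "model": "fat"}

default = route({}, lean_model, fat_model)                     # served lean
important = route({}, lean_model, fat_model, high_value=True)  # escalated
```

In practice most traffic takes the cheap path, so the blended cost per 1k calls is dominated by the lean model while quality-sensitive cases still get the fat one.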
Practical playbook (what to do in the next 7 days)
Assume you already have at least one model in production. The goal this week is not perfection; it’s to get from “we hope it works” to “we can see and control what it’s doing.”
Day 1–2: Define your evaluation contract
- Pick one production use case (e.g., ranking, classification, QA, summarization).
- Write down:
  - Primary success metric (tied to product outcome).
  - 1–2 secondary metrics (e.g., precision/recall, response length).
  - A baseline (model version + current metric value).
- Implement a daily batch evaluation:
  - Sample predictions + outcomes from logs.
  - Compute metrics and store them as a time series.
- Make it visible on the same dashboard you use for infra.
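The daily batch evaluation can start as a single job that joins predictions with outcomes and emits one metric point per day. The field names below follow the auto-resolved-tickets example and are assumptions about your log schema:

```python
import datetime as dt

def daily_metric(logged: list[dict]) -> dict:
    """One day of joined predictions + outcomes -> one metric point.
    Primary metric: correctly auto-resolved tickets / auto-resolved
    tickets (a ticket that was reopened counts as incorrect)."""
    resolved = sum(1 for r in logged if r["decision"] == "auto_resolve")
    correct = sum(1 for r in logged
                  if r["decision"] == "auto_resolve" and not r["reopened"])
    return {
        "date": dt.date.today().isoformat(),
        "auto_resolve_rate": resolved / len(logged),
        "primary_metric": correct / resolved if resolved else 0.0,
    }

day = [
    {"decision": "auto_resolve", "reopened": False},
    {"decision": "auto_resolve", "reopened": True},
    {"decision": "escalate", "reopened": False},
]
point = daily_metric(day)  # append to the same store your dashboards read
```

Store the points next to your infra metrics so model quality and service health land on the same dashboard.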
Day 3–4: Add minimum viable monitoring
- Add logging (if missing):
  - Input features (or a hashed/aggregated representation if sensitive).
  - Model outputs.
  - Decision taken.
  - Outcome / label when available.
- Stand up basic model monitoring:
  - Track:
    - Request rate, latency, failure rate.
    - Input feature distribution for 3–5 key features.
    - Output distribution (score histograms, decision counts).
  - Set two alerts:
    - Sudden drop or spike in traffic or latency.
    - Major distribution shift in at least one critical feature or output.
- Document an incident playbook:
  - Who gets paged.
  - Immediate mitigations (rollback, feature flag, fallback model).
  - What is logged for the postmortem (model version, data snapshot, config).
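The two alerts can be checked with plain arithmetic against a baseline window; the thresholds and metric names below are placeholders to tune:

```python
def check_alerts(current: dict, baseline: dict) -> list[str]:
    """Minimum viable alerting for one window vs. a baseline window
    (e.g., the same hour yesterday). Both dicts are assumed to carry
    'requests' and 'block_rate'."""
    alerts = []
    # Alert 1: sudden drop or spike in traffic.
    ratio = current["requests"] / max(baseline["requests"], 1)
    if ratio < 0.5 or ratio > 2.0:
        alerts.append(f"traffic anomaly: {ratio:.1f}x baseline")
    # Alert 2: major shift in the decision mix (share of 'block' decisions).
    shift = abs(current["block_rate"] - baseline["block_rate"])
    if shift > 0.1:
        alerts.append(f"decision mix shifted by {shift:.2f}")
    return alerts

baseline = {"requests": 10_000, "block_rate": 0.05}
quiet = check_alerts({"requests": 9_500, "block_rate": 0.06}, baseline)  # []
noisy = check_alerts({"requests": 2_000, "block_rate": 0.30}, baseline)  # both fire
```

Wire the non-empty result into whatever paging system the rest of your production stack already uses; the model does not need its own alerting channel.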
Day 5: Drift checks and feature sanity
- Implement a simple drift detector:
  - For each selected feature:
    - Track the training distribution summary (mean/std, top categories).
    - Compare to the last 24–72 hours of production.
  - Raise a warning, not a hard alert, when divergence exceeds a threshold.
- Add schema and value checks on features.
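Schema and value checks can run as a gate in front of the model, rejecting or quarantining rows before they produce garbage predictions. The schema layout (a type plus an optional range) is illustrative:

```python
def validate_features(row: dict, schema: dict) -> list[str]:
    """Check one feature row against a declared schema before it
    reaches the model. The `{"type": ..., "range": (lo, hi)}` layout
    is an assumption, not a standard."""
    errors = []
    for name, spec in schema.items():
        if name not in row or row[name] is None:
            errors.append(f"{name}: missing")
            continue
        value = row[name]
        if not isinstance(value, spec["type"]):
            errors.append(f"{name}: expected {spec['type'].__name__}")
        elif "range" in spec:
            lo, hi = spec["range"]
            if not lo <= value <= hi:
                errors.append(f"{name}: {value} outside [{lo}, {hi}]")
    return errors

schema = {
    "age": {"type": int, "range": (0, 120)},
    "country": {"type": str},
}
clean = validate_features({"age": 34, "country": "DE"}, schema)    # []
broken = validate_features({"age": 999, "country": None}, schema)  # two errors
```

Log the rejection rate as its own time series: a slow climb in schema violations is often the first visible symptom of an upstream data change.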
