Your ML System Isn’t “Done” at Launch: A Pragmatic Guide to Evaluation & Drift

Why this matters this week

Most teams I talk to are in the same place:

  • The first production model is live (or about to be).
  • Infra is in place: feature store, model registry, some monitoring, maybe offline eval.
  • But the feedback loop is half-baked:
    • No clear owner for post-deploy performance.
    • Metrics exist but don’t drive decisions.
    • Drift alerts either don’t exist or are so noisy they’re ignored.
    • Cloud bills are up and nobody can say if the extra GPU spend is worth it.

Meanwhile:

  • Models are being retrained faster.
  • Data sources are changing faster.
  • Business expectations are higher: “We’re paying for this ML thing; is it actually moving our KPI?”

This week’s focus: applied ML in production—specifically, how to set up evaluation, monitoring, drift detection, feature pipelines, and cost/performance trade-offs so your system gets more reliable over time instead of silently rotting.

If you already have models in prod, you likely have latent risk:
– You think performance is stable because test accuracy hasn’t moved.
– Reality: your input distribution shifted months ago; your model just hasn’t failed loudly yet.

What’s actually changed (not the press release)

Three concrete shifts are driving how production ML needs to be run today:

  1. Feature surfaces are richer and more volatile.

    • Ten years ago: daily batch features from stable DB tables.
    • Now: streaming features, user embeddings, third-party signals, LLM-derived features.
    • Implication: Feature pipelines are a major source of drift, not just the raw business data.
  2. Serving cost is a first-class constraint.

    • GPU/accelerator use is no longer “R&D budget.” It hits the same P&L as everything else.
    • Latency SLOs are tighter (mobile, real-time bidding, in-app personalization).
    • Implication: You must explicitly trade off accuracy vs. latency vs. cost instead of just chasing “best model on leaderboard.”
  3. Data feedback is fragmented and delayed.

    • Labels are delayed (fraud resolution, claims, churn).
    • User feedback is noisy/implicit (clicks, dwell time).
    • Some events never get logged cleanly.
    • Implication: Relying solely on label-based metrics (AUC, F1) is too slow; you need proxy metrics and structural checks.

This isn’t about shiny tools. It’s about treating ML systems like production systems—with observability, budgets, and ownership—rather than frozen Jupyter notebooks.

How it works (simple mental model)

Use this mental model: your ML system is a closed loop with four surfaces you can instrument and control.

  1. Data surface
    What you feed the model.

    • Distribution of features (means, variances, category frequencies).
    • Schema (types, null rates, presence of columns).
    • Data freshness (latency between real-world event and when it becomes a feature).
    • Key idea: Compare current production distribution vs. training distribution continuously (a minimal drift-score sketch appears at the end of this section).
  2. Model surface
    How the model behaves given inputs.

    • Output distributions (scores, class probabilities).
    • Calibration (do 0.8 scores actually correspond to ~80% positives?).
    • Sensitivity to features (feature importance, SHAP, counterfactual tests).
    • Key idea: Validate that the structure of decisions stays sane, not just the final metric.
  3. Serving surface
    The infra path from request to response.

    • Latency (p50/p95/p99).
    • Error rates, timeouts, fallbacks invoked.
    • Resource usage (CPU/GPU/memory per request).
    • Key idea: Treat models as another microservice with SLOs, not as a “special” component.
  4. Impact surface
    What the model does to the business and users.

    • Direct metrics: conversion, fraud loss, spend, resolution rate.
    • Guardrail metrics: complaint rate, manual override rate, support tickets.
    • Counterfactual checks: “what if we turned it off?” (A/B tests, shadow deploys).
    • Key idea: Link model metrics to at least one business metric with dollars attached.

Overlaying this:

  • Evaluation = Periodic, structured checks across the four surfaces (offline tests, backtests, A/B).
  • Monitoring = Continuous tracking with alerts and dashboards on the same surfaces.
  • Drift = Meaningful change on the data or model surface that predicts degradation on the impact surface.
  • Cost/perf = Constraints on the serving surface that bound how fancy your model/feature pipelines can be.

If you’re missing visibility on any one surface, you’re flying partially blind.
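
To make the data-surface check concrete, here’s a minimal sketch of a population stability index (PSI) between a training sample and recent production traffic for one numeric feature. The function name, bin count, and rule-of-thumb thresholds are assumptions to tune, not a standard API:

```python
import numpy as np

def psi(train_values: np.ndarray, prod_values: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index for one numeric feature:
    training distribution vs. recent production distribution."""
    # Bin edges come from the training sample so both samples are bucketed identically.
    edges = np.unique(np.quantile(train_values, np.linspace(0, 1, n_bins + 1)))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range production values

    train_frac = np.histogram(train_values, bins=edges)[0] / len(train_values)
    prod_frac = np.histogram(prod_values, bins=edges)[0] / len(prod_values)

    # Small floor avoids log(0) when a bucket is empty in one of the samples.
    eps = 1e-6
    train_frac = np.clip(train_frac, eps, None)
    prod_frac = np.clip(prod_frac, eps, None)

    return float(np.sum((prod_frac - train_frac) * np.log(prod_frac / train_frac)))

# Common rule of thumb (an assumption, not a law): <0.1 stable, 0.1–0.25 watch, >0.25 investigate.
```

Run it daily on your most important features and plot the score over time; a sustained climb tells you more than any single day’s value.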

Where teams get burned (failure modes + anti-patterns)

1. Static test set worship

Pattern:
– Model is “validated” against a holdout set from last year’s data.
– That set becomes the canonical benchmark forever.
– Offline quality looks great; prod metrics keep slipping.

Why it hurts:
– Business, user behavior, and upstream data sources move. Your test set doesn’t.
– You end up optimizing for historical patterns, not current conditions.

Fix:
– Maintain a rolling evaluation window (e.g., last 30 days of labeled data) and track metrics over time.
– Version your test sets by time and scenario, not just by random split.
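
A minimal sketch of that rolling window, assuming you log predictions with an event timestamp, a (possibly delayed) label, and the model score; column names and the weekly grouping are placeholders for your own schema:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def rolling_eval(preds: pd.DataFrame, window_days: int = 30, freq: str = "W") -> pd.DataFrame:
    """Primary metric (AUC here) per period over the last `window_days` of labeled predictions.

    Assumes `preds` has columns: event_ts (datetime), label (0/1), score (model output).
    """
    cutoff = preds["event_ts"].max() - pd.Timedelta(days=window_days)
    recent = preds[preds["event_ts"] >= cutoff]

    rows = []
    for period, group in recent.groupby(pd.Grouper(key="event_ts", freq=freq)):
        if group["label"].nunique() > 1:  # AUC is undefined on single-class slices
            rows.append({
                "period_start": period,
                "auc": roc_auc_score(group["label"], group["score"]),
                "n_labeled": len(group),
            })
    return pd.DataFrame(rows)
```

Store each run’s output with a timestamp and model version so the metric becomes a time series, not a single number.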


2. Naïve drift detection (alert fatigue)

Pattern:
– Someone wires up KS-tests or PSI on every feature.
– Small, harmless drifts cause constant alerts.
– Team silences or ignores them; real issues slip by.

Why it hurts:
– Drift signals are not prioritized by impact.
– You can’t distinguish “seasonality” from “pipeline bug.”

Fix:
– Prioritize drift signals by expected impact:
  – Focus on features with high importance / high usage.
  – Incorporate business seasonality into baselines (e.g., compare Monday to previous Mondays).
  – Alert on composite signals (e.g., top-k important features with significant drift plus a drop in calibration), as in the sketch below.
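
Here’s a rough sketch of such a composite signal: KS tests on the top-k important features plus a calibration check, alerting only when both move. The thresholds and the `top_features` list are assumptions you’d tune against your own false-alarm tolerance:

```python
import numpy as np
from scipy.stats import ks_2samp

def calibration_error(labels: np.ndarray, scores: np.ndarray, n_bins: int = 10) -> float:
    """Expected calibration error: |observed positive rate - mean score|, weighted by bucket size."""
    bins = np.clip((scores * n_bins).astype(int), 0, n_bins - 1)
    err, total = 0.0, len(scores)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            err += mask.sum() / total * abs(labels[mask].mean() - scores[mask].mean())
    return err

def composite_drift_alert(train_feats, prod_feats, labels, scores,
                          top_features, ks_p=0.01, min_drifted=3, max_ece=0.05) -> bool:
    """Alert only when several *important* features drift AND calibration degrades.

    train_feats / prod_feats: dicts of feature name -> 1-D numpy array.
    top_features: the k most important features (from SHAP, gain, etc.).
    """
    drifted = [
        f for f in top_features
        if ks_2samp(train_feats[f], prod_feats[f]).pvalue < ks_p
    ]
    return len(drifted) >= min_drifted and calibration_error(labels, scores) > max_ece
```

The point isn’t these exact statistics; it’s that a page goes out only when multiple impact-weighted signals agree.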


3. Data/feature pipeline ambiguity

Pattern:
– No single owner for feature definitions.
– Offline features are computed differently than online ones.
– Someone “optimizes” a SQL query, silently changing semantics.

Example:
– A recommendation system where “active user” was defined as “logged in within 30 days.”
– Analytics team changed their definition to 7 days; feature pipeline piggybacked on the same table.
– Model appeared to drift overnight; conversion dropped several points.

Fix:
– Treat feature definitions as versioned, reviewed code with:
  – Ownership (team + on-call).
  – Contract (schema + semantics).
  – Change logs (what changed, why, expected impact).
– Build a simple “training vs. serving” feature parity check pipeline (sketched below).
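
A minimal parity-check sketch, assuming you log online feature values at serving time and can recompute the offline versions for the same entities; the key column, tolerance, and frame layout are placeholders:

```python
import numpy as np
import pandas as pd

def feature_parity_report(offline: pd.DataFrame, online: pd.DataFrame,
                          key: str = "entity_id", atol: float = 1e-6) -> pd.DataFrame:
    """Compare offline (training-pipeline) vs. online (serving-time) feature values.

    Assumes both frames contain `key` plus the same feature columns.
    Returns per-feature mismatch rates, worst first.
    """
    merged = offline.merge(online, on=key, suffixes=("_offline", "_online"))
    features = [c for c in offline.columns if c != key]

    rows = []
    for feat in features:
        off, on = merged[f"{feat}_offline"], merged[f"{feat}_online"]
        if pd.api.types.is_numeric_dtype(off):
            mismatch = (~np.isclose(off, on, atol=atol, equal_nan=True)).mean()
        else:
            mismatch = (off.fillna("<null>") != on.fillna("<null>")).mean()
        rows.append({"feature": feat, "mismatch_rate": float(mismatch), "n_compared": len(merged)})
    return pd.DataFrame(rows).sort_values("mismatch_rate", ascending=False)
```

Run it nightly on a sample of recent requests; a non-zero mismatch rate on an important feature is usually a pipeline bug, not drift.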


4. Overpowered models that blow your latency/$$$ budget

Pattern:
– Team swaps a small model for a large one because it’s +2% AUC.
– Serving moves from CPU to GPU; latency doubles; infra bill jumps 3x.
– Business KPI doesn’t visibly improve.

Example:
– Real-time bidding system switched to a deep model that added ~20ms at p95.
– That 20ms made them less competitive on auctions; revenue per mille actually fell.

Fix:
– Every model candidate needs a model card that includes latency and cost at expected load.
– Use Pareto curves: accuracy vs. latency vs. cost; ban regressions where cost/latency spike without clear business uplift.
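
A small sketch of that gate, assuming each candidate is summarized by one quality metric, p95 latency at expected load, and cost per 1,000 predictions; the dataclass and thresholds are illustrative, not a standard tool:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    metric: float        # e.g. offline AUC or cost-weighted score (higher is better)
    p95_ms: float        # p95 latency at expected load (lower is better)
    cost_per_1k: float   # serving cost per 1,000 predictions (lower is better)

def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is at least as good on every axis and strictly better on one."""
    at_least = a.metric >= b.metric and a.p95_ms <= b.p95_ms and a.cost_per_1k <= b.cost_per_1k
    strictly = a.metric > b.metric or a.p95_ms < b.p95_ms or a.cost_per_1k < b.cost_per_1k
    return at_least and strictly

def pareto_frontier(candidates: list[Candidate]) -> list[Candidate]:
    """Keep only candidates no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates if o is not c)]

def passes_gate(challenger: Candidate, champion: Candidate, min_uplift: float = 0.01,
                max_latency_regression_ms: float = 5.0, max_cost_increase: float = 0.0) -> bool:
    """Reject challengers that add latency or cost without a clear quality win."""
    return (challenger.metric - champion.metric >= min_uplift
            and challenger.p95_ms - champion.p95_ms <= max_latency_regression_ms
            and challenger.cost_per_1k - champion.cost_per_1k <= max_cost_increase)
```

Anything off the frontier, or failing the gate, doesn’t get a launch review, regardless of how it looked on the leaderboard.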


5. Unlabeled domains with wishful thinking

Pattern:
– For domains like anomaly detection or recommendation, teams skip ground-truth work.
– Rely on qualitative demos and anecdotal feedback.
– Model gradually shifts to a weird regime with no one noticing.

Fix:
– Even in hard-to-label domains, define minimal viable labels:
– “Was this recommendation hidden/removed by user?”
– “Was this anomaly dismissed as false positive by ops?”
– Use human-in-the-loop review on a small, rotating slice of traffic to maintain a labeled evaluation stream.
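
One hedged way to keep that labeled stream flowing: deterministically sample a small, rotating slice of traffic into the review queue, so reviewers see fresh examples each week and nobody cherry-picks. The weekly salt and 1% rate below are assumptions:

```python
import hashlib
from datetime import date

def in_review_slice(request_id: str, sample_rate: float = 0.01) -> bool:
    """Deterministically select ~sample_rate of requests for human labeling.

    The weekly salt rotates the slice so reviewers see different traffic each
    week instead of the same lucky 1% forever.
    """
    week_salt = date.today().isocalendar().week
    digest = hashlib.sha256(f"{request_id}:{week_salt}".encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF < sample_rate
```

Log the human verdict next to the model’s score and you have a small but trustworthy evaluation stream for an otherwise unlabeled domain.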

Practical playbook (what to do in the next 7 days)

Assume you already have at least one model in production. Here’s a concrete 7‑day plan.

Day 1–2: Define and wire the minimal metric set

  1. Pick exactly one primary metric per model

    • Example: For a fraud model: “expected $ fraud prevented per 1000 transactions.”
    • For ranking: “click-through rate uplift vs. baseline.”
    • For classification: “cost-weighted error (false positive/negative costs).”
  2. Add three supporting metric categories:

    • Data: # of missing values per feature, feature drift score (for top 10 features), data freshness.
    • Model: calibration error, output distribution (histogram of scores), volume by segment.
    • Serving: p50/p95/p99 latency, error rate, fallback usage.
  3. Get them into one dashboard per model:

    • Goal: An on-call engineer can answer in 2 minutes:
      • “Is traffic normal?”
      • “Has input data changed?”
      • “Has output behavior changed?”
      • “Is business impact moving?”
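
It helps to write this metric set down as config that both the dashboard and the alerting jobs read, so nobody argues later about what “primary” means. A sketch of what that could look like for the fraud example; every name and threshold here is a placeholder, not a standard schema:

```python
# Minimal per-model metric spec: one primary metric plus data / model / serving supports.
# All names and thresholds are placeholders to replace with your own.
FRAUD_MODEL_METRICS = {
    "primary": {
        "name": "expected_fraud_prevented_per_1k_txn",
        "unit": "usd",
        "alert_if_drop_pct": 10,  # vs. trailing 2-week average, where labels exist
    },
    "data": {
        "missing_rate_per_feature": {"alert_above": 0.05},
        "drift_score_top_features": {"k": 10, "alert_above": 0.25},  # e.g. PSI
        "freshness_minutes": {"alert_above": 60},
    },
    "model": {
        "calibration_error": {"alert_above": 0.05},
        "score_histogram": {"compare_to": "trailing_7d"},
        "volume_by_segment": {"alert_on_pct_change": 30},
    },
    "serving": {
        "latency_ms": {"p50": 20, "p95": 80, "p99": 150},  # SLO targets
        "error_rate": {"alert_above": 0.01},
        "fallback_rate": {"alert_above": 0.02},
    },
}
```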

Day 3–4: Implement basic drift and regression checks

  1. Baseline your feature distributions

    • Compute mean, std, cardinality, and missing fraction per feature on:
      • Training data.
      • Last 7 days of production data.
    • Flag:
      • >X% shift in mean/median for important features.
      • New categories appearing in high-cardinality features.
      • Sudden jumps in missingness.
  2. Set up a nightly “canary evaluation” job

    • Use the latest labeled data you have (even if delayed by weeks).
    • Compute primary metric vs.:
      • Previous model version.
      • Simple heuristic/baseline.
    • Store results with timestamps and config versions.
  3. Add a simple guardrail alert

    • Example: If primary metric drops >Y% vs. 2‑week average (where labels are available), open an incident.
    • If you don’t have labels yet, use proxy metrics (e.g., extreme score rates, calibration drift, sudden shifts in output distribution).
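
A sketch of the guardrail from step 3, assuming you store one primary-metric value per day as labels arrive; the 14-day lookback and 10% threshold stand in for your own Y:

```python
from datetime import datetime, timezone
from statistics import mean

def canary_check(history: list[float], today_metric: float,
                 lookback_days: int = 14, max_drop_pct: float = 10.0) -> dict:
    """Compare today's primary metric against its trailing average.

    history: one value per previous day (most recent last), computed on whatever
    labeled data was available that day. Returns a record to store and alert on.
    """
    baseline = mean(history[-lookback_days:]) if history else float("nan")
    drop_pct = 100.0 * (baseline - today_metric) / baseline if baseline else 0.0
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "metric": today_metric,
        "baseline": baseline,
        "drop_pct": drop_pct,
        "open_incident": drop_pct > max_drop_pct,
    }
```

The same comparison works against the previous model version or the heuristic baseline from step 2; just store one series per comparator.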

Day 5: Cost and latency accounting

  1. Run a 24-hour cost/perf snapshot
    • Measure:
      • Total inference calls.
      • Total compute cost (or at least instance usage attributable to this model).
      • Latency distribution.
    • Compute cost per 1,000 inference calls.
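
Back-of-the-envelope math for that snapshot (all numbers below are made up for illustration):

```python
# Illustrative 24-hour snapshot; plug in your own measurements.
inference_calls = 4_300_000       # total requests served in 24h
gpu_hours = 48                    # e.g. 2 GPUs x 24h attributable to this model
gpu_hourly_rate_usd = 2.10        # your cloud rate (on-demand or amortized)

total_cost = gpu_hours * gpu_hourly_rate_usd            # $100.80
cost_per_1k = total_cost / (inference_calls / 1_000)    # ~$0.023 per 1,000 predictions

print(f"Total 24h cost: ${total_cost:.2f}")
print(f"Cost per 1,000 predictions: ${cost_per_1k:.4f}")
```

Put that number next to the primary metric; it’s the denominator for every accuracy-vs.-cost conversation that follows.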
