Your Models Aren’t Failing. Your Observability Is.

[Hero image: a dimly lit operations war room, large screens showing drifting distributions and network graphs]

Why this matters this week

Most teams now have at least one machine learning system in production. Fewer have any credible way to answer:

  • “Is this model still working as expected?”
  • “What changed in the data between last week and this week?”
  • “What’s our all-in cost per correct prediction?”

Over the last year, the bar has moved:

  • Evaluation can’t be a one-time project artifact; it has to be a living, monitored contract with reality.
  • LLMs and online models make drift more frequent and harder to see: you often don’t have clean labeled feedback, and input distributions shift faster than your dashboards update.
  • Infra is getting cheaper per FLOP, but total cost is exploding because nobody is tying inference cost to value.

This week’s post is about the unglamorous middle: evaluation, monitoring, data/feature pipelines, drift handling, and cost/perf trade‑offs for applied machine learning in production.

If your team is still evaluating models only on an offline test set and calling it done, you’re flying without instruments.


What’s actually changed (not the press release)

Several practical shifts in the last 12–18 months have made “good enough” monitoring and evaluation obsolete.

  1. More models are in the critical path

    • ML used to power “nice to have” features (recommendations, rankings).
    • Now it’s:
      • Fraud / risk scoring
      • Search relevance for core workflows
      • Code generation for production systems
      • Routing of customer tickets or payments
    • Failure now shows up as outages, not just lower CTR.
  2. Data drift is faster and weirder

    • LLMs and generative systems change user behavior (e.g., new prompt styles, different query formulations).
    • Regulation and product changes shift the distributions of core features (e.g., new KYC rules, new pricing tiers).
    • Models degrade not because they’re “wrong,” but because they’re optimizing for a world that no longer exists.
  3. Feature stores became less magical, more plumbing

    • Teams discovered the hard way:
      • Offline feature pipelines ≠ online features.
      • “Training data” snapshots rot quickly.
    • The conversation shifted from “adopt a feature store” to:
      • Do we have one source of truth for how each feature is computed?
      • Can we compute it consistently in batch and online?
      • Can we observe it over time?
  4. Inference cost became material

    • Especially for LLMs and heavy models:
      • $0.001 / request is not cheap at 10M daily requests.
      • Latency-sensitive workloads multiply infra cost (GPU overprovisioning, caching layers, etc.).
    • Execs are now asking:
      • “Why is this model 5x more expensive for +2% metric gain?”
      • “What’s the ROI of higher accuracy vs infra spend?”
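
To make the numbers concrete, here is a back-of-the-envelope sketch of the “cost per correct prediction” question from the top of the post. Every input (request volume, unit cost, accuracy) is an illustrative assumption, not a benchmark:

```python
# Back-of-the-envelope cost model; all inputs are illustrative assumptions.
DAILY_REQUESTS = 10_000_000      # 10M requests per day
COST_PER_REQUEST = 0.001         # $0.001 per request, all-in
ACCURACY = 0.92                  # fraction of requests the model gets "right"

daily_cost = DAILY_REQUESTS * COST_PER_REQUEST
cost_per_correct = daily_cost / (DAILY_REQUESTS * ACCURACY)

print(f"Daily inference cost:  ${daily_cost:,.0f}")         # $10,000 / day
print(f"Annualized:            ${daily_cost * 365:,.0f}")   # ~$3.65M / year
print(f"Cost per correct call: ${cost_per_correct:.5f}")    # ~$0.00109
```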

How it works (simple mental model)

A robust production ML system is less about the model and more about loops:

  1. Input loop (feature + data pipeline)

    • Raw data → transformations → features → model input.
    • You need:
      • Versioned code for feature generation.
      • Distribution monitoring (mean, variance, categories, missingness).
      • Schema and contract checks.
  2. Prediction loop (inference + routing)

    • Input → model → output → downstream consumer.
    • You need:
      • Latency and error metrics (per model, per endpoint).
      • Cost metrics (tokens, FLOPs, hardware utilization).
      • Basic A/B routing to test changes in the real world.
  3. Feedback loop (labels + human signals)

    • Prediction + outcome → labeled datapoint (possibly delayed).
    • You need:
      • Event correlation between prediction and eventual label.
      • Quality metrics over cohorts (by segment, time, geography).
      • A clear policy for what data becomes training data.
  4. Control loop (evaluation + retraining + rollback)

    • Monitor → detect issues → roll forward/back.
    • You need:
      • Guardrail metrics: “red lines” where you auto-rollback.
      • Shadow / canary deployments that compare old vs new.
      • Playbooks for incident response when ML goes bad.

If you don’t close one of these loops, you get:
– Unnoticed drift
– Silent performance regressions
– Cost blowups

Your mental model: ML in production is an adaptive control system. The model is just one component; the loops keep it safe and useful.
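
To ground the input loop in something concrete, here is a minimal sketch of a schema-and-distribution contract check. The feature names, dtypes, and thresholds are invented for the example; the point is that the contract lives in versioned code and runs on every batch of model inputs:

```python
import pandas as pd

# Hypothetical contract for one model's inputs; names and thresholds are examples.
FEATURE_CONTRACT = {
    "txn_sum_30d":      {"dtype": "float64", "min": 0.0, "max_missing": 0.02},
    "account_age_days": {"dtype": "int64",   "min": 0,   "max_missing": 0.0},
    "country_code":     {"dtype": "object",  "allowed": {"US", "GB", "DE"}, "max_missing": 0.0},
}

def check_batch(df: pd.DataFrame) -> list[str]:
    """Return contract violations for one batch of model inputs."""
    violations = []
    for col, rules in FEATURE_CONTRACT.items():
        if col not in df.columns:
            violations.append(f"{col}: column missing")
            continue
        if str(df[col].dtype) != rules["dtype"]:
            violations.append(f"{col}: dtype {df[col].dtype}, expected {rules['dtype']}")
        missing_rate = df[col].isna().mean()
        if missing_rate > rules["max_missing"]:
            violations.append(f"{col}: missing rate {missing_rate:.1%}")
        if "min" in rules and (df[col].dropna() < rules["min"]).any():
            violations.append(f"{col}: values below {rules['min']}")
        if "allowed" in rules:
            unknown = set(df[col].dropna()) - rules["allowed"]
            if unknown:
                violations.append(f"{col}: unexpected categories {sorted(unknown)}")
    return violations

# A non-empty result should feed the control loop: alert, block the batch, or
# trigger a rollback, depending on the guardrail policy for that model.
```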


Where teams get burned (failure modes + anti-patterns)

1. Offline evaluation worship

Pattern:
– Model chosen on offline AUC / NDCG / BLEU.
– Shipped to prod.
– Assumed “good” until someone complains.

Failure modes:
– Metric doesn’t correlate with business impact.
– Evaluation dataset doesn’t cover long-tail or new segments.
– Temporal leakage: test set includes patterns from the future.

Example:
– A B2B SaaS team ships a lead-scoring model with great ROC-AUC.
– Sales later reports: “The model ignores strategic accounts that are rare in the historical data.”
– Offline eval never included those accounts as a separate cohort.

Mitigation:
– Define primary online metric (e.g., revenue / session, fraud loss).
– Track per-cohort performance (new users vs returning, regions, tiers).
– Treat offline metrics as sanity checks, not decision makers.
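
Here is a minimal sketch of what per-cohort tracking can look like, assuming you log scored predictions alongside a cohort column and eventually join in labels (the `account_tier`, `score`, and `label` column names are invented for the example):

```python
import pandas as pd
from sklearn.metrics import precision_score, roc_auc_score

def cohort_report(df: pd.DataFrame, cohort_col: str = "account_tier") -> pd.DataFrame:
    """Per-cohort quality metrics from rows of (score, label, cohort)."""
    rows = []
    for cohort, grp in df.groupby(cohort_col):
        rows.append({
            "cohort": cohort,
            "n": len(grp),
            # AUC is undefined if a cohort contains only one class.
            "auc": roc_auc_score(grp["label"], grp["score"]) if grp["label"].nunique() > 1 else None,
            # The 0.5 threshold is an assumption; use whatever your serving layer uses.
            "precision": precision_score(grp["label"], grp["score"] > 0.5, zero_division=0),
        })
    return pd.DataFrame(rows).sort_values("n", ascending=False)

# The interesting failures usually hide in the small cohorts at the bottom of
# this table: exactly the "strategic accounts" an aggregate AUC glosses over.
```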


2. Feature pipelines that diverge

Pattern:
– Separate code paths for:
  – Training data generation (SQL / Spark)
  – Online serving (API / microservice logic)
– “Same feature” defined twice in different languages / repos.

Failure modes:
– Training/serving skew: the model sees different inputs in prod.
– Hard to debug: offline looks good, online performance is bad.
– Silent bugs when someone edits only one pipeline.

Example:
– A fintech team computes a “30d transaction sum” feature in Spark for training and again in a Python service for online serving.
– A bug in the Python service truncates to 7 days.
– Model suddenly labels good customers as high risk.

Mitigation:
– Single source of truth for feature definitions (even if not a fancy feature store).
– Use shared libraries / code generation for feature transformations.
– Add training-serving skew checks:
  – Compare feature distributions on sampled online data vs training data regularly.
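
One way to implement that skew check, assuming you can pull a training sample and a recent online sample into pandas frames with matching columns (a sketch, not a full solution):

```python
import pandas as pd

def skew_report(train: pd.DataFrame, online: pd.DataFrame, threshold: float = 0.25) -> pd.DataFrame:
    """Flag numeric features whose online mean has drifted from training
    by more than `threshold` training standard deviations."""
    rows = []
    for col in train.select_dtypes("number").columns:
        if col not in online.columns:
            rows.append({"feature": col, "shift_in_stds": None, "flagged": True})  # missing online
            continue
        t_mean, t_std = train[col].mean(), train[col].std()
        o_mean = online[col].mean()
        shift = abs(o_mean - t_mean) / t_std if t_std > 0 else 0.0
        rows.append({"feature": col, "train_mean": t_mean, "online_mean": o_mean,
                     "shift_in_stds": shift, "flagged": shift > threshold})
    return pd.DataFrame(rows).sort_values("shift_in_stds", ascending=False)

# Run this on a daily sample. A bug like the 30d sum silently becoming a 7d sum
# shows up as a large shift in a single feature, long before anyone complains.
```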


3. Blindness to drift

Pattern:
– No input or output monitoring.
– Only aggregate business metrics tracked (revenue, churn).

Failure modes:
– Gradual degradation hidden by other product changes.
– Catastrophic model failures during distribution shifts (e.g., seasonality, promotions, policy changes).

Example:
– An e-commerce recommender works fine until a major holiday sale.
– New kinds of items flood the catalog; user behavior shifts.
– Recommendations lag behind, conversion drops.
– Root cause identified weeks later; no drift alert fired.

Mitigation:
– Monitor:
  – Input drift: distributions of key features vs baseline.
  – Output drift: prediction distribution, confidence scores, ranking skew.
– Alerts for:
  – Sudden changes in missing value rates.
  – New category values or schema violations.
  – Significant KL divergence / PSI vs reference window.
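
PSI (population stability index) is simple enough to compute yourself. A minimal numpy sketch, comparing a feature’s current window against a reference window:

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index of `current` vs `reference` for one numeric feature."""
    # Bin edges come from the reference distribution; quantile bins handle
    # skewed features better than equal-width bins.
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # catch out-of-range values

    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)

    # Floor the proportions to avoid log(0) and division by zero.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

# Common rule of thumb (a starting point, not gospel):
# < 0.1 stable, 0.1 to 0.25 worth a look, > 0.25 investigate.
```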


4. Ignoring cost-performance trade-offs

Pattern:
– “Use the biggest model that fits the GPU.”
– Little tracking of:
– Latency SLOs vs user impact.
– Cost per request.
– ROI of quality improvements.

Failure modes:
– Unexpected infra bill spikes.
– Latency violations hurting conversion or UX.
– Difficulty justifying ML expansion to leadership.

Example:
– A support platform moves from a small classifier to an LLM-based triage system.
– Resolution time improves slightly, but GPU costs 10x.
– Without clear cost-per-ticket and win-rate metrics, leadership forces a rollback.

Mitigation:
– Track at least:
  – P95 latency, cost per request, and success metric (e.g., solved tickets, fraud prevented).
– Explicitly evaluate:
  – “Is the +X% in quality worth the +Y% cost?”
– Consider tiered architectures:
  – Cheap model for 80% of traffic.
  – Expensive model as a fallback for hard cases.
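
The tiered idea in the last bullet can start as a confidence-gated router. In this sketch, `cheap_model` and `expensive_model` are placeholders for whatever you actually serve, each assumed to return a (label, confidence) pair, and the confidence floor is something you tune on held-out data, not a magic number:

```python
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.8   # assumption: tune this against real traffic

@dataclass
class Prediction:
    label: str
    confidence: float
    tier: str

def route(features: dict, cheap_model, expensive_model) -> Prediction:
    """Serve the cheap model by default; escalate only low-confidence cases."""
    label, conf = cheap_model(features)
    if conf >= CONFIDENCE_FLOOR:
        return Prediction(label, conf, tier="cheap")
    label, conf = expensive_model(features)   # the costly path, hard cases only
    return Prediction(label, conf, tier="expensive")

# Log `tier` with every prediction so you can report cost and quality per tier,
# which is the number leadership actually wants when the GPU bill comes up.
```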


5. No clear ownership or incident process

Pattern:
– ML as “best effort R&D”.
– Nobody owns on-call for models.
– Model bugs treated as “data quirks” rather than incidents.

Failure modes:
– Slow response to regressions.
– Finger-pointing between data, infra, and product teams.
– Accumulation of silent model debt.

Example:
– A content moderation model starts over-flagging benign posts after a data pipeline change.
– No dashboards; only support tickets as a signal.
– Issue persists for days because it’s nobody’s SLO.

Mitigation:
– Assign:
  – Clear ownership for each model (team + on-call).
  – SLOs and SLIs for model performance + infra metrics.
– Write runbooks:
  – “What to check first when [metric] changes?”
  – “When to roll back vs hotfix vs ignore?”


Practical playbook (what to do in the next 7 days)

You won’t fix ML observability in a week, but you can move from zero to “non-embarrassing.”

Day 1–2: Make the system visible

  1. Inventory your production models

    • For each model:
      • Endpoint / service name
      • Owner (team, on-call)
      • Primary business metric it influences
  2. Add basic metrics (if missing)

    • Per endpoint:
      • Request rate, error rate, P50/P95 latency
      • Cost proxy:
        • For LLMs: tokens / request.
        • For internal models: average GPU/CPU utilization.
  3. Snapshot current performance

    • For at least one model:
      • Recent labeled data (last 1–4 weeks).
      • Compute core quality metric (e.g., precision/recall, RMSE, calibration).
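
If you have none of these metrics today, even a thin wrapper around the inference call is a step up. A sketch using the `prometheus_client` library; the metric names are examples, and the wrapped call is assumed to return a dict with an optional `tokens` count:

```python
import time
from prometheus_client import Counter, Histogram

REQUESTS = Counter("model_requests_total", "Inference requests", ["model", "version", "status"])
LATENCY = Histogram("model_latency_seconds", "Inference latency", ["model", "version"])
TOKENS = Counter("model_tokens_total", "Tokens processed (LLM cost proxy)", ["model", "version"])

def predict_with_metrics(model_fn, model_name: str, version: str, features: dict) -> dict:
    """Wrap an existing inference call with request, latency, and cost-proxy metrics."""
    start = time.perf_counter()
    try:
        result = model_fn(features)   # your existing inference call
        REQUESTS.labels(model_name, version, "ok").inc()
        TOKENS.labels(model_name, version).inc(result.get("tokens", 0))
        return result
    except Exception:
        REQUESTS.labels(model_name, version, "error").inc()
        raise
    finally:
        LATENCY.labels(model_name, version).observe(time.perf_counter() - start)
```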

Day 3–4: Stand up minimal drift + skew checks

Pick one high-impact model and do the following:

  1. Define 5–10 “key features”

    • High importance in the model or domain-critical.
    • For each, compute:
      • Mean, std, min, max
      • Missing rate
      • Category frequencies (for categorical)
  2. Log feature snapshots in prod

    • On a sample of predictions (e.g., 1–5% of traffic).
    • Store:
      • Timestamp
      • Feature values
      • Prediction
      • Model version
  3. Compare to training/last-week baseline

    • Quick scripts to compute:
      • Differences in mean/std.
      • Drift scores (e.g., PSI or KL divergence vs your reference window).

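A minimal version of the snapshot logging in step 2 above (a sketch; whether records land in a file, a topic, or a warehouse table depends on your stack):

```python
import json
import random
import time

SAMPLE_RATE = 0.02   # ~2% of traffic, in line with the 1-5% suggestion above

def maybe_log_snapshot(features: dict, prediction, model_version: str, sink) -> None:
    """Write a sampled feature/prediction snapshot for later drift analysis."""
    if random.random() > SAMPLE_RATE:
        return
    record = {
        "ts": time.time(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
    }
    sink.write(json.dumps(record, default=str) + "\n")

# Usage sketch: open("snapshots.jsonl", "a") as the sink, or swap `sink` for
# whatever writer your stack provides. The comparison scripts in step 3 read
# these snapshots and diff them against the training (or last-week) baseline.
```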