Your ML Inference Pipeline Is Not “Serverless” (And It’s Costing You Real Money)
Why this matters right now
Most teams that “run ML on AWS” are actually running three different systems:
- A training stack (often on EC2 or managed services)
- A real-time inference API (maybe SageMaker, maybe ECS, maybe Lambda)
- A batch / feature / data engineering system (Glue, EMR, Step Functions, Lambda, Fargate, whatever looked least painful at the time)
They’re loosely integrated, observability is partial, and the AWS bill is opaque. The result:
- Inference is cheap-ish per call and mysteriously expensive per month
- Latency SLOs drift as you layer on feature stores, model registries, and AB testing
- Reliability is fragile, especially around deployment and rollback
- Platform engineering becomes firefighting integration issues instead of improving capabilities
Serverless patterns on AWS (Lambda, Step Functions, EventBridge, DynamoDB, SQS, Kinesis) actually fit ML workloads well, if you design around their constraints. Used incorrectly, they produce:
- Hidden idle cost (e.g., always-on containers you don’t need)
- Unbounded concurrency blowups
- Tracing gaps across training, feature computation, and inference
This post is about treating ML pipelines as first-class cloud engineering problems, not “special” snowflakes. The goal: predictable cost, predictable latency, and boring reliability.
What’s actually changed (not the press release)
Three concrete things have shifted over the last 2–3 years that matter for ML on AWS:
1. Serverless got good enough for most inference paths
AWS added:
- Larger Lambda sizes (CPU/RAM), arm64, and better cold start behavior
- Step Functions with lower pricing for high-volume workflows
- EventBridge with decent routing/fanout
- API Gateway and Lambda integration that’s less painful than it used to be
That means:
- Many “we need a container for this” arguments don’t hold for typical online inference
- You can run moderately heavy models (non-GPU) in Lambda with tolerable latency
2. GPU workloads are still not serverless, but their edges are
You are not going to run serious GPU inference in Lambda. But:
- Request routing, aggregation, fanout, feature materialization, and pre/post-processing are perfect for serverless.
- You can reduce your GPU fleet by moving everything except the actual matmul off the GPU nodes.
- Spot + autoscaling + a smart serverless front-door gets you 80% of the “serverless GPU” dream.
3. Platform engineering expectations have increased
Teams expect:
- One way to deploy ML services (training jobs, inference endpoints, batch pipelines)
- One way to observe them (traces/metrics/logs)
- One way to bill them (cost by team/model/feature)
You won’t get this using “whatever each team likes” on AWS. You need a platform opinion, and AWS’s building blocks are finally decent enough to standardize around:
- Lambda + Step Functions for orchestration
- SQS/Kinesis/EventBridge for decoupling
- DynamoDB/S3 as primary state/backing
- CloudWatch/X-Ray/embedded metrics for observability (or your own stack, but wired consistently)
How it works (simple mental model)
Instead of thinking in “ML” terms, think in three planes:
- Control Plane: Training, deployment, configuration
- Data Plane: Online inference and batch scoring
- Observability Plane: Metrics/traces/logs + model performance signals
A pragmatic AWS/serverless partition looks like this:
1. Control Plane (slow, stateful, low-volume)
- Tech: Step Functions, Lambda, CodeBuild/CodePipeline, SageMaker training jobs or ECS
- Owns:
- Kicking off training jobs
- Registering models (S3 + DynamoDB as a registry is often enough)
- Creating/updating inference endpoints (ECS services, SageMaker endpoints, or Lambda aliases)
- Running canary / shadow / AB rollout workflows
Mental model: A small set of state machines (Step Functions) represent:
TrainModel → Evaluate → Register → DeployCanary → ShiftTraffic → Finalize / Rollback, with automated criteria
These are low-frequency workflows; serverless overhead doesn’t matter.
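That rollout state machine can be sketched as an Amazon States Language definition built in Python. The state names follow the workflow above; the Lambda ARNs and the accuracy threshold are placeholders, not a prescribed schema:

```python
import json

# Illustrative ASL skeleton for TrainModel → Evaluate → Register →
# DeployCanary → ShiftTraffic → Finalize / Rollback.
# Resource ARNs and the 0.9 threshold are placeholders.
ROLLOUT_DEFINITION = {
    "StartAt": "TrainModel",
    "States": {
        "TrainModel": {"Type": "Task", "Resource": "arn:aws:lambda:...:train", "Next": "Evaluate"},
        "Evaluate": {"Type": "Task", "Resource": "arn:aws:lambda:...:evaluate", "Next": "CheckMetrics"},
        "CheckMetrics": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.accuracy", "NumericGreaterThan": 0.9, "Next": "Register"}],
            "Default": "Rollback",
        },
        "Register": {"Type": "Task", "Resource": "arn:aws:lambda:...:register", "Next": "DeployCanary"},
        "DeployCanary": {"Type": "Task", "Resource": "arn:aws:lambda:...:deploy_canary", "Next": "ShiftTraffic"},
        "ShiftTraffic": {"Type": "Task", "Resource": "arn:aws:lambda:...:shift_traffic", "Next": "Finalize"},
        "Finalize": {"Type": "Succeed"},
        "Rollback": {"Type": "Fail", "Error": "CanaryFailed", "Cause": "Evaluation below threshold"},
    },
}

print(json.dumps(ROLLOUT_DEFINITION, indent=2))
```

The Choice state is where "automated criteria" live: promotion only happens when the evaluation step's output clears the threshold, otherwise the machine fails fast and you roll back.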
2. Data Plane (fast, stateless, high-volume)
Split this into real-time and batch:
Real-time API:
- Tech:
- API Gateway (or ALB) → Lambda for light/CPU inference
- API Gateway → Lambda → SQS/Kinesis → ECS/SageMaker for heavy/GPU inference
- Pattern:
- Lambda as the first hop: auth, validation, cheap feature fetching
- Dispatch to the right backend based on:
- Model type
- Latency SLO
- Payload size
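The dispatch decision in that first-hop Lambda can be a few lines of routing logic. The thresholds and backend names below are assumptions for illustration, not AWS defaults; tune them to your own measurements:

```python
from dataclasses import dataclass

@dataclass
class ModelRoute:
    model_type: str      # e.g. "linear", "tree", "transformer"
    latency_slo_ms: int  # end-to-end target for this endpoint
    payload_bytes: int

def pick_backend(route: ModelRoute) -> str:
    """Route a request to the cheapest backend that can meet its SLO.

    Thresholds here are illustrative placeholders.
    """
    if route.model_type in ("linear", "tree") and route.payload_bytes < 256_000:
        return "lambda"              # light CPU inference, in-process
    if route.latency_slo_ms >= 1_000:
        return "sqs-to-ecs"          # relaxed SLO -> async queue path
    return "sagemaker-endpoint"      # heavy model, tight SLO

print(pick_backend(ModelRoute("tree", 150, 4_096)))           # lambda
print(pick_backend(ModelRoute("transformer", 5_000, 2_000)))  # sqs-to-ecs
```

Keeping this decision in one place is the point: adding a new model means adding a route, not a new service.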
Batch / offline scoring:
- Tech: EventBridge scheduled rules, Step Functions, Lambda, Fargate/ECS tasks, S3
- Pattern:
- EventBridge triggers a Step Function that:
- Reads offsets/partitions from S3 or your lake
- Fans out work into parallel map tasks (Lambda for light CPU; ECS/Fargate for heavy jobs)
- Writes results back to S3 / a warehouse / a feature store
Mental model: The data plane is a small set of composable paths:
- “Synchronous request/response”
- “Async high-throughput jobs”
- “Periodic batch windows”
Each path is standardized; models plug into them.
3. Observability Plane (shared, boring, enforced)
- Tech: CloudWatch metrics, logs, X-Ray or your tracing stack, plus maybe DynamoDB/S3 for analytic events
- You standardize:
- Correlation IDs from API entry through model invocation
- Structured log schema (request ID, model ID, version, latency, response code, customer/tenant)
- A handful of default SLOs:
- p95 latency per endpoint
- error rate per endpoint
- cost per 1k requests (approximate)
- model-specific metrics (e.g., click-through, fraud catch rate) piped asynchronously
Mental model: Every path (control & data) emits the same basic telemetry. You can drill into any model the same way.
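Two of those default SLO signals, p95 latency and cost per 1k requests, are cheap to compute yourself. A minimal sketch, where the Lambda price constants are illustrative (us-east-1 x86 rates at time of writing; substitute your region's current pricing):

```python
import math

def p95(latencies_ms: list) -> float:
    """Nearest-rank p95 over a window of request latencies."""
    ordered = sorted(latencies_ms)
    idx = math.ceil(0.95 * len(ordered)) - 1
    return ordered[idx]

def lambda_cost_per_1k(avg_duration_ms: float, memory_gb: float,
                       price_per_gb_s: float = 0.0000166667,
                       price_per_request: float = 0.0000002) -> float:
    """Approximate Lambda cost per 1,000 invocations.

    Price constants are placeholders; check your region's rates.
    """
    per_request = (avg_duration_ms / 1000.0) * memory_gb * price_per_gb_s + price_per_request
    return per_request * 1000

print(p95([10, 20, 30, 40, 1000]))                      # the tail dominates
print(round(lambda_cost_per_1k(120, memory_gb=1.0), 4)) # dollars per 1k requests
```

Even an approximate cost-per-1k metric, emitted per model, turns "the bill is opaque" into a dashboard you can rank.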
Where teams get burned (failure modes + anti-patterns)
1. “We’ll just put it on SageMaker and it’s fine”
What happens:
- Teams pick managed endpoints and call it “done”
- Feature fetching, orchestration, and pre/post-processing end up embedded in the model container
- You can’t reuse that logic elsewhere (batch scoring, experiments, shadow modes)
- Endpoint scaling hides inefficient code with more instances instead of forcing refactors
Result: High per-hour instance cost and a brittle deployment pipeline.
When SageMaker or ECS endpoints make sense:
- Heavy models
- Need GPUs
- Strong isolation/SLA for a small number of endpoints
Otherwise: use serverless functions for everything except the heavy matmul.
2. Concurrency blowups on Lambda (inference stampede)
Common pattern:
- One noisy tenant / partner / cron job spikes traffic
- Lambda concurrency auto-scales aggressively
- Downstream:
- DynamoDB gets throttled
- SQS queues explode
- Feature stores / RDS connections hit limits
Mitigations:
- Per-tenant throttling at API Gateway
- Reserved concurrency on critical Lambdas
- SQS or Kinesis between Lambda and any non-elastic backend
- DynamoDB capacity planning with adaptive capacity enabled, plus careful partition key design
Without this, a single inference stampede can take down unrelated workloads.
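Per-tenant throttling is best done at the edge (API Gateway usage plans), but the same guard can be sketched in-process as a token bucket. The rate and burst capacity below are arbitrary examples:

```python
import time

class TenantThrottle:
    """Per-tenant token bucket: `rate` tokens/sec, burst up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.buckets = {}  # tenant -> (tokens, last_timestamp)

    def allow(self, tenant, now=None):
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(tenant, (self.capacity, now))
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[tenant] = (tokens - 1.0, now)
            return True
        self.buckets[tenant] = (tokens, now)
        return False

# A noisy tenant burns through its burst; other tenants are unaffected.
throttle = TenantThrottle(rate=10, capacity=5)
print(sum(throttle.allow("noisy", now=0.0) for _ in range(20)))  # 5
print(throttle.allow("quiet", now=0.0))                          # True
```

The key property is isolation: the noisy tenant exhausts its own bucket while the quiet tenant's requests still go through.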
3. Half-a-platform: everyone rolls their own workflow
You see:
- One team uses Step Functions, another uses Airflow-on-EC2, another uses cron in a container
- Different ways to:
- Put models in production
- Rollback
- Capture metrics
- Shared components (e.g., feature pipelines) are glued with ad-hoc scripts
This is how “ML platform” slowly becomes “bundle of tech debt”.
Fix:
- Declare a minimal platform:
- “We deploy via X”
- “We orchestrate via Y”
- “We log/trace via Z”
- Everything else is “you own it”. But those three are mandatory.
4. Treating ML latency budgets as infinite
Frequent issue:
- Online inference path: gateway → auth → feature store → model → post-processing → database write
- Each hop adds 20–80 ms
- No single owner for end-to-end latency
Serverless makes this worse if you:
- Ignore cold starts (un-tuned Lambdas)
- Call other Lambdas synchronously (chaining) instead of using Step Functions or batched calls
- Don’t instrument p95/p99
You must set explicit budgets:
- e.g., 150 ms total → 40 ms feature fetch, 60 ms model, 50 ms everything else
- Build dashboards that track measured latency against each budget; fail the build if new code exceeds it.
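Enforcing a per-hop budget in CI can be a simple check. The numbers mirror the 150 ms example above; the hop names are illustrative:

```python
# Budget from the example above: 150 ms total, split across hops.
BUDGET_MS = {"feature_fetch": 40, "model": 60, "everything_else": 50}

def check_budget(measured_p95_ms: dict) -> list:
    """Return one violation message per hop that exceeds its budget."""
    return [
        f"{hop}: p95 {measured_p95_ms[hop]:.0f} ms > budget {limit} ms"
        for hop, limit in BUDGET_MS.items()
        if measured_p95_ms.get(hop, 0) > limit
    ]

violations = check_budget({"feature_fetch": 35, "model": 72, "everything_else": 48})
print(violations)  # only the model hop is over budget
```

Run this against a short load test in the pipeline and the "no single owner for end-to-end latency" problem becomes a red build with a named culprit.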
5. Hidden cost in “ops glue” services
Not obvious culprits:
- Always-on ECS/Fargate tasks polling queues that could be Lambda triggers
- Step Functions with extremely chatty workflows doing trivial work in 100s of states
- “Helper” EC2 instances running one-off scripts or model conversions that could be short-lived jobs
Over a year, these out-of-sight resources often cost more than your main inference endpoints.
Practical playbook (what to do in the next 7 days)
You don’t need a grand “ML platform” redesign. Use this as a short, incremental audit.
Day 1–2: Inventory your ML data plane
Capture, in a shared doc:
- All real-time inference endpoints:
- Entry point (API Gateway, ALB, direct)
- Backend type (Lambda / ECS / SageMaker / other)
- Average QPS, p95 latency, 30-day cost
- All batch pipelines that run models:
- Trigger (time-based? event-based?)
- Orchestrator (Step Functions? Airflow? Cron?)
- Runtime (Lambda/ECS/EMR/etc.)
Goal: find three biggest cost centers and three highest-latency user-facing paths.
Day 3: Draw the inference path diagrams
For the top 2–3 user-facing inference paths:
- Draw actual call graph:
- Gateway → Auth → Service A → Service B → Model → DB/Cache/etc.
- For each hop:
- Note tech (Lambda/ECS/etc.)
- Typical latency
- Retry behavior
Ask:
- Which hops could be serverless Lambdas instead of always-on containers?
- Which can be merged to reduce hops?
- Where do you have sync chains of Lambdas that should be a Step Function?
Day 4: Set basic SLOs + observability guardrails
Define, per endpoint:
- p95 latency target
- Error rate target (e.g., <0.5%)
- Max acceptable monthly cost
Implement:
- A standard logging format (if you don’t already have one) with:
- trace_id / request_id
- model_name / version
- latency_ms
- status
- A CloudWatch dashboard per critical endpoint:
- Latency distribution
- Error rate
- Lambda concurrency or instance count
This alone often reveals obvious misconfigurations (e.g., default timeouts, hot code paths).
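The standard log format can be a one-line JSON record per request. Field names below follow the list above but are suggestions, not a fixed schema:

```python
import json, time, uuid

def log_inference(model_name, version, latency_ms, status, trace_id=None):
    """Emit one structured, CloudWatch-friendly JSON log line per request."""
    record = {
        "trace_id": trace_id or str(uuid.uuid4()),
        "request_id": str(uuid.uuid4()),
        "model_name": model_name,
        "version": version,
        "latency_ms": round(latency_ms, 1),
        "status": status,
        "ts": int(time.time() * 1000),
    }
    line = json.dumps(record)
    print(line)
    return line

entry = json.loads(log_inference("fraud-scorer", "v12", 43.27, 200, trace_id="abc-123"))
```

Because every field is a top-level JSON key, CloudWatch Logs Insights (or any log backend) can filter and aggregate by model, version, or tenant without regex gymnastics.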
Day 5–6: Pick one candidate for serverless refactor
Choose something:
- Non-GPU
- Medium traffic
- Painful to maintain but not mission critical
Refactor pattern examples:
1. Container → Lambda front door
- Keep the model in ECS/SageMaker
- Move:
- Auth
- Input validation
- Feature fetching and caching
- Into a Lambda front door behind API Gateway
Immediate benefits:
- Reduce load on the model service
- Measure “everything but the model” separately
- Easier to evolve feature logic
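A sketch of that front-door handler, with the feature fetch stubbed out. `fetch_features` and the downstream hand-off are hypothetical; in production the stub would read DynamoDB/ElastiCache and the enriched payload would be forwarded to the model service:

```python
import json

def fetch_features(entity_id):
    """Stub: in production, read DynamoDB/ElastiCache with a short TTL cache."""
    return {"txn_count_7d": 14, "avg_amount": 52.3}  # illustrative values

def handler(event, context=None):
    """Lambda front door: validate, enrich, then hand off to the model service."""
    body = json.loads(event.get("body") or "{}")
    if "entity_id" not in body:
        return {"statusCode": 400, "body": json.dumps({"error": "entity_id required"})}
    enriched = {**body, "features": fetch_features(body["entity_id"])}
    # In production: POST `enriched` to the ECS/SageMaker model service here.
    return {"statusCode": 200, "body": json.dumps(enriched)}

resp = handler({"body": json.dumps({"entity_id": "cust-42"})})
print(resp["statusCode"])  # 200
```

Everything above the model call is now cheap, independently measurable, and reusable by batch scoring and shadow-mode paths.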
2. Batch EC2 script → Step Functions + Lambda
- If you have a nightly EC2 script running a model:
- Replace with EventBridge schedule → Step Function
- Step Function fans out to Lambdas (if CPU) or ECS tasks (if heavy) for parallelism
Benefits:
- Automatic retries / backoff
- Visibility into which partitions/jobs fail
- Easy to scale out without re-architecting
Day 7: Plan the “minimum viable platform” for ML
With what you learned, define these hard decisions:
- For new ML workloads:
- Orchestration: “We use X” (often Step Functions)
- Online entry point: API Gateway or ALB + Lambda, unless GPU-heavy
- Observability: supported metrics, logs, traces
- Decommission targets:
- At least 1–2 always-on EC2/ECS “glue” tasks to be retired in the next month
- Reliability guardrails:
- Per-tenant throttling for inference
- Reserved concurrency for critical Lambdas
- Default timeout and retry policies
Write this up in 1–2 pages. That’s your ML platform charter, even if you don’t call it that.
Bottom line
ML on AWS doesn’t need a new buzzword platform. It needs normal cloud engineering discipline applied to training, inference, and data pipelines:
- Use serverless for everything that doesn’t absolutely require a container or GPU.
- Treat the ML stack as three planes—control, data, observability—and standardize each.
- Watch for hidden costs in glue code and over-provisioned endpoints.
- Put latency and cost SLOs on models the same way you do for any other production service.
If you do this, “ML in production” stops being a science experiment and becomes what it should be: just another well-understood workload on your AWS platform—with understandable trade-offs, predictable reliability, and a bill that matches reality.
