Your ML Inference Pipeline Is Not “Serverless” (And It’s Costing You Real Money)
Why this matters right now
Most teams that “run ML on AWS” are actually running three different systems:
- A training stack (often on EC2 or managed services)
- A real-time inference API (maybe SageMaker, maybe ECS, maybe Lambda)
- A batch / feature / data engineering system (Glue, EMR, Step Functions, Lambda, Fargate, whatever looked least painful at the time)
They’re loosely integrated, observability is partial, and the AWS bill is opaque. The result:
- Inference is cheap-ish per call and mysteriously expensive per month
- Latency SLOs drift as you layer on feature stores, model registries, and AB testing
- Reliability is fragile, especially around deployment and rollback
- Platform engineering becomes firefighting integration issues instead of improving capabilities
Serverless patterns on AWS (Lambda, Step Functions, EventBridge, DynamoDB, SQS, Kinesis) actually fit ML workloads well, if you design around their constraints. Used incorrectly, they produce:
- Hidden idle cost (e.g., always-on containers you don’t need)
- Unbounded concurrency blowups
- Tracing gaps across training, feature computation, and inference
This post is about treating ML pipelines as first-class cloud engineering problems, not “special” snowflakes. The goal: predictable cost, predictable latency, and boring reliability.
What’s actually changed (not the press release)
Three concrete things have shifted over the last 2–3 years that matter for ML on AWS:
1. Serverless got good enough for most inference paths
AWS added:
- Larger Lambda sizes (CPU/RAM), arm64, and better cold start behavior
- Step Functions with lower pricing for high-volume workflows
- EventBridge with decent routing/fanout
- API Gateway and Lambda integration that’s less painful than it used to be
That means:
- Many “we need a container for this” arguments don’t hold for typical online inference
- You can run moderately heavy models (non-GPU) in Lambda with tolerable latency
2. GPU workloads are still not serverless, but their edges are
You are not going to run serious GPU inference in Lambda. But:
- Request routing, aggregation, fanout, feature materialization, and pre/post-processing are perfect for serverless.
- You can reduce your GPU fleet by moving everything except the actual matmul off the GPU nodes.
- Spot + autoscaling + a smart serverless front-door gets you 80% of the “serverless GPU” dream.
3. Platform engineering expectations have increased
Teams expect:
- One way to deploy ML services (training jobs, inference endpoints, batch pipelines)
- One way to observe them (traces/metrics/logs)
- One way to bill them (cost by team/model/feature)
You won’t get this using “whatever each team likes” on AWS. You need a platform opinion, and AWS’s building blocks are finally decent enough to standardize around:
- Lambda + Step Functions for orchestration
- SQS/Kinesis/EventBridge for decoupling
- DynamoDB/S3 as primary state/backing
- CloudWatch/X-Ray/embedded metrics for observability (or your own stack, but wired consistently)
How it works (simple mental model)
Instead of thinking in “ML” terms, think in three planes:
- Control Plane: Training, deployment, configuration
- Data Plane: Online inference and batch scoring
- Observability Plane: Metrics/traces/logs + model performance signals
A pragmatic AWS/serverless partition looks like this:
1. Control Plane (slow, stateful, low-volume)
- Tech: Step Functions, Lambda, CodeBuild/CodePipeline, SageMaker training jobs or ECS
- Owns:
- Kicking off training jobs
- Registering models (S3 + DynamoDB as a registry is often enough)
- Creating/updating inference endpoints (ECS services, SageMaker endpoints, or Lambda aliases)
- Running canary / shadow / AB rollout workflows
Mental model: A small set of state machines (Step Functions) represent:
TrainModel → Evaluate → Register → DeployCanary → ShiftTraffic → Finalize / Rollback, with automated criteria
These are low-frequency workflows; serverless overhead doesn’t matter.
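That rollout state machine can be sketched as an Amazon States Language definition built in Python. The state names follow the workflow above; the Lambda ARNs and the accuracy threshold are placeholders, not a prescribed schema:

```python
import json

# Illustrative ASL skeleton for TrainModel → Evaluate → Register →
# DeployCanary → ShiftTraffic → Finalize / Rollback.
# Resource ARNs and the 0.9 threshold are placeholders.
ROLLOUT_DEFINITION = {
    "StartAt": "TrainModel",
    "States": {
        "TrainModel": {"Type": "Task", "Resource": "arn:aws:lambda:...:train", "Next": "Evaluate"},
        "Evaluate": {"Type": "Task", "Resource": "arn:aws:lambda:...:evaluate", "Next": "CheckMetrics"},
        "CheckMetrics": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.accuracy", "NumericGreaterThan": 0.9, "Next": "Register"}],
            "Default": "Rollback",
        },
        "Register": {"Type": "Task", "Resource": "arn:aws:lambda:...:register", "Next": "DeployCanary"},
        "DeployCanary": {"Type": "Task", "Resource": "arn:aws:lambda:...:deploy_canary", "Next": "ShiftTraffic"},
        "ShiftTraffic": {"Type": "Task", "Resource": "arn:aws:lambda:...:shift_traffic", "Next": "Finalize"},
        "Finalize": {"Type": "Succeed"},
        "Rollback": {"Type": "Fail", "Error": "CanaryFailed", "Cause": "Evaluation below threshold"},
    },
}

print(json.dumps(ROLLOUT_DEFINITION, indent=2))
```

The Choice state is where "automated criteria" live: promotion only happens when the evaluation step's output clears the threshold, otherwise the machine fails fast and you roll back.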
2. Data Plane (fast, stateless, high-volume)
Split this into real-time and batch:
Real-time API:
- Tech:
- API Gateway (or ALB) → Lambda for light/CPU inference
- API Gateway → Lambda → SQS/Kinesis → ECS/SageMaker for heavy/GPU inference
- Pattern:
- Lambda as the first hop: auth, validation, cheap feature fetching
- Dispatch to the right backend based on:
- Model type
- Latency SLO
- Payload size
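The dispatch decision in that first-hop Lambda can be a few lines of routing logic. The thresholds and backend names below are assumptions for illustration, not AWS defaults; tune them to your own measurements:

```python
from dataclasses import dataclass

@dataclass
class ModelRoute:
    model_type: str      # e.g. "linear", "tree", "transformer"
    latency_slo_ms: int  # end-to-end target for this endpoint
    payload_bytes: int

def pick_backend(route: ModelRoute) -> str:
    """Route a request to the cheapest backend that can meet its SLO.

    Thresholds here are illustrative placeholders.
    """
    if route.model_type in ("linear", "tree") and route.payload_bytes < 256_000:
        return "lambda"              # light CPU inference, in-process
    if route.latency_slo_ms >= 1_000:
        return "sqs-to-ecs"          # relaxed SLO -> async queue path
    return "sagemaker-endpoint"      # heavy model, tight SLO

print(pick_backend(ModelRoute("tree", 150, 4_096)))           # lambda
print(pick_backend(ModelRoute("transformer", 5_000, 2_000)))  # sqs-to-ecs
```

Keeping this decision in one place is the point: adding a new model means adding a route, not a new service.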
Batch / offline scoring:
- Tech: EventBridge scheduled rules, Step Functions, Lambda, Fargate/ECS tasks, S3
- Pattern:
- EventBridge triggers a Step Function that:
- Reads offsets/partitions from S3 or your lake
- Fans out work into parallel map tasks (Lambda for light CPU; ECS/Fargate for heavy jobs)
- Writes results back to S3 / a warehouse / a feature store
Mental model: The data plane is a small set of composable paths:
- “Synchronous request/response”
- “Async high-throughput jobs”
- “Periodic batch windows”
Each path is standardized; models plug into them.
3. Observability Plane (shared, boring, enforced)
- Tech: CloudWatch metrics, logs, X-Ray or your tracing stack, plus maybe DynamoDB/S3 for analytic events
- You standardize:
- Correlation IDs from API entry through model invocation
- Structured log schema (request ID, model ID, version, latency, response code, customer/tenant)
- A handful of default SLOs:
- p95 latency per endpoint
- error rate per endpoint
- cost per 1k requests (approximate)
- model-specific metrics (e.g., click-through, fraud catch rate) piped asynchronously
Mental model: Every path (control & data) emits the same basic telemetry. You can drill into any model the same way.
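Two of those default SLO signals, p95 latency and cost per 1k requests, are cheap to compute yourself. A minimal sketch, where the Lambda price constants are illustrative (us-east-1 x86 rates at time of writing; substitute your region's current pricing):

```python
import math

def p95(latencies_ms: list) -> float:
    """Nearest-rank p95 over a window of request latencies."""
    ordered = sorted(latencies_ms)
    idx = math.ceil(0.95 * len(ordered)) - 1
    return ordered[idx]

def lambda_cost_per_1k(avg_duration_ms: float, memory_gb: float,
                       price_per_gb_s: float = 0.0000166667,
                       price_per_request: float = 0.0000002) -> float:
    """Approximate Lambda cost per 1,000 invocations.

    Price constants are placeholders; check your region's rates.
    """
    per_request = (avg_duration_ms / 1000.0) * memory_gb * price_per_gb_s + price_per_request
    return per_request * 1000

print(p95([10, 20, 30, 40, 1000]))                      # the tail dominates
print(round(lambda_cost_per_1k(120, memory_gb=1.0), 4)) # dollars per 1k requests
```

Even an approximate cost-per-1k metric, emitted per model, turns "the bill is opaque" into a dashboard you can rank.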
Where teams get burned (failure modes + anti-patterns)
1. “We’ll just put it on SageMaker and it’s fine”
What happens:
- Teams pick managed endpoints and call it “done”
- Feature fetching, orchestration, and pre/post-processing end up embedded in the model container
- You can’t reuse that logic elsewhere (batch scoring, experiments, shadow modes)
- Endpoint scaling hides inefficient code with more instances instead of forcing refactors
Result: High per-hour instance cost and a brittle deployment pipeline.
When SageMaker or ECS endpoints make sense:
- Heavy models
- Need GPUs
- Strong isolation/SLA for a small number of endpoints
Otherwise: use serverless functions for everything except the heavy matmul.
2. Concurrency blowups on Lambda (inference stampede)
Common pattern:
- One noisy tenant / partner / cron job spikes traffic
- Lambda concurrency auto-scales aggressively
- Downstream:
- DynamoDB gets throttled
- SQS queues explode
- Feature stores / RDS connections hit limits
Mitigations:
- Per-tenant throttling at API Gateway
- Reserved concurrency on critical Lambdas
- SQS or Kinesis between Lambda and any non-elastic backend
- DynamoDB capacity planning with adaptive capacity enabled, plus careful partition key design
Without this, a single inference stampede can take down unrelated workloads.
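Per-tenant throttling is best done at the edge (API Gateway usage plans), but the same guard can be sketched in-process as a token bucket. The rate and burst capacity below are arbitrary examples:

```python
import time

class TenantThrottle:
    """Per-tenant token bucket: `rate` tokens/sec, burst up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.buckets = {}  # tenant -> (tokens, last_timestamp)

    def allow(self, tenant, now=None):
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(tenant, (self.capacity, now))
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[tenant] = (tokens - 1.0, now)
            return True
        self.buckets[tenant] = (tokens, now)
        return False

# A noisy tenant burns through its burst; other tenants are unaffected.
throttle = TenantThrottle(rate=10, capacity=5)
print(sum(throttle.allow("noisy", now=0.0) for _ in range(20)))  # 5
print(throttle.allow("quiet", now=0.0))                          # True
```

The key property is isolation: the noisy tenant exhausts its own bucket while the quiet tenant's requests still go through.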
3. Half-a-platform: everyone rolls their own workflow
You see:
- One team uses Step Functions, another uses Airflow-on-EC2, another uses cron in a container
- Different ways to:
- Put models in production
- Rollback
- Capture metrics
- Shared components (e.g., feature pipelines) are glued with ad-hoc scripts
This is how “ML platform” slowly becomes “bundle of tech debt”.
Fix:
- Declare a minimal platform:
- “We deploy via X”
- “We orchestrate via Y”
- “We log/trace via Z”
- Everything else is “you own it”. But those three are mandatory.
4. Treating ML latency budgets as infinite
Frequent issue:
- Online inference path: gateway → auth → feature store → model → post-processing → database write
- Each hop adds 20–80 ms
- No single owner for end-to-end latency
Serverless makes this worse if you:
- Ignore cold starts (un-tuned Lambdas)
- Call other Lambdas synchronously (chaining) instead of using Step Functions or batched calls
- Don’t instrument p95/p99
You must set explicit budgets:
- e.g., 150 ms total → 40 ms feature fetch, 60 ms model, 50 ms everything else
- Build dashboards that track measured latency against each budget; fail the build if new code exceeds it.
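Enforcing a per-hop budget in CI can be a simple check. The numbers mirror the 150 ms example above; the hop names are illustrative:

```python
# Budget from the example above: 150 ms total, split across hops.
BUDGET_MS = {"feature_fetch": 40, "model": 60, "everything_else": 50}

def check_budget(measured_p95_ms: dict) -> list:
    """Return one violation message per hop that exceeds its budget."""
    return [
        f"{hop}: p95 {measured_p95_ms[hop]:.0f} ms > budget {limit} ms"
        for hop, limit in BUDGET_MS.items()
        if measured_p95_ms.get(hop, 0) > limit
    ]

violations = check_budget({"feature_fetch": 35, "model": 72, "everything_else": 48})
print(violations)  # only the model hop is over budget
```

Run this against a short load test in the pipeline and the "no single owner for end-to-end latency" problem becomes a red build with a named culprit.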
5. Hidden cost in “ops glue” services
Not obvious culprits:
- Always-on ECS/Fargate tasks polling queues that could be Lambda triggers
- Step Functions with extremely chatty workflows doing trivial work in 100s of states
- “Helper” EC2 instances running one-off scripts or model conversions that could be short-lived jobs
Over a year, these out-of-sight resources often cost more than your main inference endpoints.
Practical playbook (what to do in the next 7 days)
You don’t need a grand “ML platform” redesign. Use this as a short, incremental audit.
Day 1–2: Inventory your ML data plane
Capture, in a shared doc:
- All real-time inference endpoints:
- Entry point (API Gateway, ALB, direct)
- Backend type (Lambda / ECS / SageMaker / other)
- Average QPS, p95 latency, 30-day cost
- All batch pipelines that run models:
- Trigger (time-based? event-based?)
- Orchestrator (Step Functions? Airflow? Cron?)
- Runtime (Lambda/ECS/EMR/etc.)
Goal: find three biggest cost centers and three highest-latency user-facing paths.
Day 3: Draw the inference path diagrams
For the top 2–3 user-facing inference paths:
- Draw actual call graph:
- Gateway → Auth → Service A → Service B → Model → DB/Cache/etc.
- For each hop:
- Note tech (Lambda/ECS/etc.)
- Typical latency
- Retry behavior
Ask:
- Which hops could be serverless Lambdas instead of always-on containers?
- Which can be merged to reduce hops?
- Where do you have sync chains of Lambdas that should be a Step Function?
Day 4: Set basic SLOs + observability guardrails
Define, per endpoint:
- p95 latency target
- Error rate target (e.g., <0.5%)
- Max acceptable monthly cost
Implement:
- A standard logging format (if you don’t already have one) with:
- trace_id / request_id
- model_name / version
- latency_ms
- status
- A CloudWatch dashboard per critical endpoint:
- Latency distribution
- Error rate
- Lambda concurrency or instance count
This alone often reveals obvious misconfigurations (e.g., default timeouts, hot code paths).
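The standard log format can be a one-line JSON record per request. Field names below follow the list above but are suggestions, not a fixed schema:

```python
import json, time, uuid

def log_inference(model_name, version, latency_ms, status, trace_id=None):
    """Emit one structured, CloudWatch-friendly JSON log line per request."""
    record = {
        "trace_id": trace_id or str(uuid.uuid4()),
        "request_id": str(uuid.uuid4()),
        "model_name": model_name,
        "version": version,
        "latency_ms": round(latency_ms, 1),
        "status": status,
        "ts": int(time.time() * 1000),
    }
    line = json.dumps(record)
    print(line)
    return line

entry = json.loads(log_inference("fraud-scorer", "v12", 43.27, 200, trace_id="abc-123"))
```

Because every field is a top-level JSON key, CloudWatch Logs Insights (or any log backend) can filter and aggregate by model, version, or tenant without regex gymnastics.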
Day 5–6: Pick one candidate for serverless refactor
Choose something:
- Non-GPU
- Medium traffic
- Painful to maintain but not mission critical
Refactor pattern examples:
1. Container → Lambda front door
- Keep the model in ECS/SageMaker
- Move:
- Auth
- Input validation
- Feature fetching and caching
- Into a Lambda front door behind API Gateway
Immediate benefits:
- Reduce load on the model service
- Measure “everything but the model” separately
- Easier to evolve feature logic
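A sketch of that front-door handler, with the feature fetch stubbed out. `fetch_features` and the downstream hand-off are hypothetical; in production the stub would read DynamoDB/ElastiCache and the enriched payload would be forwarded to the model service:

```python
import json

def fetch_features(entity_id):
    """Stub: in production, read DynamoDB/ElastiCache with a short TTL cache."""
    return {"txn_count_7d": 14, "avg_amount": 52.3}  # illustrative values

def handler(event, context=None):
    """Lambda front door: validate, enrich, then hand off to the model service."""
    body = json.loads(event.get("body") or "{}")
    if "entity_id" not in body:
        return {"statusCode": 400, "body": json.dumps({"error": "entity_id required"})}
    enriched = {**body, "features": fetch_features(body["entity_id"])}
    # In production: POST `enriched` to the ECS/SageMaker model service here.
    return {"statusCode": 200, "body": json.dumps(enriched)}

resp = handler({"body": json.dumps({"entity_id": "cust-42"})})
print(resp["statusCode"])  # 200
```

Everything above the model call is now cheap, independently measurable, and reusable by batch scoring and shadow-mode paths.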
2. Batch EC2 script → Step Functions + Lambda
- If you have a nightly EC2 script running a model:
- Replace with EventBridge schedule → Step Function
- Step Function fans out to Lambdas (if CPU) or ECS tasks (if heavy) for parallelism
Benefits:
- Automatic retries / backoff
- Visibility into which partitions/jobs fail
- Easy to scale out without re-architecting
Day 7: Plan the “minimum viable platform” for ML
With what you learned, define these hard decisions:
- For new ML workloads:
- Orchestration: “We use X” (often Step Functions)
- Online entry point: API Gateway or ALB + Lambda, unless GPU-heavy
- Observability: supported metrics, logs, traces
- Decommission targets:
- At least 1–2 always-on EC2/ECS “glue” tasks to be retired in the next month
- Reliability guardrails:
- Per-tenant throttling for inference
- Reserved concurrency for critical Lambdas
- Default timeout and retry policies
Write this up in 1–2 pages. That’s your ML platform charter, even if you don’t call it that.
Bottom line
ML on AWS doesn’t need a new buzzword platform. It needs normal cloud engineering discipline applied to training, inference, and data pipelines:
- Use serverless for everything that doesn’t absolutely require a container or GPU.
- Treat the ML stack as three planes—control, data, observability—and standardize each.
- Watch for hidden costs in glue code and over-provisioned endpoints.
- Put latency and cost SLOs on models the same way you do for any other production service.
If you do this, “ML in production” stops being a science experiment and becomes what it should be: just another well-understood workload on your AWS platform—with understandable trade-offs, predictable reliability, and a bill that matches reality.
