Your ML Platform Is Not Special: Serverless Patterns That Actually Work on AWS
Why this matters right now
Most ML teams on AWS are stuck in an awkward middle state:
- The infra looks like 2015 (manually managed EC2, EKS clusters, ad‑hoc cron boxes).
- The workloads look like 2024 (LLM inference, feature stores, streaming feature pipelines).
- The bills look like a rounding error… until you hit product‑market fit.
The result:
- Training jobs that stall or fail at 70% without clear debugging hooks.
- “Real‑time” inference that is actually 800ms p95 with random timeouts.
- Platform teams spending more time on patching and capacity planning than on ML ergonomics.
- CFOs asking why the “AI line item” is both expensive and correlated with outages.
Serverless on AWS is not new. But applying serverless patterns to ML workloads is still poorly understood. The marketing suggests “infinite scale” and “pay per use.” The reality is:
- Many ML systems are over‑provisioned where they shouldn’t be.
- The critical paths are under‑provisioned where it matters.
- Observability is an afterthought, so nobody knows why costs or latencies spike.
This post covers how to design AWS‑native, serverless‑oriented ML architectures that:
- Contain blast radius for cost and reliability.
- Provide traceable data and inference paths.
- Are boring enough that you can actually operate them.
What’s actually changed (not the press release)
Three shifts in AWS make serverless‑first ML architectures viable today in a way that wasn’t true 3–4 years ago:
- Mature managed data planes
  - S3, DynamoDB, Kinesis, EventBridge, and Step Functions are no longer rough‑edge betas.
  - IAM integration, cross‑account patterns, and failure semantics are well‑understood.
  - For ML, this means you can treat “data moving and orchestration” as mostly solved, and focus on the model.
- Reasonable GPU and model‑serving options
  - ECS/EKS with managed GPU instances + autoscaling can be wrapped in serverless edges:
    - API Gateway → Lambda → SQS → GPU worker pool.
  - Serverless‑ish options (e.g., Fargate, Lambda with container images, on‑demand endpoints in managed ML services) now exist with:
    - Autoscaling that you can script.
    - Acceptable cold‑start characteristics for many classes of workloads.
- Traces and metrics are no longer optional
  - CloudWatch Logs, metrics, and X‑Ray are usable at scale (if you design for them).
  - Structured logging + distributed tracing across Lambda, Step Functions, and containers can give you:
    - Per‑request lineage from input → feature fetch → model → post‑processing.
    - Reliable cost attribution per service or feature.
None of this is “new” in the press‑release sense. What’s new is the combination:
- High‑quality managed primitives + OK model serving options + usable observability.
- This lets you build ML platforms that don’t require a Kubernetes SRE team just to keep the lights on.
How it works (simple mental model)
Think in three planes: data, inference, and control. Use serverless for coordination and scale edges; use “just enough” managed compute where tight control is needed.
1. Data plane: immutable, event‑driven, boring
Core principles:
- Immutable storage: S3 as the source of truth for training and batch inference inputs/outputs.
- Event‑driven enrichment: Use Lambda to respond to data events (S3, Kinesis, DynamoDB Streams).
- Versioning everywhere: Bucket paths and table keys include `dataset_id`, `model_version`, and `run_id`.
Concrete patterns:
- S3 `PUT` → EventBridge → Step Functions → Lambda jobs that:
  - Validate schema.
  - Compute derived features.
  - Write to a feature store (often DynamoDB or a columnar store on S3).
- “Feature at request time”:
  - API Gateway → Lambda → DynamoDB `BatchGetItem` for features → model inference endpoint.
This keeps data movement explicit and auditable. It also decouples ingestion, processing, and training.
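The versioning convention above is worth encoding once and reusing everywhere. A minimal sketch, assuming Hive‑style `key=value` path segments (the function name and layout are illustrative, not a real library API):

```python
def build_run_prefix(dataset_id: str, model_version: str, run_id: str,
                     stage: str = "features") -> str:
    """Build an immutable S3 prefix encoding dataset, model, and run.

    Every artifact written under this prefix is traceable back to the
    exact inputs that produced it.
    """
    for name, value in [("dataset_id", dataset_id),
                        ("model_version", model_version),
                        ("run_id", run_id)]:
        if not value or "/" in value:
            raise ValueError(f"{name} must be a non-empty, path-safe token")
    return f"{stage}/dataset={dataset_id}/model={model_version}/run={run_id}/"
```

A side benefit of the `key=value` segments is that engines like Athena and Glue can treat them as partitions, so the same layout serves both lineage and analytics.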
2. Inference plane: managed edges + bounded heavy compute
Distinguish between:
- Latency‑sensitive, high QPS inference (fraud checks, ranking, personalization).
- Throughput‑oriented or batch inference (daily recommendations, risk scores).
Patterns:
- Latency‑sensitive:
  - API Gateway → Lambda (auth, feature lookup, input validation, logging) →
  - Internal HTTP call to a long‑lived model server running on ECS/EKS or a managed endpoint, with:
    - Autoscaling based on request count + CPU/GPU + latency.
    - Circuit breakers and timeouts at the Lambda boundary.
- Batch / async:
  - S3 manifest or Kinesis stream → SQS → Lambda workers or Fargate jobs that:
    - Pull batches of items, run inference, write results to S3/DynamoDB.
    - Report progress via CloudWatch metrics and Step Functions.
The mental model:
- Use serverless for spiky edges and orchestration.
- Use fixed or auto‑scaled pools for hot model execution.
- Keep the boundary between them explicit (queues, APIs, or events).
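The “circuit breaker at the Lambda boundary” mentioned above is just a small state machine. A dependency‑free sketch, with illustrative defaults (open after N consecutive failures, allow a trial call after a cooldown):

```python
import time

class CircuitBreaker:
    """Open the circuit after `max_failures` consecutive failures;
    reject calls until `reset_after` seconds pass, then allow a trial
    call (half-open)."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        # Half-open: let one trial request through after the cooldown.
        return now - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = now
```

In a Lambda in front of a model pool, `record_failure` would fire on timeouts or 5xx responses from the model server, and a rejected request would return a fast fallback instead of piling more load onto a struggling pool.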
3. Control plane: workflows and experiments
Training, hyper‑parameter search, retraining:
- Triggered by:
- Data freshness (new labeled data arrives).
- Performance drift (metrics cross thresholds).
- Orchestrated via Step Functions or a workflow engine sitting on top of it.
Key ideas:
- Training jobs themselves may run on:
- Managed ML service training jobs, or
- ECS/EKS batch jobs, or
- Spot fleets with checkpointing to S3.
- All of that is wrapped in a serverless workflow:
- Validate data → spin up training → monitor → evaluate metrics → conditionally deploy.
You can reason about the ML system as:
- Events and workflows (serverless) coordinating
- Heavy compute (containers/VMs/managed jobs) with clear SLAs.
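The “evaluate metrics → conditionally deploy” step deserves to be an explicit, testable function rather than glue buried in a workflow definition. A sketch, where the metric names and thresholds are illustrative assumptions:

```python
def should_deploy(candidate: dict, baseline: dict,
                  min_auc_gain: float = 0.002,
                  max_latency_regression_ms: float = 20.0) -> bool:
    """True if the candidate model beats the baseline by enough to
    justify a deploy, without regressing latency too much."""
    auc_gain = candidate["auc"] - baseline["auc"]
    latency_regression = (candidate["p95_latency_ms"]
                          - baseline["p95_latency_ms"])
    return (auc_gain >= min_auc_gain
            and latency_regression <= max_latency_regression_ms)
```

A Step Functions `Choice` state (or a tiny Lambda) can call exactly this logic, which means the deploy gate can be unit‑tested offline instead of only being exercised in production workflows.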
Where teams get burned (failure modes + anti-patterns)
1. “All‑Lambda‑everything” for ML
Misstep:
- Trying to run training, feature pipelines, and inference entirely on Lambda for “fully serverless.”
Problems:
- Time limits, memory caps, no GPUs.
- Hard to debug and profile.
- Cost explodes when you’re doing CPU‑heavy work in 15‑minute slots.
Smell:
- Single Lambda with 3000+ lines of code doing ETL + training + evaluation.
Better:
- Use Lambda for orchestration and light preprocessing.
- Put real workloads on ECS/Fargate/batch.
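“Lambda for orchestration” in practice often means a thin handler that validates input and enqueues work, nothing more. A sketch with the queue client injected so the logic stays testable; the event shape and queue URL are illustrative assumptions:

```python
import json
import uuid

def make_handler(sqs_client, queue_url: str):
    """Return a Lambda-style handler that validates the event and
    enqueues a work item; heavy lifting happens in the worker pool."""
    def handler(event, context=None):
        body = event.get("body") or {}
        if "input_uri" not in body:
            return {"statusCode": 400,
                    "body": json.dumps({"error": "input_uri required"})}
        job_id = str(uuid.uuid4())
        sqs_client.send_message(
            QueueUrl=queue_url,
            MessageBody=json.dumps({"job_id": job_id,
                                    "input_uri": body["input_uri"]}),
        )
        # 202: accepted for async processing, not completed.
        return {"statusCode": 202, "body": json.dumps({"job_id": job_id})}
    return handler
```

The handler does no feature computation and no inference; if it ever takes more than a few hundred milliseconds, that is a signal the work belongs on the other side of the queue.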
2. Ignoring data volume and access patterns
Misstep:
- Putting “features” into DynamoDB because it’s “fast,” then querying them with unbounded scans.
Problems:
- Hot partitions, throttling, unpredictable latency.
- Massive cost when QPS or data scales.
Smell:
- DynamoDB tables with generic keys like `user_id` and no real partitioning strategy.
Better:
- Use DynamoDB only for low‑latency, key‑based lookups.
- Use S3 + columnar formats (Parquet) + query engines for analytics and bulk feature computation.
- Cache hot features in memory (ECS service sidecar cache, or in‑process maps) where appropriate.
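One concrete fix for hot partitions is write sharding: spread a hot logical key across N physical partition keys on write, and fan reads back in across all N. A sketch, where the shard count and key format are illustrative:

```python
import hashlib

def shard_key(user_id: str, shard_count: int = 8, salt: str = "") -> str:
    """Map a logical key (plus a per-write salt, e.g. an event ID) to
    one of `shard_count` physical partition keys, e.g. 'user#alice#3'."""
    digest = hashlib.md5(f"{user_id}{salt}".encode()).hexdigest()
    shard = int(digest, 16) % shard_count
    return f"user#{user_id}#{shard}"

def all_shards(user_id: str, shard_count: int = 8) -> list[str]:
    """All physical keys a reader must query to see the full picture."""
    return [f"user#{user_id}#{i}" for i in range(shard_count)]
```

The trade‑off is explicit: writes spread across partitions, but reads for one logical key cost up to `shard_count` queries, so this only pays off for keys that are genuinely hot.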
3. No per‑request observability
Misstep:
- Only aggregate metrics (e.g., average latency) and raw CloudWatch logs.
Problems:
- You cannot debug “this inference was wrong/slow/expensive.”
- Hard to correlate input distributions with failure or cost patterns.
Smell:
- No correlation ID that flows from API Gateway → Lambda → model server → data store.
Better:
- Generate a `trace_id` at the entrypoint (API Gateway or Step Functions).
- Pass it through:
  - Logs (structured JSON).
  - Metrics (as dimensions, if cardinality is manageable).
  - Downstream service calls.
- Sample payloads and model outputs with trace IDs into a debug bucket (GDPR/PII aware).
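Propagating the trace ID means every log line is a structured record that carries it. A minimal sketch using only the standard library; the field names are illustrative, not a fixed schema:

```python
import json
import time
import uuid

def new_trace_id() -> str:
    """Generate a correlation ID at the entrypoint."""
    return uuid.uuid4().hex

def log_event(trace_id: str, service: str, event: str, **fields) -> str:
    """Emit one structured JSON log line; CloudWatch Logs Insights can
    then filter and join on trace_id across services."""
    record = {"ts": time.time(), "trace_id": trace_id,
              "service": service, "event": event, **fields}
    line = json.dumps(record)
    print(line)  # Lambda/ECS stdout lands in CloudWatch Logs
    return line
```

Once every tier logs in this shape, answering “what happened to request X?” becomes a single query on `trace_id` instead of grepping four unrelated log groups.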
4. Misaligned autoscaling policies
Misstep:
- Relying on CPU utilization only for autoscaling your model servers.
- Not coordinating concurrency limits between API Gateway, Lambda, and downstream pools.
Problems:
- Thundering herds on cold start.
- Cascading failures when one tier scales faster than another.
Smell:
- Spiky p95 latency with no corresponding CPU/GPU spike.
Better:
- Autoscale on:
- Request count, concurrent connections, or custom “queue length” metrics.
- Latency SLOs for inference endpoints.
- Set concurrency limits at Lambda and API Gateway:
- Keep them in line with what your model pool can handle.
- Use queue buffers (SQS) where latency constraints allow.
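Scaling on a “queue length” metric usually means a backlog‑per‑worker target: compute the workers needed to drain the current backlog within a window, then clamp to the pool’s limits. A sketch with illustrative numbers:

```python
import math

def desired_workers(queue_depth: int, per_worker_rate: float,
                    target_drain_seconds: float = 60.0,
                    min_workers: int = 1, max_workers: int = 50) -> int:
    """Workers needed to drain the backlog within the target window,
    clamped to the pool's min/max.

    per_worker_rate: items one worker processes per second.
    """
    if per_worker_rate <= 0:
        raise ValueError("per_worker_rate must be positive")
    needed = math.ceil(queue_depth / (per_worker_rate * target_drain_seconds))
    return max(min_workers, min(max_workers, needed))
```

Publishing `desired_workers` as a custom CloudWatch metric and target‑tracking on it keeps the scaling policy aligned with what you actually care about (drain time), rather than a proxy like CPU.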
Practical playbook (what to do in the next 7 days)
This assumes you already run ML workloads on AWS and want to move toward a reliable, cost‑aware serverless pattern.
Day 1–2: Draw the real system
- Whiteboard (or tool of choice):
- Entry points: APIs, batch triggers, cron jobs.
- Data stores: S3 buckets, databases, caches.
- Compute: Lambdas, ECS/EKS, training jobs.
- Mark each edge as:
- Synchronous (RPC) or asynchronous (SQS, events).
- For each inference path, note:
- p95 latency requirement.
- Availability/SLO.
- Current cost estimate (even rough).
Outcome: a map of your current AWS ML architecture.
Day 3: Identify the three worst offenders
Pick:
- The slowest critical inference path.
- The costliest recurring ML workload (training or batch inference).
- The least observable pipeline (where incidents are black boxes).
For each, ask:
- Is the hot code path running in the right place?
- Is the coordination serverless, or are we manually managing machines where we don’t need to?
Day 4–5: Implement one boundary correctly
Choose one of the offenders and:
- Introduce an explicit boundary:
- Example: API Gateway → Lambda → SQS → ECS GPU workers for LLM inference.
- Add basic observability:
- Generate `trace_id` at the API.
- Structured logging in Lambda and ECS.
- CloudWatch dashboards:
  - Request rate, errors, latency, queue depth, worker utilization.
Goal:
- You can answer: “What happens when QPS doubles?” with more than a shrug.
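Answering “what happens when QPS doubles?” starts with a back‑of‑envelope concurrency model via Little’s Law (in‑flight requests = arrival rate × latency). A sketch; the headroom factor and per‑replica concurrency are illustrative assumptions:

```python
import math

def required_replicas(qps: float, p95_latency_s: float,
                      concurrency_per_replica: int,
                      headroom: float = 1.5) -> int:
    """Replicas needed to serve `qps` given per-request latency and
    per-replica concurrency, with headroom for spikes."""
    in_flight = qps * p95_latency_s  # Little's Law
    return math.ceil(in_flight * headroom / concurrency_per_replica)
```

Plugging in doubled QPS immediately shows whether your max pool size and Lambda concurrency limits leave room, before the traffic arrives.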
Day 6: Cost visibility by component
Add simple, low‑effort tagging/attribution:
- Tag resources with `service`, `team`, and `env`.
- For one workload, break down cost by:
- Data storage (S3, DB).
- Coordination (Lambda, Step Functions).
- Compute (ECS/EKS, training jobs, endpoints).
- Compare to business metrics:
- Cost per 1k inferences.
- Cost per training run.
- Cost per daily batch re‑scoring.
This doesn’t have to be perfect; you just need directional signal.
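“Cost per 1k inferences” is simple arithmetic once component costs are tagged; a sketch, where the component names are illustrative:

```python
def cost_per_1k_inferences(component_costs: dict[str, float],
                           inference_count: int) -> float:
    """Total component cost for a period, normalized per 1,000
    inferences served in that period."""
    if inference_count <= 0:
        raise ValueError("inference_count must be positive")
    total = sum(component_costs.values())
    return 1000.0 * total / inference_count
```

Tracking this one number per workload over time is usually enough directional signal: a step change means either traffic shifted or someone changed the architecture.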
Day 7: Decide on a north‑star pattern
Based on what you saw:
- Pick one standard pattern for each:
- Real‑time inference.
- Batch inference.
- Training and retraining.
- Document them:
- Which AWS services.
- Default observability requirements.
- Default autoscaling policy baseline.
- Make them the “paved road”:
- New ML services must justify deviating.
Bottom line
ML on AWS doesn’t need bespoke, artisanal infrastructure.
The patterns are straightforward:
- Use serverless for orchestration, spiky edges, and control planes.
- Use managed compute pools for heavy, stateful, or GPU‑bound work.
- Make data and traces first‑class: version everything, log everything (within reason), and propagate correlation IDs.
Teams that get this right:
- Ship models faster because they’re not reinventing infra.
- Contain cost because they can reason about who pays for what.
- Improve reliability because failures are localized to well‑defined boundaries.
If your current ML platform feels fragile, expensive, or opaque, the issue is usually not “we need a new tool.” It’s that the architecture ignores the basic serverless patterns AWS actually supports well.
Start by drawing the real system, then move one boundary at a time toward an explicit, observable, serverless‑coordinated design.
