Your ML Platform Doesn’t Need Kubernetes: Doing Real MLOps on Boring AWS Serverless
Why this matters right now
Most teams deploying machine learning on AWS are quietly stuck in the same trap:
- Training happens in ad‑hoc notebooks and EC2 boxes.
- Inference runs in a “temporary” Docker container that became permanent.
- Data pipelines are stitched together with whatever the first engineer shipped.
Meanwhile, infra costs and operational noise creep up:
- GPUs are idle 80% of the time but always on.
- Spiky inference traffic forces over‑provisioning.
- Debugging a bad model output requires 4 people and 5 tools.
And then someone says: “We need a real ML platform. Let’s stand up a Kubernetes cluster.”
You probably don’t.
For a lot of production ML workloads, you can get:
- Lower cost (especially at low–medium scale)
- Higher reliability with fewer moving parts
- Better observability tied into your existing AWS stack
- Less platform engineering headcount
by treating AWS’ “boring” serverless and managed services as your ML substrate: Lambda, Step Functions, EventBridge, SQS, DynamoDB, Aurora Serverless, API Gateway, CloudWatch, plus selective use of SageMaker where it makes sense.
This isn’t about “no ops” or chasing the latest managed AI service. It’s about using cloud engineering fundamentals to build ML systems that:
- Can be operated by your existing DevOps/SRE team.
- Have clear unit economics.
- Fail in predictable, observable ways.
What’s actually changed (not the press release)
A lot of people’s mental model of “serverless on AWS” is still 2018 Lambda + API Gateway. For ML, that used to be painful:
- Cold starts were brutal for Python + bigger packages.
- Limited memory & CPU made anything non‑trivial awkward.
- No good story for GPUs or large batch workloads.
The ecosystem and platform have moved:
- Lambda is now viable for more ML inference than you think:
  - Up to 10 GB of memory and proportionally more CPU.
  - SnapStart for Java (and provisioned concurrency as the cold‑start mitigation for other runtimes).
  - The 15‑minute max runtime handles many “online but heavy” predictions.
  - Graviton2/3 runtimes with better price/performance.
- Step Functions + EventBridge have matured into a workflow backbone:
  - Native integration with most AWS data & ML services.
  - Express Workflows for high‑volume, lower‑latency orchestration.
  - Sane visualisation and execution history for distributed flows.
- SageMaker became modular instead of all‑or‑nothing:
  - You can cherry‑pick features like:
    - Training jobs on managed infrastructure.
    - Batch Transform for large offline inference.
    - Model Registry without buying into the full Studio+Pipelines story.
  - Integration patterns with EventBridge and Step Functions are much better.
- Storage and data engineering tools are friendlier to ML workloads:
  - Aurora Serverless v2 is viable for low‑ops, high‑availability feature stores.
  - DynamoDB On‑Demand simplifies low‑traffic inference stores.
  - Glue + Lake Formation stabilize the data lake story enough for many teams.
So the realistic 2026 baseline is:
- Training & batch scoring: SageMaker jobs + Step Functions.
- Online inference: Lambda or Fargate, fronted by API Gateway or an ALB.
- Pipelines: Step Functions + EventBridge + S3 + Glue.
- Metadata and features: RDS/Aurora or DynamoDB, sometimes plus a Parquet+S3 lake.
You can build this without touching a single Kubernetes manifest.
How it works (simple mental model)
Use this mental model for “serverless ML on AWS”:
1. Split your ML into three planes
- Data plane – where data flows:
  - Storage (S3, Aurora Serverless, DynamoDB)
  - Streaming (Kinesis, sometimes MSK)
- Compute plane – where work runs:
  - Lambda for short‑lived inference & glue code
  - SageMaker training jobs / Batch Transform
  - ECS on Fargate for long‑running services (EC2‑backed ECS where you need GPUs)
- Control plane – what orchestrates and observes:
  - Step Functions, EventBridge, SQS
  - CloudWatch + X‑Ray + structured logging
Think: “data flows, compute reacts, control coordinates.”
2. Derive patterns from latency + state, not from tools
For each ML use case, answer:
- How fast must the prediction be?
- How big is the request/response payload?
- How much state does the model need per call?
Then pick patterns:
- Synchronous, low‑latency inference (< 300 ms p95)
  - Pattern: API Gateway → Lambda → model in memory / EFS.
  - Use for: recommendation, personalization, fraud checks at checkout.
- Synchronous, medium latency (seconds)
  - Pattern: API Gateway → SQS → Lambda worker pool → callback/webhook.
  - Use for: document classification, image processing, complex feature computation.
- Asynchronous batch inference
  - Pattern: S3 event → Step Functions → SageMaker Batch Transform.
  - Use for: nightly scoring, churn prediction, bulk customer updates.
- Training & retraining
  - Pattern: EventBridge schedule or data trigger → Step Functions → SageMaker Training → Model Registry → deployment workflow.
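The first pattern can be sketched as a minimal Python Lambda handler. The stub linear model and feature names below are placeholders for a real artifact you would load from the deployment package, S3, or EFS:

```python
import json

# Load the model once per execution environment (init phase), not per request.
# A stub linear model stands in for a real artifact here (assumption).
MODEL_VERSION = "2024-01-candidate"
WEIGHTS = {"recency": 0.6, "frequency": 0.4}

def predict(features: dict) -> float:
    # Toy scoring: weighted sum over known features, ignoring extras.
    return sum(WEIGHTS[k] * float(features[k]) for k in WEIGHTS if k in features)

def handler(event, context=None):
    """API Gateway (proxy integration) -> Lambda entry point."""
    try:
        body = json.loads(event.get("body") or "{}")
        score = predict(body.get("features", {}))
    except (ValueError, TypeError) as exc:
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}
    return {
        "statusCode": 200,
        "body": json.dumps({"score": score, "model_version": MODEL_VERSION}),
    }
```

Loading the model at module scope is the key move: it amortizes the artifact load across every warm invocation in that execution environment.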
3. Treat models like build artefacts
- Build pipeline outputs:
  - A versioned model file (S3 or SageMaker model package).
  - An immutable inference container (ECR).
  - A contract (OpenAPI spec, JSON schema).
- Deployment pipeline:
  - Roll out to a “candidate” stage (Lambda alias, canary, or weighted target group).
  - Shadow or partial traffic.
  - Promote on metrics (latency, error rate, business KPIs).
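The “promote on metrics” gate can be as small as a pure function your deployment workflow calls with metrics pulled from CloudWatch. The thresholds here are illustrative defaults, not recommendations:

```python
def should_promote(candidate: dict, baseline: dict,
                   max_p95_regression: float = 1.10,
                   max_error_rate: float = 0.01) -> bool:
    """Gate a canary promotion on latency and error-rate metrics.

    candidate/baseline are dicts like {"p95_ms": 120.0, "error_rate": 0.002},
    e.g. pulled from CloudWatch for the Lambda alias under test.
    """
    if candidate["error_rate"] > max_error_rate:
        return False  # absolute error budget blown
    if candidate["p95_ms"] > baseline["p95_ms"] * max_p95_regression:
        return False  # more than 10% p95 regression vs current prod
    return True
```

Business-KPI gates usually need longer observation windows than latency and errors, so teams often automate the first two and keep the KPI check as a manual promotion step.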
4. Observability is the product, not the add‑on
Minimum for each model:
- Structured logs (JSON) including:
  - model_version
  - feature_flags
  - input_schema_version
  - latency buckets
- Metrics:
  - p50/p95 latency, error rates, throughput.
  - Cold‑start count (for Lambda).
- Traces:
  - Correlate model calls with upstream request IDs.
- Drift & quality (where feasible):
  - Simple distribution checks of key features.
  - Label‑delayed performance (e.g., CTR, conversion).
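For the “simple distribution checks”, a Population Stability Index per key feature is often enough. A self-contained sketch; the bin count and the < 0.1 / 0.25 thresholds are conventional rules of thumb to tune per feature:

```python
import math

def psi(expected: list, actual: list, bins: int = 10) -> float:
    """Population Stability Index between a training-time feature sample
    (expected) and a recent production sample (actual). Rule of thumb:
    < 0.1 stable, 0.1-0.25 worth a look, > 0.25 investigate the feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def bin_fractions(sample):
        counts = [0] * bins
        for x in sample:
            i = min(int((x - lo) / width), bins - 1)
            counts[max(i, 0)] += 1
        # Floor at a small epsilon so empty bins don't produce log(0).
        return [max(c / len(sample), 1e-4) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Running this nightly per feature from a Step Functions task and emitting the result as a CloudWatch metric gives you a drift alarm without any extra infrastructure.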
Where teams get burned (failure modes + anti-patterns)
1. “We’ll standardize on Kubernetes… and then never finish standardizing”
Pattern:
- Every ML team builds their own Helm charts.
- Custom sidecars for metrics, feature fetching, secret management.
- You now run:
  - The cluster.
  - The platform on the cluster.
  - The models on the platform on the cluster.
Typical outcome:
- Infra team spends all its time on cluster upgrades.
- ML teams still don’t have self‑service deployment.
- Cost visibility is poor; unit costs per prediction remain fuzzy.
When is Kubernetes justified?
- You have dozens of ML services with 24/7 high traffic and:
  - Need fine‑grained autoscaling.
  - Need tight colocation with other compute.
- Or you have hard GPU packing or custom networking requirements.
Otherwise, serverless + managed services usually wins on total cost of ownership.
2. Over‑centralized “ML platform” that becomes a bottleneck
Pattern:
- A central team is told to “build the ML platform” in one shot.
- They try to standardize everything (data, features, infra, deployment, monitoring) before first success.
- Product teams wait on platform features.
Failure mode:
- 18 months later, you have:
  - A half‑used feature store.
  - A brittle pipeline DSL.
  - Three partially integrated CI/CD systems.
Better pattern: platform as a product with the thinnest viable slice first:
- Start with: deployment templates + basic observability.
- Add: standard inference images & infra modules.
- Only later: feature store, orchestration abstractions.
3. Underestimating cost and concurrency with serverless
Pattern:
- “Lambda is cheap” → no concurrency planning.
- A single ML endpoint accidentally scales to 10k concurrent invocations.
- Downstream effects:
  - DynamoDB throttling.
  - S3 request rate limits.
  - VPC NAT gateway cost spikes.
Common ML‑specific pitfalls:
- Large model artifacts loaded on every cold start (hundreds of MB).
- Zero control over concurrency per model = noisy neighbors.
Mitigations:
- Concurrency limits per Lambda (and per alias).
- Separate Lambda functions per model or model family.
- Keep model artefacts small; use quantization and splitting.
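On keeping artefacts small: even naive linear int8 quantization cuts a float32 weight file roughly 4x on disk, which directly shrinks cold-start load time. A minimal sketch; a real model would use its framework’s quantization tooling instead:

```python
import struct

def quantize_int8(weights: list) -> tuple:
    """Linear int8 quantization: ~4x smaller than float32 on disk.
    Returns (packed bytes, scale) so the Lambda can dequantize at init."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    quantized = [round(w / scale) for w in weights]
    return struct.pack(f"{len(quantized)}b", *quantized), scale

def dequantize_int8(packed: bytes, scale: float) -> list:
    """Inverse transform, run once in the Lambda init phase."""
    return [v * scale for v in struct.unpack(f"{len(packed)}b", packed)]
```

The trade-off is a small, bounded precision loss per weight, which is usually acceptable for tabular and embedding-style models; validate accuracy on a holdout set before shipping the quantized artifact.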
4. Hidden training and experiment costs
Pattern:
- SageMaker or GPU training jobs started from notebooks.
- No lifecycle hooks or budget guards.
- Parameter sweeps and AutoML jobs with default settings.
Outcome:
- “Why did our AWS bill triple?” followed by grep‑driven archeology in CloudTrail.
Mitigations:
- Mandatory tags for any job (project, owner, environment).
- Hard budgets and alerts per team.
- Predefined training job blueprints with:
  - Max parallelism.
  - Reasonable stopping criteria.
  - Checkpointing to resume instead of restart.
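A “reasonable stopping criterion” can be as simple as patience on validation loss. A sketch of the rule a training blueprint might enforce between epochs; the parameter values are placeholders to tune per job type:

```python
def should_stop(val_losses: list, patience: int = 3, min_delta: float = 1e-3) -> bool:
    """Patience-based early stopping: stop when validation loss hasn't
    improved by at least min_delta in the last `patience` evaluations."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge yet
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta
```

Combined with checkpointing, this bounds the cost of any single run: a job that has stopped improving ends early instead of burning its full time budget.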
Practical playbook (what to do in the next 7 days)
You don’t need a 6‑month roadmap to get value. In a week, you can materially improve reliability and cost.
Day 1–2: Inventory and baseline
- Map your ML workloads. For each model, record:
  - Purpose (what business decision it affects).
  - Call pattern (sync/async, latency SLO).
  - Current runtime (EC2/ECS/Lambda/SageMaker).
  - Traffic volume and schedule (spiky, steady, batched).
- Collect cost and reliability data by pulling:
  - Per‑service AWS cost (focus: EC2, SageMaker, Lambda, ECR, data transfer).
  - Per‑model request volumes, latency, error rates.
- If you don’t have per‑model metrics, call that out explicitly as a gap.
Day 3: Choose 1–2 target workloads
Criteria for good first candidates:
- Medium business criticality (not life‑or‑death, not toy).
- Non‑GPU, modest payloads.
- Currently running on a snowflake EC2 host or an ageing container.
Goal: Move them to a serverless reference pattern:
- API Gateway → Lambda → model artifact in S3/EFS.
- Or S3 → EventBridge/Step Functions → SageMaker Batch Transform.
Day 4–5: Implement minimal serverless pattern
For the chosen model:
- Define a strict contract:
  - Request/response schema (JSON).
  - Input size limits; error codes.
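A strict contract is cheap to enforce in the handler itself. A hand-rolled validator sketch, assuming an illustrative schema of a `request_id` string plus a flat numeric `features` map (both names are examples, not a standard):

```python
def validate_request(body: dict, max_features: int = 50) -> list:
    """Return a list of contract violations for an inference request.
    Illustrative schema: {"request_id": str, "features": {str: number}}."""
    errors = []
    if not isinstance(body.get("request_id"), str):
        errors.append("request_id: required string")
    features = body.get("features")
    if not isinstance(features, dict):
        errors.append("features: required object")
    else:
        if len(features) > max_features:
            errors.append(f"features: at most {max_features} entries")
        for name, value in features.items():
            # bool is a subclass of int in Python, so exclude it explicitly.
            if not isinstance(value, (int, float)) or isinstance(value, bool):
                errors.append(f"features.{name}: must be a number")
    return errors
```

Returning all violations at once (rather than failing on the first) makes client-side debugging much faster; for larger schemas, a JSON Schema library does the same job declaratively.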
- Create infrastructure as code (CloudFormation or Terraform), including:
  - Lambda with reserved concurrency settings.
  - Alarms for error rate and latency.
  - Log retention policies.
- Wire in observability:
  - Structured logging, one JSON line per prediction:
    { model_version, request_id, latency_ms, success, key_features: {...} }
  - CloudWatch metrics derived from logs: latency and success/failure counts.
  - Basic X‑Ray integration for traces.
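The structured log line above can come from a tiny helper; the field names follow the shape shown, and the helper itself is hypothetical:

```python
import json
import time

def log_prediction(model_version: str, request_id: str, start: float,
                   success: bool, key_features: dict) -> str:
    """Emit one structured log line per prediction. CloudWatch metric
    filters can then derive latency and success-rate metrics directly
    from these lines without extra instrumentation."""
    line = json.dumps({
        "model_version": model_version,
        "request_id": request_id,
        "latency_ms": round((time.time() - start) * 1000, 1),
        "success": success,
        "key_features": key_features,
    })
    print(line)  # stdout goes to CloudWatch Logs in Lambda
    return line
```

One line per prediction, always the same keys: that consistency is what makes metric filters and Logs Insights queries trivial later.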
- Shadow deploy:
  - Mirror traffic from the existing endpoint to the new Lambda (read‑only).
  - Compare latency distributions, error rates, and output divergence (if possible).
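Output comparison reduces to joining paired results by request ID. A sketch of the divergence summary you might alert on before cutting over (the pair format is an assumption for illustration):

```python
def divergence_report(prod: list, shadow: list, tol: float = 1e-6) -> dict:
    """Compare paired outputs from the live endpoint and the shadow Lambda.
    Each element is a (request_id, score) pair; scores differing by more
    than tol count as divergent."""
    prod_by_id = dict(prod)
    divergent = [
        rid for rid, score in shadow
        if rid in prod_by_id and abs(prod_by_id[rid] - score) > tol
    ]
    compared = sum(1 for rid, _ in shadow if rid in prod_by_id)
    return {
        "compared": compared,
        "divergent": len(divergent),
        "divergence_rate": len(divergent) / compared if compared else 0.0,
    }
```

A nonzero divergence rate is not automatically a blocker (numerical differences between runtimes are common), but it should be explained before promotion.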
Day 6: Kill obvious waste
Using your baseline:
- Turn off or downscale:
  - Idle GPU instances.
  - Zombie SageMaker notebooks.
  - Always‑on training boxes.
- Apply guardrails:
  - Tag requirements enforced by SCPs or CI.
  - AWS Budgets with alerts per team.
- Right‑size:
  - Lambda memory (and thus CPU) based on actual runtime.
  - Autoscaling rules for any ECS/SageMaker endpoints.
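Right-sizing is easier with the unit economics written down. A back-of-envelope Lambda cost model; the per-GB-second and per-request prices are assumptions based on commonly quoted us-east-1 x86 rates, so check current pricing:

```python
def lambda_monthly_cost(invocations: int, avg_ms: float, memory_mb: int,
                        gb_second_price: float = 0.0000166667,
                        request_price: float = 0.0000002) -> float:
    """Rough monthly Lambda cost for one model endpoint. Default prices
    are assumed us-east-1 x86 rates (verify against current pricing);
    excludes free tier, provisioned concurrency, and data transfer."""
    gb_seconds = invocations * (avg_ms / 1000) * (memory_mb / 1024)
    return gb_seconds * gb_second_price + invocations * request_price
```

Because memory and CPU scale together, more memory sometimes lowers cost by cutting duration; measure `avg_ms` at each memory setting rather than assuming it is constant.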
Day 7: Decide your next platform slice
Don’t plan The One Platform; choose a thin vertical slice:
Pick one of:
- Standardized inference pattern
  - Publish a template (repo + IaC) for new model endpoints using Lambda or Fargate.
  - Require all new models to use it.
- Basic model registry
  - Even a DynamoDB/Aurora table tracking model name, version, S3 URI, status, and deployment stages.
  - Integrate with CI so each merge produces a new version entry.
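The registry really can start as a single table. An in-memory sketch of the interface; a real version would back the dict with a DynamoDB or Aurora table keyed on (name, version):

```python
import datetime

class ModelRegistry:
    """In-memory stand-in for a DynamoDB/Aurora registry table."""

    def __init__(self):
        self._rows = {}

    def register(self, name: str, version: str, s3_uri: str) -> dict:
        """CI calls this on merge: one immutable row per built model."""
        row = {
            "name": name, "version": version, "s3_uri": s3_uri,
            "status": "registered", "stage": "none",
            "created_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }
        self._rows[(name, version)] = row
        return row

    def promote(self, name: str, version: str, stage: str) -> dict:
        """Deployment workflow calls this after a successful rollout."""
        row = self._rows[(name, version)]
        row["status"], row["stage"] = "deployed", stage
        return row

    def latest(self, name: str, stage: str):
        """What's currently serving in a stage; None if nothing is."""
        rows = [r for r in self._rows.values()
                if r["name"] == name and r["stage"] == stage]
        return max(rows, key=lambda r: r["version"]) if rows else None
```

The interface, not the storage engine, is the valuable part: once `register`/`promote`/`latest` exist, every deployment script and dashboard can build on them.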
- ML‑aware dashboards
  - Per‑model dashboards with traffic, latency, error rate, and a cost approximation.
  - Make it the primary “what’s wrong?” view for on‑call.
Bottom line
You don’t need an “ML platform” in the abstract. You need:
- Predictable deployments.
- Clear costs per prediction/training run.
- Observability that lets you answer:
- “What changed?”
- “What did it cost?”
- “Should we roll back?”
For a large class of workloads, AWS serverless and managed services are the shortest path to that outcome:
- Lambda + Step Functions + EventBridge for orchestration.
- SageMaker where it actually delivers leverage (training, batch).
- Boring storage (S3, Aurora Serverless, DynamoDB) for features and artefacts.
- CloudWatch + X‑Ray + structured logs for observability.
You can always add Kubernetes later if your scale and constraints truly demand it. Most teams never reach that point—and pay a high complexity tax trying anyway.
If you start by treating ML like any other production system on AWS—subject to the same expectations for reliability, security, and cost—you’ll likely end up with a serverless‑first architecture that your existing engineers can run, your finance team can understand, and your CTO can defend to the board.
