Your Serverless Stack Is a Soft Target: Hardening AWS Before an Incident Response Firm Retires on Your Dime
Why this matters right now
Cloud security on AWS used to mean “don’t leak S3 buckets” and “rotate IAM keys.” That’s quaint now.
Serverless, ephemeral infrastructure, and platform engineering have shifted the failure modes:
- Blast radius is smaller, but blast frequency is higher.
- Privilege is more granular, but IAM graph complexity explodes.
- Infra is more automated, but so are misconfigurations.
Attackers aren’t “hacking Lambda” in some exotic way. They’re walking straight through:
- Over-permissive roles on serverless functions and containers
- CI/CD pipelines with broader rights than production workloads
- Misconfigured cross-account access and broken multi-tenant isolation
- AWS-native services (S3, SSM, STS, EventBridge) abused as lateral movement rails
If your mental model is still “WAF + GuardDuty + encryption at rest,” you will detect breaches late and contain them poorly.
The constraint: you still need to ship fast, stay cost-effective, and not turn your platform team into a ticket queue. That means treating security as cloud engineering work, not a separate religion.
This post is about how to do that specifically for an AWS-heavy, serverless-centric stack.
What’s actually changed (not the press release)
Three shifts matter for security in real AWS environments.
1. Everything is a programmable control plane
Terraform, CloudFormation, CDK, Pulumi, GitHub Actions, CodePipeline:
Your entire estate is a giant, scriptable remote shell.
- Previously: compromise a box → pivot by SSH, abuse long-lived creds.
- Now: compromise a pipeline or runner → rewrite IAM, deploy backdoored Lambdas, tweak security groups, add new cross-account roles.
The control plane is the new crown jewel. Most orgs still treat it as “just CI.”
2. Serverless reduced attack surface per node, but multiplied it
Lambda, Fargate, API Gateway, Step Functions, DynamoDB, EventBridge:
- Less: No open ports on random EC2s, no SSH, no patching agents.
- More: Hundreds/thousands of functions and microservices, each with:
- Their own IAM roles
- Their own triggers (API, SQS, S3, EventBridge, Cron)
- Their own environment variables and secrets access
The graph size skyrockets; humans can’t reason about it unaided.
3. Observability matured; identity observability did not
You probably have:
- Centralized logs (CloudWatch + something)
- Metrics and traces (X-Ray, OpenTelemetry, Datadog/Prometheus/etc.)
- Dashboards for latency, errors, cold starts, cost
But:
- IAM access patterns are opaque
- Cross-account role usage is poorly monitored
- STS token issuance isn’t treated as a first-class signal
- “Who can actually do X” is answered with tribal knowledge, not data
This is where real breaches hide: identity relationships, not CPU graphs.
How it works (simple mental model)
A practical mental model for AWS cloud security with serverless:
Infra as code + identity graph + event fabric. Control those three and you control risk.
1. Infra as Code (IaC) is your single source of truth — or it’s lying
Reality is one of:
- Strong IaC discipline: 95%+ of resources created by automation, drift detection in place, manual changes rare and reviewed.
- Mixed IaC + click-ops: CloudFormation/Terraform does most, but people still “just fix it in the console.”
- IaC theater: templates exist but prod differs, no code review culture, security config drift is constant.
Security posture quality strongly correlates with which bucket you’re in.
2. Identity graph is the real perimeter
Think of your AWS environment as:
- Nodes: Users, roles, services (Lambda, ECS, EC2, CI runners)
- Edges: Which principals can assume which roles, read which S3 buckets, invoke which Lambdas, write which queues, read which Parameter Store paths.
Breaches travel along edges:
- Compromise a CI token → assume `deployment-role`
- `deployment-role` has `AdministratorAccess` in dev and “almost admin” in prod
- Attacker adds a new Lambda with an inline policy, triggers it with EventBridge, and uses it as a covert channel
If you can’t visualize or query this graph, you’re operating blind.
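A graph this size is still queryable with very little code. A minimal sketch in Python, with hypothetical principal and role names standing in for edges you would extract from your IAM trust policies:

```python
from collections import deque

# Edges: principal -> roles it can assume (hypothetical names, as they
# might be extracted from sts:AssumeRole trust relationships).
can_assume = {
    "ci-runner": ["deploy-dev", "deploy-prod"],
    "deploy-dev": ["data-reader"],
    "deploy-prod": [],
    "data-reader": [],
}

def reachable_roles(start: str) -> set[str]:
    """All roles a principal can reach via chained sts:AssumeRole (BFS)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in can_assume.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Blast radius of a compromised CI token:
print(sorted(reachable_roles("ci-runner")))
# → ['data-reader', 'deploy-dev', 'deploy-prod']
```

The same walk, run from every external principal in your trust policies, tells you which edges actually matter.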
3. Event fabric is the attack choreography
AWS is now an event mesh:
- S3 PUT triggers Lambda
- API Gateway triggers Lambda
- EventBridge schedules or fans out to multiple services
- SQS/SNS glue everything
Good security uses the fabric to:
- Detect abuse paths (unexpected function invocations, role assumptions)
- Enforce policies (runtime checks on dangerous operations)
- Short-circuit damage (auto-revoke, quarantine, disable)
Bad security ignores it and assumes GuardDuty + CloudTrail are “enough.”
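Using the fabric for containment can be as simple as a responder that slams a deny-all inline policy onto a suspect role when an alert fires. A hedged sketch (role and policy names are hypothetical; in production the client would be `boto3.client("iam")`):

```python
import json

DENY_ALL_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Deny", "Action": "*", "Resource": "*"}],
}

def quarantine_role(iam_client, role_name: str) -> None:
    """Attach an explicit deny-all inline policy to a suspect role.

    An explicit Deny overrides every Allow in IAM policy evaluation, so
    this neutralizes the role (including already-issued session
    credentials) without deleting anything, which preserves evidence.
    """
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName="incident-quarantine",
        PolicyDocument=json.dumps(DENY_ALL_POLICY),
    )

# In production, pass boto3.client("iam"); here a stub records the call.
class _StubIam:
    def __init__(self):
        self.calls = []

    def put_role_policy(self, **kwargs):
        self.calls.append(kwargs)

iam = _StubIam()
quarantine_role(iam, "deployment-role")
```

Wired to an EventBridge rule on your highest-confidence detections, this turns “we saw it” into “we stopped it” without a human in the loop.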
Where teams get burned (failure modes + anti-patterns)
Failure mode #1: “Least privilege later” that never comes
Pattern:
- Team launches quickly with “temporary” wildcards:
  - `lambda:InvokeFunction` on `*`
  - `s3:*` on a shared data bucket
  - `dynamodb:*` for “experimentation”
- No backlog item ever prioritizes tightening permissions.
Consequence:
- One compromised function or role = direct path to:
- Exfiltrate all customer data from an S3 data lake
- Modify queue consumers to siphon messages
- Rewrite Lambda environment variables to inject secrets exfiltration
Example:
A fintech startup had a Lambda that processed user documents. Its role had `s3:*` on all buckets in the account “for convenience.” An exposed API vulnerability allowed attackers to invoke that function with arbitrary parameters. Result: bulk document exfiltration, no need to breach S3 directly.
Failure mode #2: CI/CD pipelines as God mode
Pattern:
- GitHub Actions / GitLab / CodeBuild has an IAM role with:
  - `iam:PassRole` on almost everything
  - `cloudformation:*` or `sts:AssumeRole` into prod
- Runners are:
- Internet-facing
- Shared across projects
- Using long-lived access tokens
Consequence:
- Compromise the CI platform → silent rewrite of your entire infra posture:
- Insert logic into Lambdas
- Loosen security group rules
- Create backdoor roles
- Add shadow logging sinks
Example:
A mid-size SaaS provider’s GitHub Actions runner had a secret with an IAM user that could assume the prod deployment role. An attacker exploited a third-party GitHub Action, retrieved the secret, and deployed a “diagnostic Lambda” that quietly mirrored a subset of database queries to an attacker-controlled S3 bucket in another account. Detection took months.
Failure mode #3: Over-trusting AWS account boundaries
Pattern:
- “Prod is safe; it’s a separate AWS account.”
- But:
- Dev account has ability to assume cross-account roles into prod
- Shared CI system deploys to all accounts
- Network connectivity or shared secrets cross boundaries
Consequence:
- Breach in dev or staging (usually weaker) used as a trampoline into prod.
- You’ve effectively created a multi-account monolith with porous walls.
Failure mode #4: Observability focused on performance, not identity
Pattern:
- Excellent visibility into:
- p99 latency
- Error rates
- Cold starts
- Lambda durations and memory usage
- Minimal visibility into:
- Unusual role assumptions
- Sudden spikes in `GetParameter`/`Decrypt` for secrets
- API calls from unusual regions or services
- Lambda functions that suddenly start invoking KMS, STS, or S3 in new ways
Consequence:
- You detect degradation but not exfiltration.
- Incident response time depends on luck, not telemetry.
Failure mode #5: “Security as a separate platform”
Pattern:
- Security tooling is deployed by security team, not embedded in platforms:
- Separate scanners, separate dashboards, separate pipelines.
- Platform team:
- Doesn’t own findings
- Has incentive to bypass checks under delivery pressure
Consequence:
- Drift between “security expectations” and “what infra actually allows.”
- Paper compliance, weak runtime security.
Practical playbook (what to do in the next 7 days)
You won’t fix everything in a week, but you can materially reduce risk if you’re focused.
Day 1–2: Inventory your identity blast radius
- Dump IAM roles and attached policies in prod and staging.
- Identify:
  - Roles with `AdministratorAccess` or `*:*`
  - Roles used by:
    - CI/CD systems
    - Lambda functions
    - ECS/Fargate tasks
- For each high-privilege role, answer:
  - Who/what can assume it? (`sts:AssumeRole` trust relationships)
  - Which external identities (GitHub, Okta, other AWS accounts) are in the trust policy?
Output: a ranked list of “if this principal is compromised, how bad is it?”
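Ranking doesn’t need a product; a crude score over the dumped policy documents gets you a defensible first pass. A sketch, with hypothetical role names and an arbitrary weighting you would tune:

```python
def blast_radius_score(policies: list[dict]) -> int:
    """Crude risk score for a role from its policy documents:
    wildcard actions and resources weigh heaviest."""
    score = 0
    for doc in policies:
        stmts = doc.get("Statement", [])
        if isinstance(stmts, dict):  # single-statement shorthand
            stmts = [stmts]
        for stmt in stmts:
            if stmt.get("Effect") != "Allow":
                continue
            actions = stmt.get("Action", [])
            if isinstance(actions, str):
                actions = [actions]
            for action in actions:
                if action == "*":
                    score += 100      # full admin
                elif action.endswith(":*"):
                    score += 10       # service-wide wildcard
                else:
                    score += 1
            if stmt.get("Resource") == "*":
                score += 10
    return score

# Hypothetical dump: role name -> its policy documents.
roles = {
    "ci-deploy-prod": [{"Statement": [
        {"Effect": "Allow", "Action": "*", "Resource": "*"}]}],
    "thumbnailer": [{"Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject"],
         "Resource": "arn:aws:s3:::uploads/*"}]}],
}
ranked = sorted(roles, key=lambda r: blast_radius_score(roles[r]), reverse=True)
print(ranked)  # riskiest first
# → ['ci-deploy-prod', 'thumbnailer']
```

The absolute numbers don’t matter; the ordering is what tells you where to spend the week.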
Day 3: Lock down CI/CD before anything else
- Reduce CI/CD role privileges. Aim for:
  - Specific CloudFormation stacks or prefixes
  - Limited `iam:PassRole` to an explicit allowlist
  - No direct `AdministratorAccess`
- Separate roles per environment: `ci-deploy-dev`, `ci-deploy-staging`, `ci-deploy-prod`, with progressively narrower permissions.
- Harden the runner environment:
  - Use short-lived credentials from OIDC or STS instead of stored AWS keys.
  - Ensure runners are not shared across organizations or projects with different trust levels.
This is usually the single highest-leverage security move for AWS cloud engineering.
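If you move CI to OIDC, the trust policy is where the scheme succeeds or fails: it must pin the `sub` claim to your repo, not merely trust the GitHub provider. A sketch of a check you might run in CI (account ID and repo names are placeholders, and it only inspects `StringLike` conditions):

```python
def trust_policy_pins_repo(trust_policy: dict, repo: str) -> bool:
    """True if every statement trusting the GitHub OIDC provider also
    pins the `sub` claim to the given repo."""
    sub_key = "token.actions.githubusercontent.com:sub"
    for stmt in trust_policy.get("Statement", []):
        principal = stmt.get("Principal", {}).get("Federated", "")
        if "token.actions.githubusercontent.com" not in principal:
            continue
        sub = stmt.get("Condition", {}).get("StringLike", {}).get(sub_key, "")
        if not sub.startswith(f"repo:{repo}:"):
            return False  # provider trusted, but repo not pinned
    return True

# Hypothetical trust policy for a ci-deploy role.
trust = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Federated":
            "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"},
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {"StringLike": {
            "token.actions.githubusercontent.com:sub":
                "repo:acme/platform:ref:refs/heads/main"
        }},
    }],
}
print(trust_policy_pins_repo(trust, "acme/platform"))  # → True
```

An unpinned `sub` (or a missing condition entirely) means any repository that can mint a GitHub OIDC token can assume your deploy role.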
Day 4: De-risk your serverless roles
For your top 20 most-invoked Lambda functions / Fargate tasks:
- Check their IAM role:
  - Remove wildcards:
    - Replace `s3:*` with the 2–3 required actions.
    - Restrict S3 resources to specific buckets and (ideally) prefixes.
  - Split duties: e.g., one function for reading from S3, another for writing to DynamoDB, each with minimal permissions.
- Disable environment-variable secrets where possible:
  - Move secrets to Parameter Store (with KMS) or Secrets Manager.
  - Ensure function roles have least-privilege access to those paths or secrets.
Even if you only fix 5–10 functions, you materially reduce the probability of catastrophic exfiltration.
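Concretely, the wildcard-removal step turns a statement like the first one below into the second (bucket name and prefix are hypothetical; shown as Python dicts for easy diffing in review):

```python
# Before: the "convenient" wildcard from Failure mode #1.
too_broad = {
    "Effect": "Allow",
    "Action": "s3:*",
    "Resource": "*",
}

# After: only the actions the function uses, only the prefix it touches.
scoped = {
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": "arn:aws:s3:::user-documents/incoming/*",
}
```

A stolen credential for the second role can read and write one prefix; a stolen credential for the first can empty the account.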
Day 5: Add identity-focused observability
Use CloudTrail and whatever logging stack you have. Add basic detectors:
- Alerts on:
  - `AssumeRole` into production roles from unfamiliar principals or accounts
  - `PutUserPolicy`, `PutRolePolicy`, `AttachRolePolicy` in prod
  - Sudden spikes in `GetParameter` for secure strings or `Decrypt` calls for KMS keys used for secrets
- Simple sanity metrics:
  - Count of IAM roles over time (alert on unusual growth)
  - Number of functions with `*:*` in their policies
Correlate these with your existing logging system and paging mechanism. This is not full-blown detection engineering; it’s basic smoke alarms.
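One of these smoke alarms, sketched as a pure function you could run over CloudTrail records in whatever pipeline you already have (the account IDs and role prefix are placeholders; field names follow the CloudTrail record format):

```python
PROD_ROLE_PREFIX = "arn:aws:iam::123456789012:role/prod-"  # placeholder
EXPECTED_ACCOUNTS = {"123456789012", "210987654321"}       # prod + CI (placeholders)

def is_suspicious_assume_role(event: dict) -> bool:
    """Flag AssumeRole calls into prod roles from unexpected accounts."""
    if event.get("eventName") != "AssumeRole":
        return False
    role_arn = event.get("requestParameters", {}).get("roleArn", "")
    caller_account = event.get("userIdentity", {}).get("accountId", "")
    return (role_arn.startswith(PROD_ROLE_PREFIX)
            and caller_account not in EXPECTED_ACCOUNTS)

# A parsed CloudTrail record from an account you've never seen:
event = {
    "eventName": "AssumeRole",
    "userIdentity": {"accountId": "999999999999"},
    "requestParameters": {
        "roleArn": "arn:aws:iam::123456789012:role/prod-deploy"},
}
print(is_suspicious_assume_role(event))  # → True
```

The point is not this exact rule; it’s that identity signals are just dict lookups once you stop ignoring them.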
Day 6–7: Move one security control into your platform
Choose one of:
- IAM policy linting in CI: reject merges where a new or changed IAM policy includes:
  - `*:*`
  - `iam:*`
  - `kms:*`
  - `s3:*` on account-wide resources
- Baseline resource tagging for security:
  - Enforce tags like `owner`, `data_classification`, and `environment` on S3 buckets, Lambdas, and databases.
  - Use them to drive backup policies, access alert thresholds, and cost and risk reporting.
In both cases, the platform team owns the mechanism. Security defines guardrails; platform makes them real.
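A linter for the first option fits in a few dozen lines. A sketch of the core check (the banned list and the `s3:*` rule mirror the guardrails above; a real version would inspect Resource scoping more carefully):

```python
BANNED_ACTIONS = {"*", "*:*", "iam:*", "kms:*"}

def lint_policy(doc: dict) -> list[str]:
    """Return a list of violations; an empty list means the policy passes."""
    violations = []
    stmts = doc.get("Statement", [])
    if isinstance(stmts, dict):  # single-statement shorthand
        stmts = [stmts]
    for stmt in stmts:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        for action in actions:
            if action in BANNED_ACTIONS:
                violations.append(f"banned action: {action}")
            elif action == "s3:*" and stmt.get("Resource") == "*":
                violations.append("s3:* on account-wide resources")
    return violations

print(lint_policy({"Statement": [
    {"Effect": "Allow", "Action": "s3:*", "Resource": "*"}]}))
# → ['s3:* on account-wide resources']
```

Run it in CI against every changed policy file and fail the merge on a non-empty result; exceptions go through review, not around it.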
Bottom line
For modern AWS cloud engineering — serverless, IaC, platform teams — security is no longer a separate discipline bolted on at the edge. It’s:
- Identity and permissions engineering
- Control-plane hardening
- Event-driven detection built into your platform
Focus less on “is S3 encrypted at rest?” and more on:
- Who can change which IAM roles, from where?
- What can your CI/CD system actually do?
- How far can an attacker go if they own a single Lambda, or a single GitHub Action?
- Can you detect and contain that within hours, not weeks?
The teams that get this right treat AWS security as a core feature of their platform — versioned, tested, observable — not a separate compliance box.
If you can’t answer “what’s the worst thing this role could do if stolen?” with data, not vibes, you have work to do.
