Your VPC Is Not a Security Boundary: Hard Truths About AWS Serverless Security
Why this matters right now
Serverless on AWS has quietly become the default for a lot of new work:
- APIs on API Gateway + Lambda or Lambda + ALB
- Event-driven glue with SQS, SNS, EventBridge
- Data pipelines with Kinesis, Lambda, Step Functions, DynamoDB
- Internal platforms built on top of these building blocks
From a security and cloud engineering perspective, that changes where you’re actually exposed:
- You have more identity and permissions edges (IAM, resource policies, service roles)
- You have less control of the OS and network, but more control of app logic and policies
- Observability and incident response now depend heavily on logs, traces, and config state, not host forensics
At the same time:
- Attackers are actually exploiting misconfigured IAM, public S3, vulnerable Lambda dependencies, and over-permissioned roles.
- Cloud bills are quietly inflated by always-on security scanning tools you barely use or understand.
- Platform teams are now responsible for securing capabilities, not just EC2 instances.
This post is about building a realistic mental model for AWS serverless security, and avoiding the assumptions that get teams breached or burned.
What’s actually changed (not the press release)
Strip away the marketing, and the real security-relevant changes with AWS serverless are:
- Network is weaker as a primary control; identity is stronger
- You used to think in terms of “inside the VPC = trusted.”
- With Lambda, API Gateway, EventBridge, S3, DynamoDB, Step Functions:
- Many things are publicly addressable by design (APIs, S3 policies, EventBridge buses).
- Resource policies + IAM conditions + auth layers are now more important than subnets and security groups.
- VPC-only Lambdas are common, but they often call public AWS control plane APIs that matter more than east-west traffic.
- The security boundary is now at the integration surface
- Instead of “the server running this app,” you have:
- API Gateway → Lambda → SQS → Lambda → DynamoDB → EventBridge → Step Functions → external SaaS
- Each hop has a separate auth and authorization model (IAM roles, resource policies, JWTs, custom auth).
- Mistakes in any one can yield privilege escalation or data exfiltration.
- Blast radius lives in IAM, not instance size
- A single over-permissioned Lambda execution role is now equivalent to a compromised root process on a large EC2 box.
- Lambda concurrency and auto-scaling can amplify an issue very quickly:
- Misconfigured role + compromised function = rapid data destruction or exfiltration.
- Observability has become your incident response substrate
- No host-level forensics. You have:
- CloudTrail, Lambda logs, API Gateway logs, VPC Flow Logs, ALB logs, X-Ray traces, Config snapshots.
- If you didn’t design for this upfront, your “forensics” during an incident will be guesswork.
- Platform engineering and security are now coupled
- Internal platforms (golden paths, templates, “paved road” stacks) define:
- Default IAM boundary strength
- Logging defaults
- Encryption defaults
- How easy it is to accidentally publish something to the internet
- In practice, the platform enforces most of your real security posture, not just your security team.
How it works (simple mental model)
You can reason about AWS serverless security using three layers:
1. Identity and policy plane (the real perimeter)
This is IAM, resource policies, and auth:
- Principals: IAM roles (Lambda execution roles, ECS tasks, CI jobs), users, federated identities.
- Permissions:
- Identity policies (attached to roles/users)
- Resource policies (S3 bucket policy, Lambda resource policy, API Gateway resource policy, SQS queue policy, KMS key policy).
- Conditions: `aws:SourceArn`, `aws:SourceAccount`, VPC endpoint conditions, source IP, organization ID, tags.
Think:
“Who can call which API, with what conditions, and what can that API then do on their behalf?”
This is where most real-world breaches start:
- Publicly accessible API with weak or missing authorization
- Lambda execution role with `*:*` or wildcards across sensitive services
- Resource policies that allow cross-account access without strong conditions
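To make that first triage concrete, here is a minimal sketch (in Python; not an official AWS tool) of the kind of check you can run against a role's policy document to spot wildcard grants. It handles the common JSON shapes only (single or list-valued `Statement`, `Action`, and `Resource`):

```python
# Sketch: flag IAM "Allow" statements that use broad wildcards.
def find_wildcard_statements(policy: dict) -> list[dict]:
    """Return Allow statements with '*' or 'service:*' actions, or a bare '*' resource."""
    flagged = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # single-statement shorthand
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        resources = stmt.get("Resource", [])
        if isinstance(resources, str):
            resources = [resources]
        broad_action = any(a == "*" or a.endswith(":*") for a in actions)
        broad_resource = "*" in resources  # bare "*" only; scoped ARNs pass
        if broad_action or broad_resource:
            flagged.append(stmt)
    return flagged

risky = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "dynamodb:*", "Resource": "*"}],
}
print(find_wildcard_statements(risky))  # flags the statement above
```

This is deliberately crude; it will miss prefix wildcards like `s3:Get*`, but it catches the grants that matter most in the breach patterns above.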
2. Data plane (movement and storage of data)
This is what actually flows:
- S3 objects, DynamoDB items, Kinesis records, SQS messages, event payloads, logs.
- Movement between services: Lambda reads/writes, Step Functions transitions, SNS fan-out, EventBridge routing.
Security questions here:
- Is data encrypted at rest (KMS keys, key policies, grants)?
- Is data encrypted in transit (TLS, mTLS where relevant, private endpoints)?
- What’s the maximum fan-out of any given message or event?
- Where can sensitive payloads leak (logs, DLQs, retry queues, debug output)?
3. Control plane (configuration and observability)
This is how the system is configured and monitored:
- CloudFormation, CDK, Terraform, serverless frameworks
- CloudWatch Metrics, Logs, X-Ray, CloudTrail, AWS Config
- Security tooling (Config rules, Security Hub, GuardDuty, custom detectors)
Security issues here:
- Who can change security-relevant config (IAM, bucket policies, API Gateway authorizers)?
- Are security-relevant changes reviewed and auditable?
- Do you have alerts that fire on misconfigurations, not just runtime events?
- Can you reconstruct “what happened” from logs if something goes wrong?
When you think about a serverless system, analyze it top-down:
- Identity plane: Who can call it / what can it call?
- Data plane: What sensitive data flows through / where can it go?
- Control plane: Who can change its behavior or weaken its security?
Where teams get burned (failure modes + anti-patterns)
1. “VPC = safe” fallacy
Example:
An internal team builds a “private” Lambda API behind an ALB, in private subnets. They:
- Use AWS SDK from the function to hit various services.
- Give the function an execution role with `AdministratorAccess` during prototyping.
- Never tighten it.
Incident pattern:
- SSRF or injection vulnerability in their handler → attacker runs arbitrary AWS SDK calls using that role.
- Result: reading S3 buckets, secret exfiltration from Secrets Manager, changing IAM.
Root issue: IAM, not the VPC, determined the real blast radius.
2. Over-permissioned cross-account access
Example:
A platform account hosts an EventBridge bus for org-wide events. An application account has:
- A rule in the platform account sending events to its bus.
- A resource policy on the app account’s bus allowing events from the platform account with `Principal: "*"` and missing or weak `aws:SourceAccount` conditions.
Failure mode:
- Any compromised principal in the platform account can send arbitrary events into the app account’s bus.
- Downstream Lambdas in the app account trust event origin and may take sensitive actions.
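A sketch of the tighter shape, with placeholder account, region, and org IDs: the statement names an explicit principal and adds a condition instead of `Principal: "*"`. (For cross-account `events:PutEvents`, `aws:PrincipalOrgID` is a common choice of condition key; which condition fits depends on who the caller is.)

```python
# Illustrative EventBridge bus resource policy; all IDs are placeholders.
import json

PLATFORM_ACCOUNT = "111111111111"  # hypothetical platform account

bus_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowPlatformEventsOnly",
        "Effect": "Allow",
        # Explicit principal instead of "*":
        "Principal": {"AWS": f"arn:aws:iam::{PLATFORM_ACCOUNT}:root"},
        "Action": "events:PutEvents",
        "Resource": "arn:aws:events:us-east-1:222222222222:event-bus/app-bus",
        # Belt-and-suspenders: also require the caller to be in your org.
        "Condition": {"StringEquals": {"aws:PrincipalOrgID": "o-exampleorgid"}},
    }],
}
print(json.dumps(bus_policy, indent=2))
```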
3. Public APIs with “soft” auth
Example:
- API Gateway + Lambda public endpoint for partner integrations.
- Security handled via custom headers or weakly validated JWTs.
- No WAF, no meaningful rate limits, no strict auth enforcement at the gateway level.
What goes wrong:
- Attackers brute-force or bypass the soft auth.
- Even without full compromise, they drive up Lambda invocations and downstream database load (cost + availability hit).
- Logs fill with sensitive payloads from noisy attack traffic.
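The fix is to enforce auth at the gateway, deny by default. Below is a minimal sketch of an API Gateway Lambda authorizer (simple-response format; HTTP APIs lowercase header names). The HMAC-signed token format and shared secret are illustrative assumptions; in production you would validate real JWTs against your IdP's keys (e.g., Cognito) and fetch secrets from Secrets Manager:

```python
# Sketch only: HMAC token check standing in for real JWT validation.
import hashlib
import hmac

SHARED_SECRET = b"example-secret"  # placeholder; never hard-code in practice

def token_is_valid(token: str) -> bool:
    """Illustrative token format: '<payload>.<hex hmac-sha256 of payload>'."""
    try:
        payload, signature = token.rsplit(".", 1)
    except ValueError:
        return False
    expected = hmac.new(SHARED_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)  # constant-time compare

def authorizer_handler(event, context):
    """Lambda authorizer, simple response: deny unless the token verifies."""
    token = event.get("headers", {}).get("authorization", "")
    return {"isAuthorized": token_is_valid(token)}
```

The important property is not the token scheme but the placement: requests that fail auth never reach your Lambdas or your database, which also caps the cost and availability damage from noisy attack traffic.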
4. Logs as a liability
Example:
- Lambdas and Step Functions log entire request/response payloads (including tokens, PII, secrets) to CloudWatch Logs for “debugging.”
- Devs forget to sanitize or disable this in production.
- No retention policies or log access limits.
Risk:
- Anyone with broad CloudWatch Logs or Athena-on-logs access has indirect access to sensitive data.
- In an incident, logs themselves become a breach surface.
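A cheap mitigation is to redact structured log payloads before anything is written to CloudWatch Logs. A minimal sketch; the key names are illustrative and you would extend the set for your own data:

```python
# Sketch: recursive redaction of sensitive fields in a log payload.
SENSITIVE_KEYS = {"password", "token", "authorization", "secret", "ssn"}

def redact(payload):
    """Replace values of sensitive keys with a marker, recursing into containers."""
    if isinstance(payload, dict):
        return {
            k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else redact(v)
            for k, v in payload.items()
        }
    if isinstance(payload, list):
        return [redact(item) for item in payload]
    return payload

print(redact({"user": "alice", "token": "eyJhbGci..."}))
```

Wiring this into a shared logging helper in your platform templates matters more than any single handler fix: it makes the safe path the default path.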
5. Security tools that inflate your bill without improving posture
Pattern:
- Multiple agents / scanners / config tools, each:
- Running frequent API scans across all accounts
- Producing noisy findings someone occasionally exports to a spreadsheet
- Little integration into CI/CD or platform templates
Impact:
- Material AWS bill from these tools and their underlying data stores.
- Engineering teams ignore alerts due to baseline noise.
Security-wise, you’ve just bought findings you won’t act on.
Practical playbook (what to do in the next 7 days)
Assuming you run on AWS with some serverless footprint (Lambda, API Gateway, SQS, SNS, EventBridge, DynamoDB, S3):
Day 1–2: Establish where your real risk sits
- Inventory critical serverless workloads
- APIs handling auth, payments, PII
- High-volume event pipelines
- Cross-account shared infrastructure (buses, buckets, KMS keys)
- For each, list:
- Entry points (public/private APIs, events, queues)
- IAM roles involved (execution roles, CI roles, cross-account roles)
- Quick IAM blast radius review
  - For the top 10 most critical Lambda functions, inspect the execution role:
    - Look for `*` actions (`"Action": "*"`) or wide wildcards (`"dynamodb:*"`, `"s3:*"`) on sensitive resources.
    - Check whether the role can modify IAM, KMS, CloudTrail, S3, or parameter stores.
- If you find obviously over-broad roles, document them; don’t fix yet. You want a plan, not ad hoc edits.
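The "can it modify IAM, KMS, CloudTrail, S3, or parameter stores?" question can be triaged mechanically. A sketch (not exhaustive; it checks action prefixes only, not resource scoping or Deny statements):

```python
# Sketch: which security-critical services can a policy's Allow statements reach?
SENSITIVE_PREFIXES = ("iam:", "kms:", "cloudtrail:", "s3:", "ssm:", "secretsmanager:")

def touches_sensitive_services(policy: dict) -> set[str]:
    """Return the sensitive service prefixes reachable via Allow statements."""
    hits = set()
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        for action in actions:
            if action == "*":  # full admin reaches everything
                hits.update(p.rstrip(":") for p in SENSITIVE_PREFIXES)
            elif action.lower().startswith(SENSITIVE_PREFIXES):
                hits.add(action.split(":", 1)[0].lower())
    return hits
```

Running this over your top-10 execution roles (policy documents fetched via the IAM APIs or your IaC state) gives you a ranked "who can hurt us most" list for the plan you'll execute on days 3–4.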
Day 3–4: Lock the identity plane down one step
- Define a hard “no” list for Lambda roles
  - Unless absolutely necessary, Lambda execution roles should NOT have:
    - `iam:*`
    - `kms:*` (beyond specific key usage)
    - `s3:*` on all buckets
    - `sts:AssumeRole` into more-privileged accounts
- Apply minimal scope to 2–3 critical Lambdas
  - Rewrite their IAM policies to:
    - Use resource-level constraints (specific buckets, tables, queues).
    - Prefer explicit action lists (e.g., `dynamodb:GetItem`, `dynamodb:PutItem`) over `dynamodb:*`.
  - Validate with integration tests (or targeted manual tests) to avoid breaking prod.
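The target shape of such a rewrite, sketched with a hypothetical table name and placeholder account ID: explicit actions and a concrete ARN instead of `dynamodb:*` on `Resource: "*"`.

```python
# Illustrative least-privilege statement; names and IDs are placeholders.
scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        # Only the operations the handler actually performs:
        "Action": ["dynamodb:GetItem", "dynamodb:PutItem"],
        # Only the one table it needs:
        "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/orders",
    }],
}
```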
- Add missing resource policy conditions on cross-account resources
  - For S3, SQS, EventBridge, and KMS keys used cross-account:
    - Ensure both `Principal` and `Condition` (`aws:SourceArn`, `aws:SourceAccount`) are locked down.
    - Avoid “allow org” wildcards without checking that’s truly desired.
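That checklist item can be encoded as a review helper. A sketch that handles the common statement shapes only (it will not catch every wildcard variant, e.g. list-valued principals):

```python
# Sketch: does a cross-account statement pin both the principal and a source condition?
def has_locked_down_access(stmt: dict) -> bool:
    """True only if Principal is non-wildcard AND a SourceArn/SourceAccount condition exists."""
    principal = stmt.get("Principal")
    if principal in (None, "*") or principal == {"AWS": "*"}:
        return False
    condition = stmt.get("Condition", {})
    # Condition is {operator: {key: value}}; collect the keys, case-insensitively.
    keys = {k.lower() for operator in condition.values() for k in operator}
    return bool({"aws:sourcearn", "aws:sourceaccount"} & keys)
```

Run it over the statements of every cross-account bucket, queue, bus, and key policy you inventoried on days 1–2; anything returning `False` goes on the fix list.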
Day 5: Get observability to minimum viable incident-response
- Turn on / verify key logs
- CloudTrail org-level, multi-region for management events.
- API Gateway access logs for public APIs with at least: status code, caller IP, latency, request path.
- Lambda logs with sanitized payloads (remove tokens, PII).
- Set coarse but meaningful alerts
  - Use CloudWatch alarms or existing tooling to alert on:
- Sudden spike in API 5xx for critical APIs
- Sudden spike in Lambda errors for security-sensitive functions
- CloudTrail events:
- IAM policy changes
- KMS key policy changes
- CloudTrail modifications
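The tripwire list can live as data your alerting pipeline checks incoming CloudTrail events against. A sketch with a starting set of real CloudTrail event names (a starting point, not a complete catalog):

```python
# Sketch: CloudTrail event names that should page someone immediately.
SECURITY_EVENTS = {
    # IAM policy changes
    "PutRolePolicy", "AttachRolePolicy", "CreatePolicyVersion",
    # KMS key policy changes and key destruction
    "PutKeyPolicy", "ScheduleKeyDeletion",
    # CloudTrail tampering
    "StopLogging", "DeleteTrail", "UpdateTrail",
}

def is_tripwire(cloudtrail_event: dict) -> bool:
    """True if this CloudTrail record matches the tripwire list."""
    return cloudtrail_event.get("eventName") in SECURITY_EVENTS
```

In practice you would express the same set as an EventBridge rule pattern or CloudWatch Logs metric filter rather than a Lambda; the point is that the list is explicit, reviewed, and small enough that every hit gets investigated.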
You’re not building a full SIEM in a week; you’re establishing tripwires.
Day 6: Fix the worst logging liabilities
- Search for PII or secrets in logs:
  - Spot-check Lambda handlers for `console.log`/`print` of full payloads.
  - Identify services where request/response bodies are routinely logged.
- For the worst offenders:
- Remove or redact sensitive fields from logs.
- Add basic logging guidelines to your internal platform templates / coding standards.
Day 7: Align platform engineering with security
- If you have a platform team, run a 60–90 minute working session with security:
- Review your “golden path” templates (CDK constructs, Terraform modules, starter repos).
- Ensure defaults include:
- Encrypted data stores with customer-managed KMS keys where warranted
- Least-privilege IAM roles baked into templates
- Logging enabled by default and reasonably structured
- Sensible API Gateway + Lambda auth patterns (Cognito, JWT validation, custom authorizers, or ALB auth)
- Decide on one short feedback loop:
- E.g., security reviews changes to platform modules and provides threat models, not individual app stacks.
- Over time, this will have more impact than chasing every single misconfigured Lambda.
Bottom line
On AWS serverless, the real security boundary is:
- IAM principals and policies
- Resource policies and their conditions
- The pathways data can take between managed services
- The observability you have when something goes wrong
VPCs, subnets, and security groups still matter, but they’re no longer the primary defense.
If you:
- Treat every Lambda execution role like a root shell on your account,
- Make cross-account and public entry points explicit and tightly constrained,
- Design logs and metrics as incident-response tools, not afterthoughts,
- And push these rules into platform-level defaults,
you’ll be meaningfully safer than most AWS users, without doubling your cloud bill or your cognitive load.
The hard part isn’t more tooling or more features. It’s accepting that your “servers” are now policies, identities, and events—and engineering them with the same discipline you used to apply to machines.
