Designing Systems That Assume You’ll Be Breached


Why this matters this week

If you run anything non-trivial in the cloud, you are already running a security program—even if you pretend you’re not.

This week’s driver isn’t a single headline; it’s a convergence:

  • Cloud providers keep tightening defaults (more identity controls, stricter token lifetimes, posture checks), which break fragile infra.
  • Software supply chain incidents are no longer “big vendor only” problems; smaller SaaS vendors are getting hit and becoming pivot points into larger orgs.
  • Regulators and large customers are quietly updating contracts: incident response, SBOMs, and minimum security baselines are becoming go/no-go for deals.

The pattern: teams that bolted security on later are now burning quarters untangling IAM, secrets, and incident response. Teams that designed for compromise early are moving fast with fewer explosions.

“Cybersecurity by design” is not a slogan here. It’s about structuring:

  • Identity and access
  • Secrets and machine credentials
  • Cloud security posture
  • Software supply chain
  • Incident readiness

…so that the system assumes components will fail or be breached, and fails in contained ways.

What’s actually changed (not the press release)

A few concrete shifts that materially affect how you should design systems:

  1. Identity is now the real perimeter

    • Workloads talk to workloads more than humans talk to apps.
    • Machine identities (service principals, workload identities, IAM roles, SPIFFE IDs) are now more powerful and more numerous than human accounts.
    • Attackers go after CI/CD, build agents, and cloud roles because that gives broad lateral movement.
  2. Secrets are moving from “static text files” to “short-lived credentials”

    • Cloud-native auth (STS tokens, workload identity, OIDC federation) is replacing:
      • Static API keys in env vars
      • Long-lived access keys in CI
    • This reduces blast radius but breaks older patterns (e.g., manually copying keys into GitHub Actions).
  3. Cloud misconfiguration is the mainstream incident vector

    • Real breach patterns in the last 12–18 months:
      • Over-permissive IAM roles that can assume other roles.
      • Public buckets or storage accounts with “temporary debug” exceptions that become permanent.
      • Security tooling with “god mode” credentials stored poorly.
    • Most of these are known-bad patterns, but teams don’t have feedback loops (no posture scanning or enforcement).
  4. Software supply chain is no longer optional hygiene

    • Package repos keep shipping compromised packages and typosquats.
    • Build pipelines get hit by:
      • Malicious plugins
      • Compromised build runners
      • Build artifact tampering
    • Customers are starting to ask:
      • “How do you sign releases?”
      • “Do you verify dependencies?”
      • “Can you revoke compromised artifacts?”
  5. Incident response is shifting from “best effort” to “measurable capability”

    • It’s no longer enough to have a PDF playbook.
    • Evidence required in bigger deals / audits:
      • Mean time to detect (MTTD) and mean time to respond (MTTR) from the last real incident.
      • Proof you can rotate credentials and revoke access within hours.
      • Logs that actually exist, are queryable, and are retention-appropriate.
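
The short-lived-credential shift in point 2 is easiest to see in code. Here is a minimal Python sketch (all names and TTLs are illustrative, not any provider's API) of why a stolen 15-minute token is so much less useful to an attacker than a static key:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Credential:
    principal: str      # which workload identity this belongs to
    scopes: tuple       # what it is allowed to do
    issued_at: datetime
    ttl: timedelta      # short-lived: minutes, not months

    def is_valid(self, now: datetime) -> bool:
        # A static key has no expiry; a short-lived token dies on its own.
        return now < self.issued_at + self.ttl

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
token = Credential("ci-deploy@prod", ("s3:PutObject",), now, timedelta(minutes=15))

print(token.is_valid(now + timedelta(minutes=5)))  # still usable by the pipeline
print(token.is_valid(now + timedelta(hours=2)))    # exfiltrated later: useless
```

A leaked static key stays valid until someone notices and rotates it; a leaked short-lived token expires before most attackers can use it, which is the blast-radius reduction in practice.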

How it works (simple mental model)

Use this as a design mental model:

Assume every boundary eventually fails; design for controlled blast radius and fast revocation.

Break it into five layers:

  1. Identity: Who/what is allowed to do what, where

    • Humans: SSO + MFA + least privilege; everything else is a break-glass exception.
    • Machines: Workload identities that are:
      • Tied to environment and workload
      • Short-lived tokens
      • Scoped permissions per service

    Treat each identity as a crypto-backed capability: if it’s stolen, what can the attacker do before you notice and revoke?

  2. Secrets: Where long-lived power lives (and how you minimize it)

    • Ideal: secrets only exist in:
      • A secrets manager (KMS-backed)
      • Memory of a process that just fetched them
    • Prefer:
      • Short-lived tokens over static keys
      • KMS-encrypted data over app-managed crypto
    • The design goal: no plaintext secrets in repos, CI variables, or local config files.
  3. Cloud security posture: The global configuration surface

    • Think of this as “global tuning knobs”:
      • Org policies (no public buckets, no unencrypted disks, no overly-broad roles)
      • Logging and audit defaults
      • Network egress controls
    • You want policies as code (e.g., Terraform + policy engine) and continuous posture scanning.
  4. Supply chain: Trust boundaries in your build + dependency graph

    • Every tool that can:
      • Write to your repos
      • Modify your builds
      • Publish artifacts
      …is effectively part of your trust boundary.
    • Use signature + verification:
      • You sign what you build.
      • Runtimes / deploy steps verify signature + provenance.
  5. Incident response: The “oh no” layer

    • Can you answer quickly:
      • What was accessed?
      • Which identities were abused?
      • Which environments are impacted?
    • Can you:
      • Kill sessions and rotate keys fast?
      • Contain environments (e.g., isolate a VPC or namespace)?
    • This only works if logging, tracing, and identity mapping were designed up front.
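
The identity layer above can be reduced to a plain allow-list check. This toy Python model (identities, environments, and grants are all invented) shows what "scoped permissions per service" means in practice:

```python
# Minimal capability check: each machine identity gets an explicit
# allow-list of (environment, action) pairs; everything else is denied.
GRANTS = {
    "ci-build":  {("staging", "deploy")},
    "ci-deploy": {("staging", "deploy"), ("prod", "deploy")},
}

def allowed(identity: str, environment: str, action: str) -> bool:
    # Deny by default: an unknown identity gets an empty grant set.
    return (environment, action) in GRANTS.get(identity, set())

print(allowed("ci-build", "prod", "deploy"))   # False: blast radius contained
print(allowed("ci-deploy", "prod", "deploy"))  # True: explicitly granted
```

The design choice that matters is the default: if `ci-build` is compromised, the attacker gets staging and nothing else, because prod access was never implicitly inherited.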

Where teams get burned (failure modes + anti-patterns)

Some anonymized patterns that recur:

  1. “Single CI key to rule them all”

    • Pattern:
      • One CI system has a long-lived cloud key with admin or near-admin access.
      • Key stored as a CI secret; never rotated because “it would break everything”.
    • Failure:
      • CI worker compromised (supply chain or credential leak).
      • Attacker uses key to enumerate and pivot across all environments.
    • Fix pattern:
      • Move to short-lived, scoped credentials per pipeline and per environment.
      • Use OIDC / workload identity, not static keys.
  2. “Temporary public access” that never goes away

    • Pattern:
      • Engineer opens a storage bucket or internal endpoint “just for debugging”.
      • No central view of public exposures; it remains open for months.
    • Failure:
      • Opportunistic scanning finds exposed data or a control plane endpoint.
    • Fix pattern:
      • Org-wide guardrails: disallow public by default; exceptions require tagged, time-bound justification plus alerting.
      • Automated scans for public resources with notifications to owners.
  3. “Everything is allowed in dev”

    • Pattern:
      • Dev environment has:
        • Shared admin accounts
        • Full production data copies
        • Open ingress from the internet
      • Justification: “it’s not production”.
    • Failure:
      • Attacker lands in dev (weaker protections) and rides shared access to prod data.
    • Fix pattern:
      • Dev/stage get fewer privileges and no live PII.
      • Access separation enforced via IAM + network; no shared credentials.
  4. “Incident response by Slack thread”

    • Pattern:
      • No pre-defined incident commander, no checklists.
      • During an event, engineers improvise via chat.
    • Failure:
      • Actions conflict (e.g., one person rotates keys mid-investigation).
      • No clear evidence trail; postmortem is guesswork.
    • Fix pattern:
      • Lightweight but explicit incident runbook, on-call rotation, and logging of decisions (even in a shared doc).
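
The fix pattern for "temporary public access" comes down to a scheduled scan like the Python sketch below. The inventory rows are invented; a real implementation would pull them from your cloud provider's APIs:

```python
from datetime import date

# Hypothetical inventory rows: (resource, is_public, exception_expires)
resources = [
    ("logs-bucket",   False, None),
    ("debug-bucket",  True,  date(2024, 3, 1)),  # tagged, time-bound exception
    ("legacy-bucket", True,  None),              # public, no justification
]

def violations(inventory, today):
    """Flag anything public without a current, time-bound exception."""
    for name, is_public, expires in inventory:
        if is_public and (expires is None or expires < today):
            yield name

print(list(violations(resources, date(2024, 2, 1))))  # ['legacy-bucket']
print(list(violations(resources, date(2024, 4, 1))))  # expired exception now flagged too
```

The key property: an exception is allowed to exist, but it carries an expiry, so "temporary" access that outlives its justification shows up automatically instead of quietly becoming permanent.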

Practical playbook (what to do in the next 7 days)

You can’t “secure by design” retroactively in a week, but you can establish direction and guardrails.

Day 1–2: Visibility and inventory

  1. Identity inventory (humans + machines)

    • List:
      • All cloud IAM users, roles, service accounts.
      • All CI/CD identities and what they can access.
    • Flag obvious risks:
      • Admin access tied to individuals instead of roles.
      • Long-lived access keys older than 90 days.
      • CI roles with wildcard permissions (e.g., *:*).
  2. Secrets inventory

    • Where do secrets live today?
      • Git repos (config, scripts)
      • CI variables
      • Local .env files
      • In-app config files
    • Run a secret scanner across main repos to confirm your intuition (you’ll be wrong).
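
A secret scan can start as small as a couple of regexes. The Python sketch below only illustrates the idea; real scanners such as gitleaks or trufflehog ship hundreds of tuned rules and should be used for actual coverage:

```python
import re

# Two common long-lived-secret shapes -- illustrative only.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_api_key":   re.compile(r"""(?i)api[_-]?key\s*[:=]\s*['"][A-Za-z0-9]{20,}['"]"""),
}

def scan(text: str):
    """Return the names of all patterns that match anywhere in the text."""
    return [name for name, rx in PATTERNS.items() if rx.search(text)]

sample = 'aws_key = "AKIAABCDEFGHIJKLMNOP"\napi_key: "abcd1234abcd1234abcd1234"'
print(scan(sample))   # both patterns hit
print(scan("print('hello')"))  # clean file: nothing flagged
```

Run something like this (or the real tools) across your main repos and CI variable exports; the point of the exercise is to replace intuition with a concrete list of findings.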

Day 3–4: Quick guardrails

  1. Enforce minimal identity hardening

    • For humans:
      • Turn on mandatory MFA for SSO and cloud console.
      • Remove direct IAM users where feasible; use SSO → role assumption.
    • For machines:
      • For at least one core pipeline, replace a static key with:
        • OIDC → cloud role assumption (AWS/GCP/Azure) or
        • A short-lived token from your identity provider.
  2. Centralize new secrets creation

    • Pick one secrets manager (cloud-native or HashiCorp Vault–style).
    • New secrets from this week onward:
      • Must be created in the manager.
      • Referenced by name/path only; no plaintext in code.
    • You’re not fixing all existing secrets; you’re stopping the bleeding.
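
The "referenced by name/path only" rule looks like this in application code. The `SecretsManager` class below is a hypothetical stand-in for whatever cloud-native or Vault-style client you pick, not a real SDK:

```python
class SecretsManager:
    """Stand-in for a cloud secrets manager / Vault client (hypothetical)."""
    def __init__(self, backing_store: dict):
        self._store = backing_store

    def get(self, path: str) -> str:
        if path not in self._store:
            raise KeyError(f"no secret at {path}")
        return self._store[path]

# In reality the manager holds the value; only the reference is in code.
manager = SecretsManager({"prod/db/password": "s3cr3t"})

DB_PASSWORD_REF = "prod/db/password"      # safe to commit: it is only a name
password = manager.get(DB_PASSWORD_REF)   # plaintext exists only in memory, at runtime
```

The repo, CI config, and local files now contain only the path string; rotating the secret is a change in the manager, with no code or config touch required.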

Day 5: Baseline cloud security posture

  1. Turn on baseline posture checks

    • In your main cloud account(s):
      • Enable the built-in security posture service (e.g., AWS Security Hub, Microsoft Defender for Cloud, or Google Security Command Center).
      • Focus on:
        • Public storage or databases
        • Unencrypted volumes
        • IAM roles with admin/wildcard
    • Sort findings into two triage buckets:
      • “Fix now” = trivial + high impact (e.g., test bucket that is public).
      • “Fix later” = needs design work (e.g., re-architect network).
  2. Restrict the most dangerous egress

    • Identify:
      • Places where workloads have unrestricted outbound internet.
    • For at least one critical environment:
      • Introduce egress controls (egress gateway, firewall rules, or service policies).
      • Allowlist only the destinations that environment truly needs (artifact repo, OS updates, etc.).
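
The two-bucket triage above is mechanical once each finding carries impact and effort fields. A Python sketch (finding IDs and field values are invented):

```python
# Triage posture findings into the playbook's two buckets:
# "fix now" = trivial effort + high impact; everything else = "fix later".
findings = [
    {"id": "public-test-bucket",  "impact": "high", "effort": "trivial"},
    {"id": "flat-network-design", "impact": "high", "effort": "redesign"},
    {"id": "unencrypted-volume",  "impact": "med",  "effort": "trivial"},
]

fix_now = [f["id"] for f in findings
           if f["impact"] == "high" and f["effort"] == "trivial"]
fix_later = [f["id"] for f in findings if f["id"] not in fix_now]

print(fix_now)    # quick wins to burn down this week
print(fix_later)  # needs design work; schedule, don't ignore
```

The value is less in the code than in forcing every finding to be classified: nothing sits in an untriaged backlog where "fix later" silently means "never".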

Day 6: Supply chain minimum viable controls

  1. Lock down CI/CD trust boundaries

    • For CI:
      • Ensure runners that deploy to production are:
        • Dedicated (not shared with public or untrusted projects).
        • Patched regularly.
      • Limit which repos/pipelines can deploy to each environment.
    • For dependencies:
      • Turn on dependency scanning in at least one primary language ecosystem.
      • Block or alert on:
        • Dependencies from personal forks when not approved
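
Blocking dependencies from unapproved personal forks is a simple origin check in the pipeline. This Python sketch assumes a hypothetical allow-list of approved forks and an invented dependency inventory:

```python
# Hypothetical allow-list of forks your team has explicitly reviewed.
APPROVED_FORKS = {"github.com/acme-org/patched-lib"}

deps = [
    ("requests",    "pypi.org"),                          # registry: fine
    ("patched-lib", "github.com/acme-org/patched-lib"),   # approved fork
    ("left-pad2",   "github.com/some-user/left-pad2"),    # personal fork
]

def unapproved_forks(dependencies):
    """Flag dependencies resolved from a fork that isn't on the allow-list."""
    flagged = []
    for name, origin in dependencies:
        is_fork = origin.startswith("github.com/") and origin not in APPROVED_FORKS
        if is_fork:
            flagged.append(name)
    return flagged

print(unapproved_forks(deps))  # ['left-pad2']
```

Wire a check like this (or your ecosystem's dependency-scanning feature) into CI so the alert fires at pull-request time, before the fork ever reaches a build that deploys.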
