Stop Treating AI as a Junior Dev: Design It as Infrastructure

Why this matters this week
The conversation in engineering orgs has quietly shifted from:
“Should we use AI for code?”
to
“Why are our tests flaky and our infra bill up 20% since we turned this on?”
If you’re running a team that ships production systems, the real questions aren’t about model benchmarks. They’re about:
- How does AI code generation affect defect rates, on-call, and MTTR?
- Does test generation actually increase confidence, or just add noise?
- What changes in your SDLC keep you from shipping risky auto-generated code to prod?
- How do you control cost and latency when AI is in the critical path?
This week, multiple large shops I talk to independently hit the same wall: they “won” on AI adoption (everyone uses a code assistant now) but:
- Regression volume increased.
- Test suites got slower and less reliable.
- PR review quality declined because humans implicitly trusted “AI wrote it.”
The pattern is clear: treating AI as a productivity perk instead of as production infrastructure creates silent failure modes. The interesting work right now is not “try AI” — it’s turning AI-assisted engineering into something you can reason about, measure, and roll back.
What’s actually changed (not the press release)
Three real shifts are hitting software teams that operate at scale:
1. The unit of work is moving from “line of code” to “semantic change”
Engineers using codegen tools are:
- Touching more files per change.
- Making wider refactors (because it feels cheap).
- Generating more boilerplate (e.g., DTOs, mappings, wrappers).
The outcome: the surface area of each change grows, but review discipline often stays calibrated to pre-AI change sizes.
2. Tests are now generated, not designed
AI test generation does a few things well:
- Fills in obvious cases (happy path, simple edge cases).
- Hits “coverage” thresholds quickly.
But it tends to:
- Not encode invariants or business rules precisely.
- Overfit to current implementation instead of spec.
- Produce brittle tests tied to specific error messages or internal APIs.
Result: coverage goes up, true confidence does not.
3. SDLC steps are getting silently bypassed
Common patterns:
- Devs accept AI suggestions that update infra/config (e.g., Terraform, Kubernetes, IAM) and rely on shallow review.
- PR reviewers skim, assuming “the hard part was already done by the model.”
- “Minor” AI-suggested changes (logging, metrics, small refactors) ship without appropriate testing.
The net effect is policy drift: your process on paper (strict reviews, infra guardrails) diverges from what actually happens in day-to-day coding.
How it works (simple mental model)
A useful way to reason about AI in software engineering is:
AI is a probabilistic refactoring engine plugged into a deterministic system.
Break that down:
1. Probabilistic generator
- Given context (code, comments, diffs), it predicts plausible next code tokens.
- It does not know:
- Your production SLAs.
- Historical outage patterns.
- The political cost of breaking the CFO’s dashboard.
- It optimizes for local plausibility, not global system safety or cost.
2. Deterministic enforcement
- Your CI, static analysis, type system, tests, and deployment gates are deterministic.
- They’re supposed to filter bad probabilistic outputs.
- But most orgs’ current guardrails were tuned for human-originated changes, which:
- Are smaller.
- Contain more implicit domain knowledge.
- Arrive at lower volume.
3. Feedback loop
- AI speeds up “attempt rate” (more candidate changes per unit time).
- If your filters (tests, reviews, gates) aren’t proportionally stronger:
- More low-quality changes slip through.
- Or your CI becomes a bottleneck (timeouts, queueing, flaky runs).
- Either way, your system dynamics change: you’ve increased mutation rate without adjusting immune system strength.
So, mentally:
- AI is not “another engineer.”
- It’s more like a very powerful macro system that:
- Has strong priors from public code.
- Ignores your local incident history.
- Produces “reasonable-looking” changes at high speed.
If your SDLC treats it like a coworker instead of a subsystem, you get unpredictable behavior.
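The feedback-loop point can be made concrete with a toy model: hold the filter's catch rate fixed, increase the attempt rate, and the absolute number of escaped defects grows in proportion. All the rates below are illustrative assumptions, not measurements.

```python
# Toy model: escaped defects = attempts * defect_rate * (1 - filter_catch_rate).
# All rates are illustrative assumptions, not measured values.

def escaped_defects(attempts_per_week: int,
                    defect_rate: float,
                    filter_catch_rate: float) -> float:
    """Expected defective changes that slip past CI/review per week."""
    return attempts_per_week * defect_rate * (1 - filter_catch_rate)

# Pre-AI baseline: 50 changes/week, 10% defective, filters catch 90%.
baseline = escaped_defects(50, 0.10, 0.90)

# Post-AI: attempt rate triples, per-change defect rate unchanged,
# but the filters were never retuned.
with_ai = escaped_defects(150, 0.10, 0.90)

print(f"baseline escapes/week: {baseline:.1f}")  # 0.5
print(f"with AI escapes/week:  {with_ai:.1f}")   # 1.5
```

Even if the model makes changes no worse on average, tripling the mutation rate triples the escape count unless the filters get proportionally stronger.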
Where teams get burned (failure modes + anti-patterns)
1. “Coverage LARPing”: High coverage, low assurance
Pattern:
- Turn on AI test-gen.
- Coverage jumps from 65% → 85%.
- Leadership feels great. On-call does not.
What actually happens:
- Generated tests assert on superficial behavior (exact strings, current data formats).
- No property-based or invariant tests are added.
- Tests rarely capture negative business logic (“must never charge card before X”).
Symptoms:
- Minor refactors break dozens of tests.
- Security or data integrity bugs slip through despite “great coverage.”
Anti-pattern: Treating coverage % as a safety metric instead of a signal for missing scenarios.
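To see the difference, compare a generated-style test that overfits to an exact error string with one that encodes the invariant. The `apply_discount` function and its rules below are hypothetical, purely for illustration:

```python
# Hypothetical pricing helper (an assumption, purely for illustration).
def apply_discount(price_cents: int, percent: int) -> int:
    if not 0 <= percent <= 100:
        raise ValueError(f"invalid discount percent: {percent}")
    return price_cents * (100 - percent) // 100

# Generated-style test: overfits to the exact error message,
# so it breaks the moment anyone rewords the exception.
def test_brittle():
    try:
        apply_discount(1000, 150)
    except ValueError as e:
        assert str(e) == "invalid discount percent: 150"

# Invariant-style test: encodes the business rule, survives refactors.
def test_invariants():
    for price in (0, 1, 999, 10_000):
        for pct in (0, 1, 50, 99, 100):
            discounted = apply_discount(price, pct)
            assert 0 <= discounted <= price  # never negative, never a markup
    assert apply_discount(1000, 100) == 0    # full discount means free
```

Both tests count identically toward coverage; only the second one tells you anything when the implementation changes.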
2. “Hidden infra edits” in mixed diffs
Pattern:
- An engineer asks the assistant to “wire this feature end-to-end.”
- The model:
  - Edits app code.
  - Modifies CI/workflow YAML.
  - Tweaks Dockerfiles or deployment manifests.
- All in a single big PR.
Where it blows up:
- The infra changes get superficial review or none.
- A subtle IAM or network rule change goes live.
- Incident shows up as “random infra failure” days later.
Anti-pattern: Allowing multi-scope AI changes (code + infra + policies) in one review unit.
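One cheap guardrail is a CI check that classifies changed paths by scope and fails any PR that touches more than one. A minimal sketch; the path prefixes are assumptions about repo layout:

```python
from pathlib import PurePosixPath

# Illustrative scope map; adjust the prefixes to your repo layout.
SCOPES = {
    "infra": ("terraform/", "helm/", ".github/workflows/", "Dockerfile"),
    "policy": ("iam/", "rbac/"),
}

def classify(path: str) -> str:
    """Map a changed file to a scope; anything unmatched is app code."""
    name = PurePosixPath(path).name
    for scope, prefixes in SCOPES.items():
        if any(path.startswith(p) or name == p for p in prefixes):
            return scope
    return "app"

def touched_scopes(changed_files: list[str]) -> set[str]:
    """Scopes touched by a PR; a CI gate would fail when len() > 1."""
    return {classify(f) for f in changed_files}

# Mixed PR: app code plus a workflow edit, so it should be split in two.
print(sorted(touched_scopes(["src/api/billing.py",
                             ".github/workflows/deploy.yml"])))  # ['app', 'infra']
```

Running this against the PR's file list in CI makes the “one scope per review unit” policy mechanical rather than a norm reviewers have to remember.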
3. “Rubber-stamp reviews” driven by misplaced trust
Pattern:
- Reviewer sees AI-generated PR description plus polished diff.
- Reviewer assumes “basic stuff is probably correct.”
- Focus is on style and small comments, not system impact.
Symptoms:
- Obvious boundary violations (time zones, currencies, PII handling) slip through.
- Teams overestimate test quality and under-invest in exploratory or chaos testing.
Anti-pattern: Treating AI provenance as a quality signal instead of a risk flag that requires more scrutiny.
4. Silent cost and latency creep
Pattern:
- Teams add AI-based tools into:
  - CI checks (linting, review bots, test suggester).
  - Preview environments.
  - Dev workflows that call external APIs.
Where it hurts:
- CI times increase by 2–5 minutes.
- AI API costs scale with PR volume and test cycles.
- Nobody has a clear owner for “AI infra cost.”
Symptoms:
- Engineers quietly turn off CI checks locally.
- Teams batch PRs to avoid slow pipelines, increasing blast radius.
Anti-pattern: Adding non-essential AI calls in the critical path of every commit or test run.
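The fix is structural: wrap non-essential AI calls in a hard timeout and skip them rather than block the pipeline. A sketch, with the hypothetical `slow_ai_review` standing in for a real network call:

```python
import time
import concurrent.futures
from typing import Optional

def slow_ai_review(diff: str) -> str:
    """Stand-in for a network call to an AI review service (hypothetical)."""
    time.sleep(0.5)
    return "no issues found"

def optional_ai_check(diff: str, timeout_s: float = 2.0) -> Optional[str]:
    """Run the AI check, but skip it instead of blocking the pipeline."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_ai_review, diff)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            return None  # CI proceeds without the AI result

result = optional_ai_check("diff --git ...", timeout_s=0.1)
print("skipped AI check" if result is None else result)
```

The key design choice is the default: a slow or down AI service degrades to “no extra signal,” not to a red pipeline everyone learns to bypass.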
Practical playbook (what to do in the next 7 days)
This is a one-week, low-drama plan to make AI-assisted engineering less risky and more measurable.
Day 1–2: Instrument and baseline
- Measure three things now (before rolling out further):
- Average PR size (lines changed, files touched) vs 3 months ago.
- CI duration and failure rate (by job and by cause).
- Bug/regression rate attributable to:
- Logic errors.
- Integration issues.
- Configuration/infra changes.
- Tag AI touchpoints
- Decide how you’ll mark AI-affected changes:
- Label on PRs (“ai-assisted”).
- Flag in commit messages (simple prefix).
- The goal: later correlate bug density and revert rate with AI involvement.
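Once PRs carry the label, the first correlation pass is a few lines. The PR records below are made up for illustration; in practice you would pull them from your VCS or tracker API:

```python
from collections import defaultdict

# Hypothetical PR records; in practice, fetch these from your VCS/tracker API.
prs = [
    {"labels": ["ai-assisted"], "reverted": True},
    {"labels": ["ai-assisted"], "reverted": False},
    {"labels": ["ai-assisted"], "reverted": False},
    {"labels": [], "reverted": False},
    {"labels": [], "reverted": False},
    {"labels": [], "reverted": True},
    {"labels": [], "reverted": False},
    {"labels": [], "reverted": False},
]

def revert_rate_by_ai_label(prs: list[dict]) -> dict[str, float]:
    """Revert rate split by whether the PR was tagged as AI-assisted."""
    counts = defaultdict(lambda: {"total": 0, "reverted": 0})
    for pr in prs:
        key = "ai-assisted" if "ai-assisted" in pr["labels"] else "human-only"
        counts[key]["total"] += 1
        counts[key]["reverted"] += pr["reverted"]
    return {k: v["reverted"] / v["total"] for k, v in counts.items()}

print(revert_rate_by_ai_label(prs))
```

The point is not statistical rigor on day two; it is having any per-cohort number at all before the next rollout decision.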
- Disable AI for sensitive scopes (temporarily)
- Define zones where AI must not auto-edit without additional review:
- Secrets management, auth, encryption, billing logic, data retention.
- Enforce via:
- Codeowners.
- Path-based checks in CI.
- Repo or IDE configuration if available.
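If your platform cannot express this natively, a small path-based gate in CI gets you most of the way. The glob patterns and the two-approval rule here are assumptions to adapt:

```python
import fnmatch

# Paths where AI edits need extra sign-off; globs are assumptions to adapt.
SENSITIVE_GLOBS = ["src/auth/*", "src/billing/*", "config/secrets/*"]

def sensitive_files(changed_files: list[str]) -> list[str]:
    """Changed files that fall inside a sensitive zone."""
    return [f for f in changed_files
            if any(fnmatch.fnmatch(f, g) for g in SENSITIVE_GLOBS)]

def gate(changed_files: list[str], pr_labels: list[str], approvals: int) -> bool:
    """Pass only when sensitive edits on an AI-assisted PR have 2+ approvals."""
    if sensitive_files(changed_files) and "ai-assisted" in pr_labels:
        return approvals >= 2
    return True

print(gate(["src/auth/token.py"], ["ai-assisted"], approvals=1))  # False
```

CODEOWNERS covers the “who reviews” half; a check like this covers the “how many, given AI involvement” half.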
Day 3–4: Harden your SDLC interfaces
- Split changes: code vs infra vs policy
Policy: no mixed-scope AI-generated PRs. Enforce:
- One PR for app logic.
- Separate PR for infra/config (Terraform, Helm, GitHub Actions, etc.).
- Another for policy (RBAC, firewall, organization-wide settings).
This lets you:
- Apply different reviewers and test strategies.
- Roll back surgically when things go wrong.
- Reframe test generation as “test stubs,” not full coverage
For AI-generated tests:
- Require human-written assertions for:
- Security invariants (authZ, data access).
- Monetary calculations and rounding.
- Data lifecycle constraints (retention, deletion).
- Only count tests as “critical coverage” if:
- Explicit invariants are visible.
- Assertions are meaningful beyond status codes and string matches.
- Adjust code review norms
Add a rule of thumb for reviewers:
- For AI-assisted diffs:
- Spend less time on variable naming.
- Spend more time on:
- Boundary conditions (time, locale, concurrency, partial failures).
- Data movement (PII flows, encryption boundaries).
- Cross-service contracts (backwards compatibility, schema evolution).
If you can tweak tools:
- Surface the risk hot spots in review UI: config changes, public API changes, data access changes.
Day 5–7: Tighten feedback loops and cost control
- Introduce canary patterns for AI-generated logic
For higher-risk code (new services, critical flows):
- Ship with:
- Feature flags or environment flags.
- Request-level sampling for:
- Additional logging.
- Shadow traffic checks against prior behavior (where feasible).
- Define explicit rollback conditions:
- Error rate thresholds.
- Latency regression thresholds.
- Business metric anomalies (conversion, refunds, auth failures).
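Rollback conditions are easiest to enforce when they are code rather than a runbook paragraph. A sketch with illustrative thresholds:

```python
from dataclasses import dataclass

# Illustrative thresholds; tune per service and per business metric.
@dataclass
class RollbackPolicy:
    max_error_rate: float = 0.02          # more than 2% of requests failing
    max_latency_regression: float = 0.20  # p95 up more than 20% vs baseline
    max_refund_spike: float = 0.50        # refunds up more than 50% vs baseline

def should_rollback(policy: RollbackPolicy,
                    error_rate: float,
                    p95_now_ms: float,
                    p95_baseline_ms: float,
                    refund_ratio_vs_baseline: float) -> bool:
    """True if any rollback condition is breached."""
    if error_rate > policy.max_error_rate:
        return True
    if (p95_now_ms - p95_baseline_ms) / p95_baseline_ms > policy.max_latency_regression:
        return True
    if refund_ratio_vs_baseline - 1.0 > policy.max_refund_spike:
        return True
    return False

policy = RollbackPolicy()
# Errors and refunds are fine, but p95 went from 220ms to 300ms (+36%).
print(should_rollback(policy, error_rate=0.01,
                      p95_now_ms=300, p95_baseline_ms=220,
                      refund_ratio_vs_baseline=1.1))  # True
```

Wire a function like this into your canary controller and the flag flips itself, instead of waiting for someone to notice a dashboard.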
- Cap and monitor AI cost and latency
For AI-based tools in your SDLC:
- Introduce:
- Rate limits per user and per pipeline.
- Timeouts: if a suggestion/review doesn’t return in X seconds, skip it rather than blocking CI.
- Track:
- Cost per PR / per 1k LOC changed.
- Added CI wall-clock time attributable to AI jobs.
If you’re building internal AI tooling:
- Default to async AI checks that augment but don’t block merges or deploys.
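For the tracking side, normalizing spend makes the numbers comparable across teams and weeks. The figures below are made up for illustration:

```python
def ai_cost_metrics(total_api_cost_usd: float,
                    pr_count: int,
                    loc_changed: int) -> dict[str, float]:
    """Normalize AI spend so it can be compared across teams and time."""
    return {
        "cost_per_pr": total_api_cost_usd / pr_count,
        "cost_per_1k_loc": total_api_cost_usd / (loc_changed / 1000),
    }

# Illustrative week: $420 of API spend, 140 PRs, 60k changed lines.
print(ai_cost_metrics(420.0, 140, 60_000))  # $3.00/PR, $7.00 per 1k LOC
```

Once this is on a dashboard with an owner, “AI infra cost” stops being nobody’s problem and becomes a budget line you can cap.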
