Policy-as-Code or Policy-as-Slide-Deck? Getting Serious About AI Data Governance

Why this matters this week
If you’re putting LLMs anywhere near production data, you’ve quietly become a data regulator for your own company.
In the last two months, three things have converged for most engineering orgs:
- Security teams are finally reading your AI design docs.
- Customers are starting to add “AI use” sections to DPAs and security questionnaires.
- Internal uses of “chat with our data” are leaking beyond the pilot group.
This shifts AI from “innovation sandbox” to “regulated system,” even if you’re not in finance or healthcare. The practical consequences:
- Data retention and provenance now affect model outputs and legal exposure.
- Auditability is no longer “keep some logs”; it’s “prove what you did with which data, when.”
- SOC2/ISO controls are being interpreted to cover generative AI systems, not just classic SaaS.
- Policy-as-code moves from nice-to-have to the only scalable way to keep AI usage compliant.
If you don’t intentionally design for privacy & governance, your default posture is:
- “We’ll discover our incidents from angry customers and leaked prompts,” and
- “We’ll rebuild this in six months when GRC blocks our enterprise deals.”
What’s actually changed (not the press release)
Notable real-world shifts from engineering teams in the last quarter:
- Vendors are pushing more logging and controls… but not wiring them for you.
  - LLM APIs now expose:
    - Data retention flags
    - Log redaction options
    - Fine-tuned model isolation controls
  - Most teams:
    - Set the flags once in a config file
    - Don’t validate that they’re actually enforced end-to-end
    - Never tie these settings back to a formal data handling policy
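That middle gap is checkable. A minimal sketch of validating at the call site that every outbound request actually carries the retention settings your policy assumes, rather than trusting a config file set once — the flag names (`store`, `log_redaction`) are hypothetical placeholders, not any specific vendor’s API:

```python
# Settings our (assumed) data handling policy requires on every outbound
# LLM request. Flag names are illustrative, not a real vendor API.
REQUIRED_SETTINGS = {"store": False, "log_redaction": True}

def validate_retention_settings(request_body: dict) -> list[str]:
    """Return a list of violations; an empty list means the request complies."""
    violations = []
    for key, expected in REQUIRED_SETTINGS.items():
        actual = request_body.get(key)
        if actual != expected:
            violations.append(f"{key} must be {expected!r}, got {actual!r}")
    return violations

# A request that forgot to disable retention gets flagged before it leaves.
bad_request = {"model": "some-model", "messages": [], "store": True}
print(validate_retention_settings(bad_request))
```

Running this check in the client library (and alerting on violations) turns a one-time config choice into a continuously enforced control.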
- “Shadow AI” is no longer just people using public chatbots.
  Three patterns:
  - Teams quietly deploying internal RAG tools on shared infra without:
    - Data classification checks on what can be indexed
    - Tenant isolation between customers
  - Product teams calling LLM APIs directly from services that have access to PII, assuming “we’re anonymizing it.”
  - BI/analytics teams dumping prod snapshots into vector stores “for experimentation,” then never deleting them.
- Auditors and customers are asking pointed AI questions.
  SOC2/ISO auditors and enterprise customers now routinely ask:
  - “Which models process customer data, and where are they hosted?”
  - “What is your data retention period for prompts and generated content?”
  - “Can you demonstrate that training data is not reused without consent?”
  Teams often don’t have a single, authoritative answer.
- Legal risk from model behavior is becoming concrete.
  Two anonymised examples:
  - A fintech tool generated investment summaries that appeared to reference specific customer transactions. The system hadn’t logged which data sources were consulted, so the team couldn’t prove whether PII was or wasn’t used.
  - A support assistant hallucinated a data handling policy that was stricter than reality. A customer cited this in a dispute. Because there was no content provenance or disclaimers, legal had to negotiate based on an LLM’s fabrication.
The net: we’re past the “toy app” era. Governance and privacy controls now directly affect revenue, deal velocity, and incident response.
How it works (simple mental model)
You can think about AI privacy & governance as four control planes that need to line up:
- Data Plane – What data can flow where?
  - Sources: production DBs, data warehouse, ticketing systems, customer uploads.
  - Sinks: LLM APIs, vector stores, fine-tuning datasets, logs.
  - Core questions:
    - Which classes of data (PII, PHI, secrets, internal-only) are allowed into which AI systems?
    - Under what transformations (masking, aggregation, synthetic data)?
- Model Plane – What behavior is allowed?
  This is model risk, not model accuracy:
  - Can the model:
    - Reference specific users or accounts?
    - Perform actions (API calls, workflow automations)?
    - Generate content that looks like policy or contract language?
  - Controls:
    - System prompts and guardrails
    - Tool/action whitelists
    - Output filters (e.g., PII re-detectors)
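As a sketch of the last control, a toy output filter that re-detects and redacts PII in generated text before it leaves the system. A real deployment would use a proper PII detection library; these two regexes are deliberately simplistic:

```python
import re

# Toy "PII re-detector" on the model plane: a last check on generated text.
# The patterns are illustrative examples, not production-grade detection.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def filter_output(text: str) -> tuple[str, list[str]]:
    """Redact matches and report which PII types were found (for alerting)."""
    found = []
    for name, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(name)
            text = pattern.sub(f"[REDACTED-{name.upper()}]", text)
    return text, found
```

The `found` list matters as much as the redaction: a spike in detections is an early signal that upstream masking or retrieval filtering has regressed.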
- Policy Plane – What are the rules, and how are they encoded?
  - You probably already have:
    - Data classification policy
    - Data retention policy
    - Access control policy
  - For AI, you need:
    - AI-specific interpretations of those policies
    - Machine-readable constraints that can be enforced at runtime
  - This is where policy-as-code appears:
    - Express “what is allowed” in something like OPA/Rego, Cedar, or homegrown rules
    - Wire those policies into request routing, feature flags, and data access layers
- Audit Plane – How do we prove what happened?
  - Logs that tie together:
    - Who/what called the AI system (user, service).
    - Which data sources were consulted (indices, tables, tenants).
    - Which policy decisions were evaluated and their outcomes.
    - Which model was used (version, provider, configuration).
  - For SOC2/ISO alignment:
    - You don’t need perfect lineage.
    - You do need:
      - Consistent, queryable logs
      - Regular reviews (and evidence of those reviews)
      - Demonstrable access controls and change management
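One way to make the audit plane concrete: emit a single structured record per AI request that links all four items. The field names here are illustrative, not mandated by SOC2 or ISO:

```python
import datetime
import json

def audit_record(caller, sources, policy, decision, model):
    """One queryable line per AI request, linking who, what data, which
    policy decision, and which model. Field names are illustrative."""
    return {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "caller": caller,          # user or service identity
        "data_sources": sources,   # indices/tables/tenants consulted
        "policy": policy,          # policy id + version that was evaluated
        "decision": decision,      # allow/deny and the reason
        "model": model,            # provider, name, version/config
    }

record = audit_record(
    caller="svc-support-bot",
    sources=["tickets_index/tenant-42"],
    policy={"id": "ai-data-policy", "version": "1.3"},
    decision={"outcome": "allow", "reason": "classification=internal"},
    model={"provider": "internal", "name": "gen-v2"},
)
print(json.dumps(record))
```

A flat JSON line like this is enough for “consistent, queryable logs”: it can be shipped to whatever log store you already run and filtered by tenant, policy version, or model during a review or incident.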
Everything else (fancy governance dashboards, model catalogs, etc.) is mostly UX around these four planes.
Where teams get burned (failure modes + anti-patterns)
Common patterns from real deployments:
- “We mask PII in the prompt, so we’re safe.”
  Problems:
  - Masking often happens ad-hoc in app code, not at a central boundary.
  - Edge cases:
    - Free-form text where PII detectors miss subtle identifiers.
    - Logs capturing unmasked input before transformation.
  - Vector stores still hold raw documents with PII, even if prompts are masked.
- Single global “turn off data retention” switch.
  - Teams set a vendor flag to disable training/data retention and assume compliance.
  - Issues:
    - Not all endpoints respect that flag the same way.
    - Third-party tools and plugins may call other models with different defaults.
    - Internal logging pipelines (APM, tracing) still hold sensitive content.
  - Result: false confidence + blind spots in incident response.
- “Governance as approval” instead of “governance as code.”
  Example pattern:
  - Security/legal sign off a Confluence page describing AI use.
  - Engineers then incrementally expand scope (“just add this table to the index”) without re-review.
  - There is no automated check that:
    - New data sources adhere to the same rules.
    - Environment drift hasn’t opened new data paths.
  - Outcome: design-time approval; runtime behavior diverges.
- Over-logging without discrimination.
  - All prompts, responses, and internal tool calls get dumped into observability systems.
  - Those systems:
    - Aren’t scoped for sensitive data
    - Have longer retention than allowed by policy
    - Are accessible by too many engineers
  - This becomes the actual privacy risk, not the LLM vendor.
- Treating “AI system” as a single asset.
  Real-world scenario:
  - An “AI support bot” app:
    - Uses three models (classification, retrieval, generation)
    - Touches five data sources with different classifications
  - The asset inventory lists this as one system with one risk profile.
  - During a review or incident, nobody can answer:
    - Which component accessed what data?
    - Whether the generator ever saw raw tickets or just summaries?
Practical playbook (what to do in the next 7 days)
This assumes you already have at least one AI use case in production, or close to it.
Day 1–2: Inventory and classify
- List all AI interactions that touch real data.
  Include:
  - Direct LLM API calls from services
  - Internal tools (RAG apps, copilots)
  - Analytics/experimentation pipelines
- Classify each by data sensitivity and action power:
  - Data sensitivity:
    - Low: public docs, marketing copy
    - Medium: internal docs, non-PII logs
    - High: PII, financial data, customer-specific docs
  - Action power:
    - Read-only suggestion (no auto-actions)
    - Human-in-the-loop actions (user approves)
    - Automated actions (LLM can update systems)
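Keeping this inventory as structured data rather than a wiki table lets the later guardrails key off it directly. A minimal sketch; the tier names mirror the lists above, and the example systems are made up:

```python
from dataclasses import dataclass

# Tier names mirror the classification above; example systems are made up.
SENSITIVITY = ("low", "medium", "high")
ACTION_POWER = ("read_only", "human_in_loop", "automated")

@dataclass
class AIInteraction:
    name: str
    sensitivity: str   # one of SENSITIVITY
    action_power: str  # one of ACTION_POWER

    def __post_init__(self):
        # Reject typos at load time so the inventory stays machine-usable.
        assert self.sensitivity in SENSITIVITY, self.sensitivity
        assert self.action_power in ACTION_POWER, self.action_power

inventory = [
    AIInteraction("support-rag-bot", "high", "human_in_loop"),
    AIInteraction("marketing-copilot", "low", "automated"),
]

# Review the riskiest entries first.
risky = [i.name for i in inventory if i.sensitivity == "high"]
```

The same records can later drive enforcement (which systems may call which models) instead of living only in a spreadsheet.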
- Identify where data crosses trust boundaries.
  - External vendors vs. internal models
  - Regions/jurisdictions
  - Production vs. lower environments
Day 3–4: Establish minimal policy-as-code guardrails
Start small; you don’t need a full platform.
- Define 3–5 executable rules per sensitivity/action tier.
  Example rules (in English, then code later):
  - High-sensitivity data:
    - May not be sent to external LLMs.
    - May only be indexed into stores with per-tenant isolation.
  - Automated actions:
    - Must not be executed based solely on LLM free-text; require structured checks.
    - Must log action, inputs, and approvals.
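Those English rules map naturally onto the “homegrown rules” option from the Policy Plane. A sketch, with illustrative request fields; each rule returns a reason string on violation and `None` otherwise:

```python
# Homegrown policy-as-code sketch: each rule returns a reason string on
# violation, or None. Request field names are illustrative assumptions.

def rule_no_high_to_external(req):
    if req.get("data_class") == "high" and req.get("provider") == "external":
        return "HIGH data may not be sent to external LLMs"

def rule_high_needs_tenant_isolation(req):
    if (req.get("data_class") == "high"
            and req.get("sink") == "vector_store"
            and not req.get("tenant_isolated")):
        return "HIGH data may only be indexed into tenant-isolated stores"

def rule_actions_need_structured_check(req):
    if req.get("action") and not req.get("structured_check_passed"):
        return "automated actions require structured checks, not LLM free-text"

RULES = [
    rule_no_high_to_external,
    rule_high_needs_tenant_isolation,
    rule_actions_need_structured_check,
]

def evaluate(req: dict) -> list[str]:
    """Collect all violation reasons for a request (empty list = allowed)."""
    return [reason for rule in RULES if (reason := rule(req))]
```

The same rule set can later be ported to OPA/Rego or Cedar; starting with plain functions keeps the first iteration reviewable by the whole team.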
- Choose an enforcement point and wire it.
  - For outbound LLM calls:
    - Introduce a single client library or gateway.
    - Enforce:
      - Allowed model list
      - Data classification tags on payloads
      - Basic checks: “no HIGH data to external providers”
  - For data ingestion into vector stores:
    - Require a data classification field.
    - Refuse ingest if classification is missing or disallowed for that index.
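For the outbound-call side, the enforcement point can be as small as one wrapper function every service must use. A sketch; the model names and the `send` callable it delegates to are assumptions, not a real SDK:

```python
# Single enforcement point for outbound LLM calls. Model names are made up;
# `send` stands in for whatever transport/SDK actually performs the call.
ALLOWED_MODELS = {"internal-gen-v2", "external-small"}
EXTERNAL_MODELS = {"external-small"}

class PolicyViolation(Exception):
    pass

def call_llm(model: str, payload: dict, send=lambda m, p: {"ok": True}):
    if model not in ALLOWED_MODELS:
        raise PolicyViolation(f"model {model!r} is not on the allowed list")
    if "data_class" not in payload:
        raise PolicyViolation("payload must carry a data classification tag")
    if payload["data_class"] == "high" and model in EXTERNAL_MODELS:
        raise PolicyViolation("no HIGH data to external providers")
    return send(model, payload)
```

The point of a single wrapper is that direct calls to provider SDKs can then be banned by lint rule or network egress policy, so new code paths cannot silently bypass the checks.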
- Log policy decisions.
  - Whenever a request is allowed/denied based on policy:
    - Log: policy version, decision, reason, caller, data class, model.
  - Make those logs queryable by security/ops.
Day 5–6: Align with SOC2/ISO expectations
Translate what you already did into control language.
- Map to basic control themes:
  - Access control:
    - Who can configure models, add data sources, or change policies?
  - Change management:
    - How are model upgrades and policy updates reviewed?
  - Logging & monitoring:
    - Can you detect anomalous access patterns or policy violations?
  - Data retention:
    - How long are prompts and outputs stored, where, and by whom?
- Document evidence you could show tomorrow:
  - Diagrams of data flows for each AI system.
  - Sample logs showing:
