Your LLM Stack Is Probably Non-Compliant by Default


Why this matters this week

Two things are happening at once:

  1. Your org is quietly shoving more critical data into LLMs

    • Customer tickets, contracts, financials, source code, HR documents.
    • Often via “experimental” side projects, browser extensions, or internal tools.
    • Very few of these flows are wired into your existing data retention, access control, or audit pipelines.
  2. Security/compliance teams are starting to ask pointed questions, not theoretical ones

    • “Where exactly does the prompt and output get logged?”
    • “Can we prove this model never saw production PII?”
    • “What’s the retention policy on embeddings in that vector DB?”
    • “Does this align with our SOC2 controls and ISO 27001 Annex A, or is it a parallel universe?”

The gap between “we have policies” and “our LLM stack actually follows them” is widening. That’s the governance problem in 2024–2025: policy-to-runtime drift for AI systems.

If you don’t tackle this now, you’ll end up in one of two bad equilibria:

  • Freeze: AI usage stalls because security says “no” by default.
  • Shadow AI: Teams bypass central platforms and wire the APIs directly, creating an un-auditable mess.

You want a third path: policy-as-code for AI that your auditors, lawyers, and engineers can all reason about.


What’s actually changed (not the press release)

The “new” thing is not regulations per se; it’s the interaction between LLMs’ hungry data surfaces and your existing governance scaffolding.

Three concrete shifts:

  1. LLMs massively expand what counts as “in-scope data”
    Historically you could scope security systems around:

    • Source of truth DBs (Postgres, Snowflake, S3)
    • Log platforms (ELK, Datadog, Splunk)
    • Analytics/BI stack

    Now you have:

    • Prompt payloads containing free-form text that often bundles PII, secrets, and business-sensitive context.
    • Model training or fine-tuning pipelines ingesting production data.
    • Embeddings stored in vector DBs that are searchable but not easily inspectable.
    • “Memory” features that cache conversation context.

    Net effect: Any random textarea in your internal tools can become a data exfil path and a retention nightmare.

  2. Vendors changed their default posture (and it’s easy to misread)
    Many API-based LLM providers now advertise:

    • “We don’t train on your data by default.”
    • “You can configure regional data residency.”
    • “We have SOC2 / ISO / HIPAA-ish stories.”

    What changed:

    • Storage and training defaults are better than they were a year ago.

    What did not change:

    • Your responsibility to control what goes into prompts and where prompts/outputs are logged, cached, or backed up.
    • Your need to map these flows to your own SOC2/ISO control set.
  3. Audit expectations are catching up to AI-specific risks
    Early SOC2/ISO audits largely treated AI as “just another SaaS.” That’s fading. Auditors and internal risk teams are starting to ask:

    • Do you have a data classification policy that covers prompts, embeddings, and model artifacts?
    • Are model inputs/outputs treated consistently with log data in your retention schedule?
    • How do you enforce least privilege against AI features (who can send what to which model)?
    • Can you show traceability: from a user request to the model calls to the data sources consulted?

Governance for AI systems is maturing from slideware to “show me where in the code and Terraform this is enforced.”


How it works (simple mental model)

Use this mental model: Every LLM feature is a pipeline with four control points:

  1. Ingress (what can enter the model boundary)

    • User inputs, system prompts, retrieved documents, tool outputs.
    • Risks: PII leakage, secrets, regulated data crossing borders.
    • Controls:
      • Data classification and redaction before the model.
      • Allow/deny logic on which data sources can be retrieved.
      • Policy that says “this field never leaves this region/tenant.”
  2. Processing (what the model can do with the data)

    • Base models, fine-tuned models, RAG, tools.
    • Risks:
      • Training data contamination.
      • Irreversible mixing of tenants in a shared model.
      • Unreviewed tools that can mutate production.
    • Controls:
      • Clear separation of environments: eval, staging, prod.
      • Per-model and per-tool allowlists.
      • Decisions: no cross-tenant fine-tunes; prefer retrieval over fine-tune for multi-tenant.
  3. Persistence (what gets stored and for how long)

    • Logs (prompts/responses), embeddings, caches, feature stores, training datasets, model checkpoints.
    • Risks:
      • Prompts with PII kept for 2 years in logs “for debugging.”
      • Vector DB backups stored with weaker controls than the source DB.
      • Model snapshots containing now-forgotten regulated data.
    • Controls:
      • Explicit retention periods for each data class.
      • Tagging: “prompt-log:contains-PII” vs “prompt-log:safe-for-long-term.”
      • Automated deletion/TTL, not manual “we usually remember.”
  4. Egress (what leaves the system and who can see it)

    • LLM responses, derived analytics, monitoring dashboards, incident reports.
    • Risks:
      • Models summarizing sensitive data and leaking to unintended recipients.
      • Copilot-like features exposing cross-tenant information.
    • Controls:
      • Result filtering/guardrails (PII detectors, content filters).
      • Access control around generated artifacts (who can see that AI-generated summary?).
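The four control points lend themselves to a policy-as-code inventory: one record per LLM feature, each control mapped to a point and an enforcement mechanism, with uncovered points surfaced as audit gaps. A minimal sketch (all names here — ControlPoint, Control, LLMFeaturePolicy — are illustrative, not a real library):

```python
# Minimal policy-as-code sketch for the four control points.
from dataclasses import dataclass, field
from enum import Enum


class ControlPoint(Enum):
    INGRESS = "ingress"
    PROCESSING = "processing"
    PERSISTENCE = "persistence"
    EGRESS = "egress"


@dataclass
class Control:
    control_id: str          # maps to a SOC2/ISO control ID
    point: ControlPoint
    description: str
    enforced_by: str         # e.g. "middleware", "terraform", "vector-db config"


@dataclass
class LLMFeaturePolicy:
    feature: str
    controls: list[Control] = field(default_factory=list)

    def gaps(self) -> list[ControlPoint]:
        """Control points with no mapped control -- these are your audit gaps."""
        covered = {c.point for c in self.controls}
        return [p for p in ControlPoint if p not in covered]


policy = LLMFeaturePolicy(
    feature="support-assistant",
    controls=[
        Control("CC6.1-a", ControlPoint.INGRESS, "PII redaction before model", "middleware"),
        Control("CC6.7-b", ControlPoint.PERSISTENCE, "30-day TTL on prompt logs", "terraform"),
    ],
)
print([p.value for p in policy.gaps()])  # → ['processing', 'egress']
```

Reviewing these records in PRs gives you exactly the "show me where this is enforced" answer auditors are starting to expect.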

Governance = policies + code + evidence at each of the four points.

If you can map ingress, processing, persistence, and egress to:
– Concrete controls (code, config, infra), and
– Concrete evidence (logs, dashboards, attestations),

you’re in a good place for SOC2/ISO alignment and internal risk management.


Where teams get burned (failure modes + anti-patterns)

1. “We sanitized the model, but not the logs”

Pattern:
– Product team uses a hosted LLM with “no training on your data.”
– They even do PII redaction before sending the prompt to the model.
– But the unredacted text is logged by:
– Web servers
– APM/observability tools
– Prompt-logging middleware

Result:
You have a shadow data lake of sensitive prompts in multiple log systems, with no retention rules or auditing.

Anti-pattern indicators:
– log.info("User prompt: {}", prompt) calls scattered in code.
– Prompt payloads visible in third-party monitoring tools.
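A cheap mitigation is a logging filter that masks obvious PII and secrets before any handler writes the record, so a stray log.info never leaks raw prompts. A sketch in Python (the regexes are illustrative, not exhaustive — real PII detection needs a proper scanner):

```python
# Sketch: a logging filter that masks email-like and secret-like substrings
# before the record reaches any handler. Patterns are illustrative only.
import logging
import re

PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b(?:sk|api|key)[-_][A-Za-z0-9]{16,}\b"), "<SECRET>"),
]


class MaskSensitiveFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        for pattern, replacement in PATTERNS:
            msg = pattern.sub(replacement, msg)
        record.msg, record.args = msg, None   # freeze the masked message
        return True


logger = logging.getLogger("prompt")
logger.addHandler(logging.StreamHandler())
logger.addFilter(MaskSensitiveFilter())
logger.warning("User prompt: contact alice@example.com, key sk-abcdef1234567890")
# every handler now sees "... contact <EMAIL>, key <SECRET>"
```

Note this only covers your own loggers — third-party APM agents capturing request bodies need their own masking configuration.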

2. “Embedding everything, classifying nothing”

Pattern:
– Team builds RAG: “Index all customer docs in a vector DB.”
– No data classification. No tenant separation. No retention policy.
– AI assistant for internal ops suddenly answers:

“Customer X’s contract was cancelled for non-payment…”

Result:
Cross-tenant leakage via retrieval; impossible-to-explain behavior to Legal and Sales.

Anti-pattern indicators:
– Single shared vector index for all tenants.
– No metadata fields like tenant_id, classification, retention_until.
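The fix is to attach that metadata to every chunk and filter on it at query time. A hedged in-memory sketch (field names and the toy index are illustrative; real vector DBs expose equivalent metadata filters on their query APIs):

```python
# Sketch of tenant-aware, classification-aware retrieval over chunk metadata.
from dataclasses import dataclass
from datetime import date


@dataclass
class Chunk:
    text: str
    tenant_id: str
    classification: str       # e.g. "public", "internal", "restricted"
    retention_until: date


def retrieve(index: list[Chunk], tenant_id: str, max_class: str, today: date) -> list[Chunk]:
    """Only return chunks for this tenant, within classification, not expired."""
    allowed = {"public": 0, "internal": 1, "restricted": 2}
    return [
        c for c in index
        if c.tenant_id == tenant_id
        and allowed[c.classification] <= allowed[max_class]
        and c.retention_until >= today
    ]


index = [
    Chunk("Pricing FAQ", "acme", "public", date(2026, 1, 1)),
    Chunk("Contract cancelled for non-payment", "globex", "restricted", date(2026, 1, 1)),
]
hits = retrieve(index, tenant_id="acme", max_class="internal", today=date(2025, 6, 1))
print([c.text for c in hits])  # → ['Pricing FAQ'] -- Globex's contract never leaks
```

The key design choice: the filter runs before similarity search results reach the prompt, so cross-tenant leakage is structurally impossible rather than prompt-dependent.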

3. “Fine-tune first, ask questions later”

Pattern:
– Some group fine-tunes a model on production tickets and chat transcripts to get better support responses.
– No review of:
– PII/PHI presence.
– Regulatory scope of the data.
– Whether the fine-tune will be reused across tenants.

Result:
You’re now stuck with a model artifact that may contain regulated data, with no playbook for:
– Deletion on customer request.
– Geographic isolation.
– Demonstrating to auditors what data went in.

Anti-pattern indicators:
– No dataset manifest or lineage for fine-tune jobs.
– Fine-tune jobs kicked off from notebooks without CI/CD or approvals.
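A dataset manifest written alongside every fine-tune job is the cheapest form of lineage. A sketch of what one might contain (all field names are illustrative, not a standard):

```python
# Sketch: a manifest recorded next to every fine-tune job, so "what data went
# into this model artifact" has a checkable answer later.
import hashlib
import json
from datetime import datetime, timezone


def build_manifest(dataset_path: str, records: list[str], *, pii_reviewed: bool,
                   tenants: list[str], approved_by: str) -> dict:
    digest = hashlib.sha256("\n".join(records).encode()).hexdigest()
    return {
        "dataset_path": dataset_path,
        "sha256": digest,                 # pins the exact data that was used
        "record_count": len(records),
        "pii_reviewed": pii_reviewed,     # gate: refuse to train if False
        "tenants": tenants,               # cross-tenant fine-tunes become visible
        "approved_by": approved_by,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


manifest = build_manifest(
    "s3://ft-data/support-2025.jsonl",
    ["ticket-1 ...", "ticket-2 ..."],
    pii_reviewed=True,
    tenants=["acme"],
    approved_by="security-review-142",
)
print(json.dumps(manifest, indent=2))
```

Your CI for fine-tune jobs can then refuse to launch anything without a manifest, or with pii_reviewed set to False.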

4. “Policy on Confluence, reality in YAML”

Pattern:
– Security/compliance writes strong policies:
– “PII must be retained for no more than 1 year, then deleted.”
– “All access to production data must be logged and reviewed.”
– LLM platform is configured ad hoc:
– Default unlimited log retention.
– No masking of secrets or PII.
– Policies are not encoded in code or infra.

Result:
SOC2 / ISO auditors ask for proof, and you have nothing except docs and goodwill.

Anti-pattern indicators:
– No tests or CI checks around governance.
– Terraform/Helm values have no link to policy IDs or control IDs.
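Closing that gap can be as simple as a CI check that compares deployed values against a policy table keyed by control ID. A sketch (the policy IDs and the way you'd extract deployed retention values from Terraform/Helm output are stand-ins for your real setup):

```python
# Sketch of a CI guardrail: flag any log retention that drifts from policy.
POLICY = {
    "POL-LOG-001": {"resource": "prompt_logs", "max_retention_days": 30},
    "POL-LOG-002": {"resource": "app_logs", "max_retention_days": 180},
}


def check_retention(deployed: dict[str, int]) -> list[str]:
    """Return one violation message per resource exceeding its policy."""
    violations = []
    for policy_id, rule in POLICY.items():
        actual = deployed.get(rule["resource"])
        if actual is None or actual > rule["max_retention_days"]:
            violations.append(
                f"{policy_id}: {rule['resource']} retention is {actual}, "
                f"policy allows {rule['max_retention_days']}"
            )
    return violations


# e.g. values parsed from a Terraform plan or Helm values file
problems = check_retention({"prompt_logs": 365, "app_logs": 90})
for p in problems:
    print("FAIL:", p)   # a CI job would exit nonzero on any violation
```

The important property is the link: every check names a policy ID, so the auditor's question "where is POL-LOG-001 enforced?" has a greppable answer.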


Practical playbook (what to do in the next 7 days)

Assume you have limited time and political capital. Focus on two goals:

  1. Make data flows visible and classifiable.
  2. Turn 2–3 critical policies into code.

Day 1–2: Map the AI data surface

  1. Inventory AI entry points

    • Where in your product can text/code/documents be sent to an LLM?
    • Include: web UIs, CLIs, browser extensions, Slack bots, internal tools.
  2. For each entry point, answer:

    • What data types can appear (PII, PHI, secrets, financials, source code)?
    • Which model/provider is called?
    • What gets logged and where?
    • Are prompts/outputs persisted (databases, object storage, vector DBs)?

Write this as a simple table. This is your AI data map.
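The data map is more useful if it lives in the repo where it can be diffed in PRs rather than in a wiki. A sketch of it as a literal structure (the entries and field names are illustrative):

```python
# Sketch: the AI data map as a reviewable, diffable structure in the repo.
AI_DATA_MAP = [
    {
        "entry_point": "support Slack bot",
        "data_types": ["PII", "financials"],
        "provider": "hosted-llm-api",
        "logged_to": ["app logs", "APM"],
        "persisted_in": ["vector DB", "prompt-log bucket"],
    },
    {
        "entry_point": "internal code assistant",
        "data_types": ["source code", "secrets?"],
        "provider": "self-hosted model",
        "logged_to": ["app logs"],
        "persisted_in": [],
    },
]

for row in AI_DATA_MAP:
    print(f"{row['entry_point']}: {', '.join(row['data_types'])} -> {row['provider']}")
```

A "secrets?" entry with a question mark is fine at this stage — an honest unknown in the map beats a confident blank.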

Day 3–4: Implement two hard controls end-to-end

Pick two concrete policies and wire them fully from doc → code → evidence.

Suggested starting policies:

  1. Prompt logging policy

    • Rule: “Prompts with PII/secrets must not be stored in plaintext logs.”

    Implementation ideas:

    • Add a middleware that:
      • Runs lightweight PII/secret detection.
      • Masks or drops risky prompts before logging.
      • Tags logs with prompt_contains_sensitive=true/false.
    • Configure log retention:
      • e.g., 30 days for sensitive-tagged entries, 180+ days for non-sensitive.

    Evidence:

    • Show log samples with masking applied, plus the retention configuration for sensitive-tagged entries.
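The detect-tag-route flow above might look like this in middleware form (the detection regex is a toy stand-in for a real scanner, and the retention routing assumes your log pipeline maps a tag to a TTL):

```python
# Sketch of the prompt-logging middleware: detect, mask, tag, and route
# retention by sensitivity. Detection here is deliberately simplistic.
import re

SENSITIVE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+|\b\d{3}-\d{2}-\d{4}\b")  # email / SSN-like


def log_prompt(prompt: str) -> dict:
    """Return the structured log entry the pipeline would actually store."""
    sensitive = bool(SENSITIVE.search(prompt))
    return {
        "prompt": SENSITIVE.sub("<REDACTED>", prompt) if sensitive else prompt,
        "prompt_contains_sensitive": sensitive,
        # retention routing: the log pipeline maps this field to a TTL
        "retention_days": 30 if sensitive else 180,
    }


entry = log_prompt("Refund order for bob@example.com please")
print(entry)
# → {'prompt': 'Refund order for <REDACTED> please',
#    'prompt_contains_sensitive': True, 'retention_days': 30}
```

Those stored entries — masked text, sensitivity tag, retention field — are themselves the audit evidence the policy calls for.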
