Stop Treating Fintech Infrastructure Like a SaaS Subscription

Why this matters this week
Fintech infra (payments routing, fraud, AML/KYC, risk, compliance, open banking) has quietly crossed a line:
- The regulatory bar is rising (Travel Rule enforcement, real-name verification, sanctions focus, PSD2/3 & RTS evolution).
- Card networks and banks are tightening risk appetites and raising scheme fines.
- Venture money is no longer subsidizing margin: your cost per transaction and cost of false positives are now board-level metrics.
- And you’re almost certainly running “mission-critical” logic in a patchwork of vendors + scripts that nobody fully owns.
For a lot of companies, what was set up as “we’ll just plug into X and move on” has become a production dependency that:
- Drives unit economics (interchange, FX, chargebacks, fraud loss, KYC costs).
- Affects uptime and customer churn directly.
- Is now sufficiently complex that it’s a system, not an integration.
If you’re a CTO / head of engineering in fintech, this week’s work often includes one or more of:
- A new geography launch that “just needs another PSP, it’s easy.”
- A compliance team escalation about sanctions or PEP screening coverage.
- Finance complaining that fraud losses and chargebacks don’t reconcile with reporting.
- A card program manager warning that your dispute ratios are flirting with thresholds.
This post is about treating your fintech infrastructure like core infra, not like a set of SaaS checkboxes you bought in 2021.
What’s actually changed (not the press release)
Three concrete shifts in the last 12–18 months matter for payments, fraud, AML/KYC, and risk:
1. Risk & compliance are no longer separable from product velocity
- Historically: “Compliance is a gate at the edge.” Engineers built flows; compliance appended rules.
- Now: Real-time AML monitoring, sanctions screening, and transaction risk scoring are inline with:
- Payment routing decisions
- Limits & holds
- Payout schedules
- That means latency, uptime, and data design for these systems directly shape customer experience (e.g., instant payouts vs. “pending review”).
2. The regulator-vendor gap narrowed
Regulators increasingly assume:
- You actually understand what your fraud / AML vendors do.
- You can explain and tune the models / rules.
- You can prove coverage (sanctions, adverse media, transaction monitoring scenarios).
“Our vendor handles it” is no longer safe. Recent exams and enforcement actions focus on:
- Over-reliance on black-box vendors with no internal second-line understanding.
- Poor data quality feeding these systems (missing fields, wrong timestamps, no device data).
- No systematic backtesting or effectiveness reviews.
3. Economic pressure made infra quality visible
With free money gone, people are discovering:
- Latent PSP margin leakage (e.g., no smart routing, no retry logic, no MCC/issuer optimization).
- Overly aggressive fraud controls causing double-digit revenue loss via false positives.
- AML vendors billing per-screen or per-hit with no dedupe strategy or caching, inflating costs.
A 1–2% recovery on payments acceptance, or a 20–30% reduction in false positives, is often worth more than the next greenfield feature.
How it works (simple mental model)
Stop thinking “payments provider + fraud tool + KYC vendor.”
Start thinking in four layers:
1. Event capture layer
This is your source of truth for financial events across systems:
- Auths, captures, voids, refunds
- Wallet top-ups, payouts, chargebacks, disputes
- KYC/KYB onboarding events, document checks, PEP/sanctions hits
Properties:
- Append-only, immutable events (with corrections as new events, not updates).
- Globally unique IDs, idempotent ingestion, stable schemas.
- Timestamps that are unambiguous (UTC, with source-time vs processing-time separated).
If you skip this layer, you’ll end up reconciling CSV exports from:
- PSP A
- PSP B
- Fraud vendor
- Core ledger
…in Excel at month-end.
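To make those properties concrete, here is a minimal in-memory sketch of an append-only log with idempotent ingestion. The `FinancialEvent` and `EventLog` names, the field names, and the use of `event_id` as the idempotency key are all illustrative assumptions; a real system would sit on a durable store rather than a Python list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class FinancialEvent:
    """Hypothetical event shape; field names are assumptions."""
    event_id: str                   # globally unique, doubles as the idempotency key
    event_type: str                 # e.g. "auth", "capture", "refund", "correction"
    source: str                     # which PSP, vendor, or internal service emitted it
    occurred_at: datetime           # source time, as reported by the emitter
    payload: dict
    corrects: Optional[str] = None  # event_id this event corrects, if any


class EventLog:
    """In-memory stand-in for an append-only store."""

    def __init__(self) -> None:
        self._events: list[tuple[FinancialEvent, datetime]] = []
        self._seen: set[str] = set()

    def append(self, event: FinancialEvent) -> bool:
        """Store the event; return False if it is a duplicate delivery."""
        if event.occurred_at.tzinfo is None:
            raise ValueError("occurred_at must be timezone-aware")
        if event.event_id in self._seen:
            return False  # idempotent: replays are ignored, never overwritten
        self._seen.add(event.event_id)
        received_at = datetime.now(timezone.utc)  # processing time, kept separate
        self._events.append((event, received_at))
        return True

    def __len__(self) -> int:
        return len(self._events)
```

There is deliberately no update or delete method: a wrong amount is fixed by appending a new event whose `corrects` field points at the original.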
2. Decision engine layer
Real-time and near-real-time decisions:
- Should we allow this payment?
- Should we ask for additional KYC?
- Should we raise this user’s limit?
- Should we delay this payout pending checks?
Architecturally:
- Stateless services using feature stores built on the event stream.
- Mix of rules (deterministic) and models (probabilistic).
- Side-effects: flags, holds, risk scores, case creation.
This should be vendor- and channel-agnostic:
- Same logic regardless of PSP, card vs ACH, web vs mobile.
- Vendors become data sources and actuators, not owners of your policy.
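As a rough sketch of what "rules plus models, stateless, vendor-agnostic" can look like: the function below takes a feature dict and a scoring function and returns an action along with the rules that fired. The rule names, feature names (`sanctions_hit`, `txn_count_1h`), and the 0.8 threshold are made up for illustration, not recommendations.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Decision:
    action: str                   # "allow", "review", or "block"
    fired_rules: tuple[str, ...]  # kept so the decision can be explained later
    risk_score: float


# Higher number wins when several rules fire.
SEVERITY = {"allow": 0, "review": 1, "block": 2}

# Deterministic rules as (name, predicate over features, action). Hypothetical.
RULES = [
    ("sanctions_hit", lambda f: f.get("sanctions_hit", False), "block"),
    ("high_velocity", lambda f: f.get("txn_count_1h", 0) > 10, "review"),
]


def decide(
    features: dict,
    score_fn: Callable[[dict], float],
    review_threshold: float = 0.8,
) -> Decision:
    """Pure function of its inputs: no PSP, channel, or vendor logic inside."""
    score = score_fn(features)  # probabilistic part (in-house model or vendor score)
    fired = [(name, action) for name, pred, action in RULES if pred(features)]
    if score >= review_threshold:
        fired.append(("model_score_high", "review"))
    action = max(
        (a for _, a in fired), key=SEVERITY.__getitem__, default="allow"
    )
    return Decision(action, tuple(name for name, _ in fired), score)
```

Because a vendor score enters only through `score_fn`, swapping the vendor changes an input, not the policy.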
3. Execution layer (rails & orchestration)
This is where actual money moves and compliance checks run externally:
- PSPs, banks, card networks, open banking aggregators.
- KYC vendors, document verification, AML/sanctions providers.
- Card processors, scheme tokenization, network token vaults.
Your job here:
- Normalize responses into your event schema.
- Implement resilience (retries with idempotency keys, fallback routing).
- Model external SLAs and failure modes (maintenance windows, rate limits).
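A minimal sketch of retries with an idempotency key plus ordered fallback, under some loud assumptions: a "PSP client" here is just a callable taking `(payment, idempotency_key)`, and `TransientPSPError` is a hypothetical exception standing in for timeouts, 5xx responses, and rate limits. Backoff between attempts is omitted for brevity.

```python
import uuid
from typing import Callable, Optional, Sequence


class TransientPSPError(Exception):
    """Hypothetical: a failure that is worth retrying."""


# Assumed client shape: (payment, idempotency_key) -> normalized response dict.
PSPClient = Callable[[dict, str], dict]


def charge_with_fallback(
    payment: dict,
    psps: Sequence[tuple[str, PSPClient]],
    attempts_per_psp: int = 2,
    idempotency_key: Optional[str] = None,
) -> dict:
    """Try each PSP in order, reusing one idempotency key for every attempt."""
    key = idempotency_key or str(uuid.uuid4())
    errors = []
    for name, client in psps:
        for attempt in range(1, attempts_per_psp + 1):
            try:
                response = client(payment, key)
            except TransientPSPError as exc:
                errors.append(f"{name}#{attempt}: {exc}")
                continue  # retry the same PSP, then fall through to the next
            return {
                "psp": name,
                "attempt": attempt,
                "idempotency_key": key,
                **response,
            }
    raise RuntimeError("all PSPs failed: " + "; ".join(errors))
```

One caveat the sketch does not handle: an idempotency key only protects you within a single PSP. After an ambiguous timeout, failing over to a second PSP can double-charge unless the first attempt is confirmed failed or voided.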
4. Assurance layer (governance, audit, analytics)
Cross-cutting systems that:
- Provide auditable trails for decisions (which rules fired, what features).
- Support compliance testing (scenario backtesting, effectiveness reviews).
- Feed product & finance with reliable metrics (loss, recovery, chargeback rates, sanction hit rates).
This is where you:
- Detect model and rule drift.
- Prove to auditors and regulators that your controls work.
- Avoid “we have no idea why this customer was blocked” incidents.
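One way to avoid that last incident is to write a small audit record at decision time, capturing the rules that fired, a snapshot of the features, and the policy version. The sketch below assumes a plain dict record and hypothetical field names; it shows the shape of the data, not a storage design.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional


def build_audit_record(
    decision_id: str,
    customer_id: str,
    action: str,
    fired_rules: Iterable[str],
    features: dict,
    policy_version: str,
    decided_at: Optional[datetime] = None,
) -> dict:
    """Capture everything needed to explain the decision later."""
    return {
        "decision_id": decision_id,
        "customer_id": customer_id,
        "action": action,
        "fired_rules": list(fired_rules),
        "features": dict(features),        # copy, so later changes don't leak in
        "policy_version": policy_version,  # which rule/model set was live
        "decided_at": (decided_at or datetime.now(timezone.utc)).isoformat(),
    }


def explain(records: Iterable[dict], decision_id: str) -> str:
    """Answer 'why was this customer blocked?' from stored records alone."""
    for r in records:
        if r["decision_id"] == decision_id:
            rules = ", ".join(r["fired_rules"]) or "no rules"
            return (
                f"{r['action']} for {r['customer_id']} under policy "
                f"{r['policy_version']} at {r['decided_at']}: {rules} fired"
            )
    raise KeyError(decision_id)
```

The point of storing the feature snapshot rather than re-deriving it is that the customer's data will have changed by the time an auditor asks.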
If you map your current stack into these four layers, you’ll usually find at least one:
- Vendor that is silently doing work in multiple layers (e.g., fraud vendor also making routing choices).
- Missing persistence or events that make post-hoc analysis impossible.
Where teams get burned (failure modes + anti-patterns)
1. Blind vendor delegation
Pattern:
- “We bought a KYC + AML SaaS. They’re the experts.”
- No one internally can explain:
- How watchlists are updated, with what latency.
- What a risk score of 72 actually means.
- What fields are passed and which are consistently null.
Impact:
- Examiners ask questions you can’t answer.
- Performance plateaus because you never tune or enrich.
Anti-pattern variant:
- Using a fraud provider’s SDK in the frontend that sends rich device telemetry, but your backend:
- Doesn’t link device IDs to transactions reliably.
- Drops half the metadata before it reaches the decision engine.
2. Local optimizations that break globally
Example patterns:
- Product team boosts acceptance by loosening 3DS / SCA logic to hit growth goals.
- Fraud team tightens rules to hit quarterly loss targets.
- Compliance adds more “hard stops” in AML scenarios.
Nobody owns a global objective function, such as revenue minus (losses + cost of operations + cost of capital).
Result: thrash. More disputes, more manual review, more user complaints, higher scheme fines.
3. Multiple PSPs, single-brain failure
Common for global merchants and marketplaces:
- You integrate 3–5 PSPs and an acquirer for redundancy and cost arbitrage.
- But routing is static (geo or percentage-based).
- Retries are naive (same PSP, same parameters).
Failure modes:
- A PSP outage in the EU takes down everything statically routed to it (e.g., 40% of global volume), because nothing reroutes.
- Regulatory change (e.g., SCA) handled differently by each PSP; you can’t normalize or experiment.
4. KYC/AML as “signup friction” only
Risky pattern:
- You do full KYC at signup, then… nothing.
- Transaction monitoring is a checkbox in a third-party dashboard nobody regularly reviews.
- No periodic KYC refresh, no dynamic risk-based KYC levels.
This is explicitly called out in multiple enforcement actions now.
You need ongoing monitoring, not just one-time checks.
5. No testable models and rules
- Rules are edited manually in vendor UIs with no version control.
- Models are retrained ad-hoc with no A/B or shadow deployments.
- No offline replay framework to test policy changes on historical data.
So you avoid changes because you fear breakage, which leads to a frozen risk posture that decays over time.
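An offline replay framework does not have to be elaborate to be useful. The sketch below runs the current and a candidate policy over labeled historical events and counts where they disagree. It assumes each event is a dict with a `was_fraud` label (for example, derived from chargebacks) and that a policy is a function returning "allow" or "block"; both are illustrative simplifications.

```python
from collections import Counter
from typing import Callable, Iterable

# Assumed policy shape: event dict -> "allow" or "block".
Policy = Callable[[dict], str]


def replay(events: Iterable[dict], current: Policy, candidate: Policy) -> dict:
    """Compare two policies on history without touching production."""
    stats: Counter = Counter()
    for e in events:
        old, new = current(e), candidate(e)
        stats["total"] += 1
        if old != new:
            stats[f"{old}->{new}"] += 1          # decisions that would flip
        if new == "block" and not e["was_fraud"]:
            stats["candidate_false_positives"] += 1
        if new == "allow" and e["was_fraud"]:
            stats["candidate_missed_fraud"] += 1
    return dict(stats)


# Usage with a toy history and two amount-threshold policies.
history = [
    {"amount": 50, "was_fraud": False},
    {"amount": 500, "was_fraud": False},
    {"amount": 900, "was_fraud": True},
    {"amount": 2000, "was_fraud": True},
]
report = replay(
    history,
    current=lambda e: "block" if e["amount"] > 1000 else "allow",
    candidate=lambda e: "block" if e["amount"] > 400 else "allow",
)
```

In this toy run the candidate flips two decisions to block, catching one extra fraud at the cost of one false positive. That trade-off is what you want visible before a rule change ships, not after.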
Practical playbook (what to do in the next 7 days)
You can’t fix everything this week. You can get out of “total unknown” territory.
1. Draw the real architecture, not the pitch deck (½ day)
In one diagram (whiteboard is fine):
- List:
- Payment rails (PSPs, acquirers, bank partners, open banking).
- Fraud / risk / AML / KYC vendors.
- Internal services that touch payment state or user trust (ledger, wallets, payouts, support tools).
- For each integration, mark:
- Synchronous vs async
- What data you send/receive (card details, device, IP, identities, docs)
- Where responses are persisted.
Deliverable: one-page diagram that everyone agrees is roughly correct.
2. Identify your event capture gaps (1 day)
From that diagram:
- List all financially or regulatorily relevant events:
- Payments lifecycle, disputes, chargebacks, charge reversals.
- Onboarding events, KYC results, sanctions hits / overrides.
- Fraud / AML alerts, rule fires, case creation/resolution.
For each:
- Do we have a single, queryable, append-only representation?
- Is it:
- Complete? (all rails/vendors)
- Timely?
- Joined to customer and account IDs?
Pick 2–3 high-value events where the answer is “no” and create tickets to:
- Add event emission at the source.
- Land into your chosen store (data warehouse or event log).
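The three questions above (complete, timely, joined) can be turned into a quick check you run per event type. This is a sketch under assumptions: events are dicts with `source`, `occurred_at`, `received_at`, `customer_id`, and `account_id` keys, and the five-minute lag threshold is an arbitrary placeholder.

```python
from datetime import timedelta
from typing import Iterable, Sequence


def audit_event_type(
    events: Iterable[dict],
    expected_sources: Sequence[str],
    max_lag: timedelta = timedelta(minutes=5),
) -> dict:
    """Report completeness, timeliness, and joinability gaps for one event type."""
    events = list(events)
    seen_sources = {e["source"] for e in events}
    # Timely: how long after the source time did the event land in our store?
    late = [e for e in events if e["received_at"] - e["occurred_at"] > max_lag]
    # Joined: can we tie the event back to a customer and an account?
    unjoined = [
        e for e in events
        if not e.get("customer_id") or not e.get("account_id")
    ]
    return {
        "missing_sources": sorted(set(expected_sources) - seen_sources),
        "late_count": len(late),
        "unjoined_count": len(unjoined),
    }
```

Any non-empty `missing_sources` or non-zero count is a candidate for one of the 2–3 tickets.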
3. Baseline three metrics that actually matter (1–2 days)
Forget 40 KPIs. Start with:
- Payment acceptance rate, by:
- Rail / PSP
- Country
- Device / channel
- Fraud economics:
- Fraud loss as % of processed volume.
- Chargebacks as % of processed volume (by scheme).
- **AML/KYC
