Your RPA Bot Is a Liability: Shipping Real AI Automation in 2025

Why this matters this week
Three signals keep coming up in conversations with engineering leaders:
- Finance and ops teams are quietly turning off legacy RPA bots because maintenance cost > labor savings.
- “Copilots” have shipped to thousands of users, but measurable productivity gains are hard to prove.
- Infra and platform teams are being asked: “Can we wire this AI stuff into how work actually gets done?”
The center of gravity is shifting from chat with an AI to AI that does work end-to-end:
- Triggered by real events (invoice arrived, ticket created, build failed).
- Orchestrated across systems (CRM, ERP, GitHub, internal APIs).
- Designed for observability, rollback, and security like any other production service.
If you own production systems, the question isn’t “Should we use AI agents?” but:
- Which workflows are ready for AI automation?
- What architecture minimizes blast radius?
- How do we make this auditable enough that security and finance will sign off?
This post is about that layer: agents, workflows, orchestration, and replacing brittle RPA with something you’d actually monitor in Grafana.
What’s actually changed (not the press release)
Ignoring the marketing, three real shifts matter for production teams:
1. LLMs are finally good enough for “glue work”
- Interpreting vague human input (“renew support contract but cap spend at last year + 5%”).
- Normalizing messy data (email → structured ticket, PDF → line items).
- Choosing which tool/API to call based on context.
- This doesn’t mean “fully autonomous,” but it does mean you can automate the boring logic between systems that RPA never handled well.
2. Tooling for controlled tool-use is usable
- Modern models handle:
- Function calling / tools with structured arguments.
- Multi-step reasoning with intermediate state.
- This lets you build “narrow agents” that:
- Read: fetch context from APIs/DBs.
- Think: plan next 1–N steps.
- Act: invoke tools with validated parameters.
- You no longer need a home-grown prompt pile to make this workable.
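The read/think/act shape of a narrow agent can be sketched in a few lines. This is a minimal sketch, not a framework: `choose_step` stands in for the LLM policy (in practice a function-calling model), and `AgentStep`, `run_narrow_agent`, and the `tools` dict are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class AgentStep:
    tool: str   # which tool the policy chose
    args: dict  # validated parameters for that tool

def run_narrow_agent(context: dict, choose_step, tools: dict, max_steps: int = 5) -> list:
    """Read -> think -> act loop. `choose_step` stands in for the LLM policy:
    it inspects context + history and returns an AgentStep, or None when done."""
    history = []
    for _ in range(max_steps):
        step = choose_step(context, history)      # think: plan the next step
        if step is None:                          # policy decided it is finished
            break
        result = tools[step.tool](**step.args)    # act: invoke a validated tool
        history.append((step, result))
        context = {**context, "last_result": result}  # read: fold result back in
    return history
```

Note the hard `max_steps` cap: even at this size, the loop is bounded by construction rather than by hoping the model stops.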
3. Workflows can be treated as code, not screenshots
- Instead of UI selectors (RPA) you can:
- Call your own services directly (REST/gRPC).
- Use event buses (Kafka, SNS, Pub/Sub) as triggers.
- Store workflow definitions as versioned code (YAML/DSL/TypeScript/Python).
- Failure handling, retries, compensating actions, and observability look like any other distributed system.
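“Workflow definitions as versioned code” can be as plain as a frozen data structure that lives in your repo and goes through code review. The schema and the `invoice-triage` example below are hypothetical, a sketch of the idea rather than any specific engine's format.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class WorkflowDef:
    """A workflow definition you can diff, review, and roll back like any module."""
    name: str
    version: str
    trigger: str      # event that starts a run, e.g. an S3 prefix or Kafka topic
    steps: List[str]  # ordered step names; each maps to a handler elsewhere

INVOICE_TRIAGE = WorkflowDef(
    name="invoice-triage",
    version="2025.1",
    trigger="s3://invoices/incoming",  # hypothetical bucket
    steps=["parse_pdf", "enrich_vendor", "decide_route", "execute_action"],
)
```

Because the definition is immutable data, changing it means shipping a new version, which is exactly the audit trail screenshot-driven RPA never gave you.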
The punchline: the ROI equation for AI automation has flipped. You’re now constrained less by model capability and more by:
- Data access & security model.
- Good-enough process definition.
- Your team’s tolerance for partial automation + human review.
How it works (simple mental model)
Use this mental model: Agent = Policy + Tools + Guardrails, embedded in a workflow.
1. Policy (LLM + prompts + constraints)
- The LLM is the policy deciding:
- Which tool to call.
- With what parameters.
- When to ask a human.
- You encode:
- Scope: “You only handle invoices under $10k and US-based vendors.”
- Constraints: “Never change payment terms without explicit approval.”
2. Tools (your APIs and actions)
- Each tool is a narrow, deterministic capability:
- `get_invoice(vendor_id, invoice_id)`
- `create_ticket(summary, severity, assignee)`
- `update_payment_terms(vendor_id, net_days)`
- Tools are:
- Authenticated & authorized.
- Input-validated (types, ranges, referential integrity).
- Logged individually.
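A tool in this sense is just a strict function. The sketch below assumes a `get_invoice` tool with made-up ID formats and a canned response; the point is that the tool itself rejects malformed input before anything downstream sees it.

```python
def get_invoice(vendor_id: str, invoice_id: str) -> dict:
    """Hypothetical tool: narrow, deterministic, and strict about its inputs."""
    if not vendor_id.startswith("V-"):
        raise ValueError(f"unknown vendor id format: {vendor_id!r}")
    if not invoice_id.startswith("INV-"):
        raise ValueError(f"unknown invoice id format: {invoice_id!r}")
    # A real implementation would call an authenticated API and log the access;
    # a canned record here keeps the input/output shape visible.
    return {"vendor_id": vendor_id, "invoice_id": invoice_id, "amount": 1200.0}
```

The LLM never gets to improvise arguments: if the policy emits `vendor_id="acme"`, the call fails loudly instead of half-working.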
3. Guardrails (validation + policy checks)
- Before executing a tool:
- Schema validation (types, enums).
- Business rules (“amount <= approval_limit”).
- After execution:
- Check response against expectations.
- Optionally run an audit LLM to summarize what changed in human-readable form.
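A pre-execution guardrail is deliberately boring code: take the proposed tool call, check it against per-workflow limits, and return every violation. The `check_guardrails` function, the `approve_invoice` tool name, and the `limits` shape are all illustrative assumptions.

```python
def check_guardrails(tool: str, args: dict, limits: dict) -> list:
    """Return a list of violations; an empty list means the call may proceed.
    `limits` is a hypothetical per-workflow policy, e.g. {"approval_limit": 10000}."""
    violations = []
    if tool == "approve_invoice":
        amount = args.get("amount")
        if not isinstance(amount, (int, float)):
            violations.append("amount must be numeric")
        elif amount > limits.get("approval_limit", 0):
            violations.append(f"amount {amount} exceeds approval limit")
    return violations
```

Keeping the rules in plain code (not in the prompt) means the model can argue for an action but can never execute past the check.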
4. Workflow (orchestration & lifecycle)
A typical pattern:
- Trigger
- Event: “New invoice file uploaded to S3.”
- Or: “New Jira ticket created with label = ‘customer-complaint’.”
- Context gathering (tools)
- Extract data (OCR, parsing, classification).
- Enrich with internal data (vendor history, customer plan).
- Decision + planning (LLM)
- LLM chooses:
- “Route to Tier 1 support with suggested reply.”
- Or “Auto-approve and schedule payment.”
- Or “Escalate to finance; missing PO number.”
- Actions
- Invoke tools to:
- Update systems.
- Post comments.
- Send notifications.
- Human-in-the-loop (optional)
- For medium/high-risk actions:
- Present a summary and a diff of proposed changes.
- Log a single “approve/reject” action.
- Over time, you shrink the review scope as confidence grows.
5. Observability + control plane
- Treat an “AI-run workflow” like any other service:
- Traces: each tool call, each LLM step.
- Metrics: success rate, time-to-complete, human-intervention rate.
- Feature flags: enable/disable specific actions, per segment or per user cohort.
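Instrumenting tool calls looks like instrumenting any other service call: wrap each tool so every invocation emits a trace record and bumps a counter. A minimal sketch, with `traced` and the in-memory `METRICS`/`trace_log` standing in for your real tracing and metrics backends:

```python
import time

METRICS = {"tool_calls": 0, "tool_errors": 0}

def traced(tool_name, fn, trace_log):
    """Wrap a tool so each invocation emits a trace record and updates metrics."""
    def wrapper(**kwargs):
        start = time.monotonic()
        METRICS["tool_calls"] += 1
        try:
            result = fn(**kwargs)
            trace_log.append({"tool": tool_name, "args": kwargs, "ok": True,
                              "ms": (time.monotonic() - start) * 1000})
            return result
        except Exception:
            METRICS["tool_errors"] += 1
            trace_log.append({"tool": tool_name, "args": kwargs, "ok": False})
            raise
    return wrapper
```

The same wrapper is a natural place for feature flags: check a per-cohort flag before calling `fn` and refuse the action if it is disabled.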
This is the level at which “agents” make sense in real businesses: not as magic employees, but as policy engines controlling safe, auditable tools inside workflows.
Where teams get burned (failure modes + anti-patterns)
Patterns from real deployments:
1. “Chatbot first” instead of “workflow first”
- Anti-pattern: Building a conversational bot and then bolting on actions.
- Result: Entangled prompts, unclear responsibilities, brittle behavior.
- Better: Start from a single workflow and ask:
- What are triggers?
- What data is needed?
- What actions are permitted?
- Where can an LLM add value?
2. No blast radius control
- Example: An “AI finance assistant” allowed to edit any vendor record.
- One subtle parsing bug → dozens of incorrect payment terms.
- Fix:
- Narrow scope (e.g., only vendors in a test region).
- Hard caps (max change/day, max dollar value / run).
- Approval gates for structural changes.
3. Hidden coupling to UI and layout (RPA 2.0)
- Teams try to “AI-ify” RPA: LLMs reading web pages, clicking buttons.
- Still fragile:
- UI changes → automation breaks.
- Hard to test.
- Better:
- Use internal APIs.
- If no API exists, consider adding thin services around core operations.
4. Unbounded reasoning loops
- Unconstrained “agents” that:
- Call tools repeatedly.
- Run up token and API bills.
- Time out without clear status.
- Guardrails:
- Max steps / workflow run.
- Max cost / run (estimated tokens * price).
- Clear failure mode: “Stop and ask a human.”
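Those guardrails fit in one loop: a hard step cap, a cost cap, and an explicit “stop and ask a human” status. In this sketch `plan_next` and `execute` are stand-ins for the LLM policy and the tool layer, and `execute` is assumed to return `(result, estimated_cost_in_dollars)`.

```python
def run_with_budget(plan_next, execute, max_steps=8, max_cost=1.00):
    """Bounded agent loop: stops on completion, step cap, or cost cap.
    Never ends in an ambiguous state -- the status says exactly why it stopped."""
    spent, steps = 0.0, 0
    while steps < max_steps:
        action = plan_next(steps)
        if action is None:  # policy says the workflow is done
            return {"status": "done", "steps": steps, "cost": round(spent, 4)}
        _, cost = execute(action)
        spent += cost
        steps += 1
        if spent > max_cost:
            # Clear failure mode: stop and hand off, don't keep burning tokens.
            return {"status": "needs_human", "steps": steps, "cost": round(spent, 4)}
    return {"status": "needs_human", "steps": steps, "cost": round(spent, 4)}
```

Estimating cost per call as `tokens * price` is crude, but even a crude cap turns “the agent ran all night” into a bounded, alertable event.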
5. No alignment with actual business metrics
- “We processed 10k support tickets with AI!” But:
- CSAT unchanged.
- Resolution time flat.
- Use real KPIs:
- Time-to-resolution.
- Handle rate without human touch.
- Error/rollback rate.
- Net savings (infra + LLM cost vs. labor/time).
6. Security & compliance as an afterthought
- Common sins:
- Sending PII/PHI to external LLMs without DLP.
- Letting the agent “discover” internal systems via trial & error.
- Non-negotiables:
- Data classification: what can leave your VPC/region?
- Role-based access: agents are service accounts with least privilege.
- Explicit allowlists of tools and parameters per workflow.
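An explicit allowlist of tools and parameters per workflow is a small lookup table plus a deny-by-default check. The table contents below are hypothetical; the structure (workflow → tool → permitted parameter names) is the part that matters.

```python
WORKFLOW_TOOL_ALLOWLIST = {
    # Hypothetical: each workflow names exactly the tools and params it may use.
    "invoice-triage": {
        "get_invoice": {"vendor_id", "invoice_id"},
        "create_ticket": {"summary", "severity", "assignee"},
    },
}

def authorize_call(workflow: str, tool: str, args: dict) -> bool:
    """Deny by default: unknown workflow, unknown tool, or unexpected
    parameters all fail the check before the tool is ever invoked."""
    allowed = WORKFLOW_TOOL_ALLOWLIST.get(workflow, {})
    return tool in allowed and set(args) <= allowed[tool]
```

This is the opposite of letting the agent “discover” systems: if a tool or parameter isn't written down, the call never happens.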
Practical playbook (what to do in the next 7 days)
Assuming you’re an engineering leader with limited cycles, here’s a focused plan.
Day 1–2: Choose 1–2 realistic workflows
Filter candidates with this checklist:
- High volume, low glamour:
- Invoice triage, vendor onboarding, low-tier support, QA triage, basic HR requests.
- Rules exist but are noisy / incomplete:
- Perfect for LLM “judgment” within guardrails.
- Current process spans ≥2 systems:
- Where RPA has been brittle: PDF → email → ERP, etc.
- Failure cost is bounded:
- A bad decision is reversible (change a ticket, flag a payment, not ship a rocket).
Pick one workflow where you can measure:
- Baseline throughput and error rate.
- Time-per-case.
- Current human touch percentage.
Day 3: Model it as a state machine
Ignore AI for a moment. Write the workflow in clear steps:
- States:
`RECEIVED`, `ENRICHED`, `DECISION_NEEDED`, `APPROVED`, `EXECUTED`, `ESCALATED`.
- Transitions:
- What moves an item from one state to the next?
- What data do you need at each step?
- Identify where you need:
- Parsing (turn unstructured → structured).
- Classification (priority, category).
- Decision (route, approve, reject, request info).
These are your “LLM insertion points.”
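The state machine itself is a page of code, which is the point of doing this before touching any AI. A sketch using the states from this section, with an illustrative transition table (your real workflow may allow different moves):

```python
from enum import Enum, auto

class State(Enum):
    RECEIVED = auto()
    ENRICHED = auto()
    DECISION_NEEDED = auto()
    APPROVED = auto()
    EXECUTED = auto()
    ESCALATED = auto()

# Legal transitions only; anything else is a bug, not a judgment call.
TRANSITIONS = {
    State.RECEIVED: {State.ENRICHED},
    State.ENRICHED: {State.DECISION_NEEDED},
    State.DECISION_NEEDED: {State.APPROVED, State.ESCALATED},
    State.APPROVED: {State.EXECUTED},
}

def advance(current: State, nxt: State) -> State:
    """Move an item to its next state, rejecting illegal transitions."""
    if nxt not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt
```

The LLM insertion points then have a precise meaning: the model may propose `APPROVED` vs. `ESCALATED` at `DECISION_NEEDED`, but it cannot invent a transition the table doesn't contain.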
Day 4: Define tools and guardrails
For each action that changes state, define:
- Tool schema:
- Inputs (typed, validated).
- Outputs (success/failure, any IDs).
- Hard rules:
- If `amount > 10000` → cannot auto-approve.
- If `country not in {US, CA, UK}` → always escalate.
- Access:
- Which service account / role executes this tool?
- What logging is mandatory (who, what, when, before/after)?
Add rate limits for safety:
- Max N automated executions/day in this workflow.
- Max spend/day on LLM/API calls.
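The hard rules and rate limits above can be encoded together as one deterministic gate that runs before any automated approval. The function name, the `runs_today` counter, and the default daily cap are assumptions; the thresholds are the examples from this section.

```python
def can_auto_approve(amount: float, country: str, runs_today: int,
                     max_runs_per_day: int = 200):
    """Hard business rules encoded outside the LLM, so the model can never
    talk its way past them. Returns (allowed, reason)."""
    if runs_today >= max_runs_per_day:
        return False, "daily automation cap reached"
    if amount > 10_000:
        return False, "amount above auto-approve limit"
    if country not in {"US", "CA", "UK"}:
        return False, "country requires escalation"
    return True, "ok"
```

Returning a reason string (rather than a bare boolean) feeds directly into the mandatory logging: every blocked run records exactly which rule stopped it.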
Day 5: Wire in an LLM policy for one decision point
Pick just one decision step. Example: “Should we auto-approve this invoice?”
- Inputs to the model:
- Parsed invoice data.
- Vendor history (past late payments, dispute rate).
- Hard business rules (pass them in as structured context).
- Outputs:
decision: one of `
