Your RPA Bots Are Glass: Shipping Real AI Automation Without Burning Down Ops

Why this matters this week
The conversation has finally shifted from “let’s add a chatbot” to “can we automate actual work?”
In the last two weeks alone, I’ve seen:
- A 40‑person operations team cut ticket handling time in half with an LLM-driven workflow, without touching their core app.
- A B2B SaaS vendor quietly ship an “autonomous” onboarding copilot that completes 70% of customer setup steps end‑to‑end.
- A mid-market bank kill a year-long RPA expansion and redirect that budget into AI agents orchestrated over APIs.
None of these teams care about “agents” as a buzzword. They care that:
- Their queue is growing faster than headcount.
- Their brittle RPA bots break every time a vendor moves a button.
- Their CFO is done funding experiments that don’t move SLAs, NPS, or gross margin.
This post is about AI automation in that context: replacing fragile, click-recorded RPA with workflow-centric, API-first automation that uses LLMs where they’re strong and boxes them in where they’re not.
If you’re responsible for production systems, your questions are probably:
- Where can we safely let AI touch operations?
- What do we automate first without creating a compliance or reliability nightmare?
- How do we avoid building another unmaintainable automation zoo?
That’s what we’ll unpack.
What’s actually changed (not the press release)
Three shifts matter for real businesses this quarter:
- LLMs are finally good enough for “messy middle” work.
  Not just classification or chat, but:
  - Interpreting semi-structured emails, PDFs, support logs.
  - Mapping that to concrete actions (“update these three systems”, “request missing data”, “escalate with context”).
  - Writing precise API calls or SQL with guardrails.
  This replaces a big chunk of what RPA did by pixel-scraping UIs.
- Tooling for orchestration has grown up (a bit).
  We now have decent patterns and libraries for:
  - Tool-calling / function-calling with schemas.
  - Multi-step workflows that mix deterministic code with LLM decisions.
  - Running agents as workers in queues, not as magical long-lived “brains”.
  They’re still immature, but you no longer have to invent everything from scratch.
- Vendors expose more real APIs and webhooks.
  This is underrated. Ten years ago, RPA was necessary because critical apps had no APIs. Today:
  - Most SaaS your teams use has REST/GraphQL APIs.
  - Identity, logging, and observability have standard patterns.
  That means you can build AI automation around stable interfaces instead of DOM selectors.
Result: the default stack for automation is shifting from:
Screen-scraping robots + brittle XPath + manual exception queues
to:
Event-driven workflows + API calls + LLMs for judgment / language / glue logic
Still fragile if done poorly, but failure modes are more visible and testable.
How it works (simple mental model)
Throw away the idea of a “super agent that does everything”. Use this mental model instead:
Events → Orchestrator → Tools → Humans (for edges)
- Events: Something happens in your business systems.
  Examples:
  - New customer signs a contract (CRM event).
  - Ticket opened with specific tags (support system).
  - Invoice rejected (ERP).
  - Daily “worklist” batch generated.
  These should be structured, not scraped off screens.
- Orchestrator: A workflow engine decides what to do.
  Think:
  - “For event type X, run workflow Y.”
  - Each workflow is a DAG of steps:
    - Deterministic logic: IF/ELSE, loops, retries, timeouts.
    - LLM-powered “decision” steps where rules are fuzzy.
  This can be a dedicated workflow engine, or just a well-structured service with a queue.
- Tools: The workflow calls specific capabilities (“tools”).
  Tools are where actual work happens:
  - Read/write business APIs (CRM, billing, ticketing).
  - Look up policies or reference data.
  - Generate content (email drafts, summaries, redlines).
  - Validate outputs.
  Tools are:
  - Typed and schema’d.
  - Logged and observable.
  - Testable without an LLM in the loop.
- LLM steps: Use the model for judgment or translation, not authority.
  Use models to:
  - Extract entities and intent from messy input.
  - Choose among allowed tools, with explicit schemas.
  - Draft content that a human or validator tool checks.
  Don’t use models to:
  - Directly issue irreversible actions with no constraints.
  - Encode important business rules that no one can see.
- Human-in-the-loop for edge cases and high-risk actions.
  - Route ambiguous or high-impact steps to humans with context.
  - Let humans correct and label; feed the results back into evaluation and training.
  - Track deflection rate: the percentage of tasks completed without human intervention.
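The “LLM proposes, deterministic code disposes” split above can be sketched in a few lines. This is a minimal illustration, not a real API: the action names and the `decide_route` helper are hypothetical, and in production the JSON string would come from a function-calling response.

```python
import json

# Hypothetical allow-list of actions the workflow may take.
ALLOWED_ACTIONS = {"route_to_billing", "route_to_tech", "escalate"}

def decide_route(llm_output: str) -> str:
    """Validate an LLM's proposed action against the allow-list.

    Anything malformed or outside the allowed set fails closed to a human:
    the model suggests, but deterministic code has the final say.
    """
    try:
        choice = json.loads(llm_output).get("action")
    except json.JSONDecodeError:
        return "escalate"  # unparseable output goes to a human
    return choice if choice in ALLOWED_ACTIONS else "escalate"
```

The important property is that an unexpected model output can never become an unexpected action; it can only become an escalation.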
If you like equations:
Total automation value ≈ (Volume × (Human time saved per item – New failure cost per item)) – Platform & ops overhead
This is why “just hook up an agent to our production database” rarely pencils out.
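To make the equation concrete, here is a worked example with purely illustrative numbers (none of them come from the cases above): 5,000 items/month, 6 minutes saved per item at $30/hour, a $40 cleanup cost on 1% of items, and $3,000/month of platform and ops overhead.

```python
# Illustrative numbers only.
volume = 5000                             # items per month
human_time_saved_per_item = 6 * 30 / 60   # 6 min at $30/hr = $3.00/item
new_failure_cost_per_item = 0.01 * 40     # 1% failure rate x $40 cleanup
platform_overhead = 3000                  # monthly platform + ops cost

value = volume * (human_time_saved_per_item - new_failure_cost_per_item) - platform_overhead
# value comes out to $10,000/month under these assumptions
```

Notice how sensitive the result is to the failure term: push the failure rate from 1% to 5% and the same workflow loses most of its value, which is exactly why unconstrained database access rarely pencils out.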
Where teams get burned (failure modes + anti-patterns)
1. Treating LLMs as business-logic engines
Symptoms:
– “We’ll prompt the model with our policy doc and trust it to decide refunds.”
– No explicit conditions in code; all encoded in natural language prompts.
Failure modes:
– Silent policy violations.
– Non-deterministic decisions that are impossible to debug.
– Compliance teams blocking rollout.
Better:
– Keep policy and thresholds in config or code.
– Use the LLM to describe how a case maps to policies, but have deterministic logic decide outcomes.
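A minimal sketch of that split, with hypothetical thresholds: the policy lives in plain config, and the outcome is decided by ordinary code. The LLM would only populate the inputs (amount, reason) from the messy case text.

```python
# Policy thresholds live in config or code, not in a prompt.
REFUND_POLICY = {
    "auto_approve_max": 50.0,    # refunds up to $50: automatic
    "human_review_max": 500.0,   # $50-$500: human approval required
}

def decide_refund(amount: float, extracted_reason: str) -> str:
    """Deterministic outcome. `amount` and `extracted_reason` are the
    structured facts an upstream LLM step pulled from the case text;
    the decision itself never touches the model."""
    if amount <= REFUND_POLICY["auto_approve_max"]:
        return "auto_approve"
    if amount <= REFUND_POLICY["human_review_max"]:
        return "human_review"
    return "reject_escalate"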
2. “Stealth RPA”: Using agents to drive UIs instead of APIs
Symptoms:
– An “agent” that logs into five web apps, clicks around, and copies values.
– Selectors and layout changes constantly break flows.
– Engineers become full-time bot babysitters.
Failure modes:
– Flaky automations that die on release day.
– Security risk from shared robot accounts and stored credentials.
Better:
– Default to APIs and webhooks.
– Only use UI automation where APIs truly don’t exist, and fence it off as legacy.
3. No evaluation harness or offline tests
Symptoms:
– “We gave it 10 sample tickets, looks good, let’s enable for everyone.”
– Prompt edits pushed directly to prod with no regression check.
Failure modes:
– Gradual quality decay as prompts and models change.
– Surprise behaviors when underlying model updates.
Better:
– Maintain a test set of real, de-identified examples per workflow.
– For each change:
– Run offline evaluation: exact-match, classification accuracy, or “action correctness”.
– Review diff on notable failures.
– Track per-workflow metrics over time.
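A regression harness does not need infrastructure to start. A sketch, with a stub `classify` standing in for your real LLM-backed step and a tiny frozen test set (both hypothetical):

```python
def classify(text: str) -> str:
    """Stub for the LLM-backed classification step under test."""
    return "billing" if "invoice" in text.lower() else "tech"

# Frozen, de-identified examples sampled from real traffic.
TEST_SET = [
    {"input": "My invoice is wrong", "expected": "billing"},
    {"input": "App crashes on login", "expected": "tech"},
    {"input": "Where is invoice #123?", "expected": "billing"},
]

def run_eval(cases):
    """Return accuracy plus the failing cases for manual diff review."""
    failures = [c for c in cases if classify(c["input"]) != c["expected"]]
    accuracy = 1 - len(failures) / len(cases)
    return accuracy, failures

accuracy, failures = run_eval(TEST_SET)
```

Run this on every prompt or model change; a drop in `accuracy` or a new entry in `failures` blocks the deploy, the same way a failing unit test would.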
4. Logging just text, not decisions and tools
Symptoms:
– You only log the conversation transcript, not which APIs were called, with what args, and what failed.
Failure modes:
– Painful incident investigations (“why did it cancel the subscription?”).
– Inability to answer compliance questions about specific actions.
Better:
– Log at the action level:
– Event ID
– Workflow name + version
– Each step, whether LLM or deterministic:
– Inputs (sanitized)
– Tool calls with arguments
– Outputs and errors
– Human approvals or overrides
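One structured record per step is enough to answer “why did it do that?”. A sketch of such a record (field names are illustrative, not a standard schema):

```python
import datetime
import json

def log_step(event_id, workflow, version, step, tool_call, outcome):
    """Emit one action-level log record as a JSON line."""
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "event_id": event_id,
        "workflow": workflow,
        "workflow_version": version,
        "step": step,
        "tool_call": tool_call,   # tool name + sanitized arguments
        "outcome": outcome,       # result, error, or human override
    }
    return json.dumps(record)

line = log_step(
    "evt-42", "refund_flow", "1.3.0", "issue_refund",
    {"name": "create_credit_note", "args": {"amount": 20.0}},
    {"status": "ok"},
)
```

With records like this, “why did it cancel the subscription?” becomes a log query rather than an archaeology project.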
5. Ignoring cost and latency until the bill arrives
Symptoms:
– Every step is “ask the LLM what to do”.
– Large prompts with entire ticket histories and documents each time.
Failure modes:
– 10–100x higher cost than necessary.
– Latency so high users disable the copilot.
Better:
– Use LLMs sparingly:
– Precompute embeddings for search instead of re-sending raw docs.
– Cache intermediate interpretations where possible.
– For high-volume workflows, consider smaller or specialized models for simple subtasks.
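Caching interpretations is the cheapest of these wins. A sketch keyed on a content hash, with `extract_fields` as a stand-in for the expensive model call (all names hypothetical):

```python
import hashlib

_interpretation_cache: dict[str, dict] = {}
calls = {"llm": 0}  # counter to show how often the model is actually hit

def extract_fields(document: str) -> dict:
    """Stub for the expensive LLM extraction call."""
    return {"summary": document[:40]}

def interpret(document: str) -> dict:
    """Return the cached interpretation for unchanged documents,
    so the same ticket or PDF never hits the model twice."""
    key = hashlib.sha256(document.encode()).hexdigest()
    if key not in _interpretation_cache:
        calls["llm"] += 1
        _interpretation_cache[key] = extract_fields(document)
    return _interpretation_cache[key]
```

For long-lived workflows you would add an eviction policy and invalidate on document edits, but even this naive version removes the “re-send the whole ticket history on every step” cost pattern.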
Practical playbook (what to do in the next 7 days)
Assume you’re an engineering leader who wants impact this quarter, not next year.
1. Pick one high-ROI workflow (not a department)
Criteria:
– Repeated 1000+ times/month.
– Clear definition of “success” (e.g., ticket resolved, invoice approved).
– Input is messy (emails, attachments) but output is structured actions.
– Today dominated by copy-paste and lookups.
Examples I see working:
– Support triage and resolution drafting
Auto-classify, route, and draft responses + ticket updates.
– Customer onboarding checklist execution
Given a signed contract, set up systems, permissions, and notifications.
– Invoice exception handling
Parse vendor invoices, match to POs, flag and propose resolution steps.
Pick one. Document it as a state machine on a whiteboard.
2. Map the “happy path” and 3 top exception types
For the chosen workflow, write:
- Happy path:
  - Start event
  - 3–8 steps
  - End condition
- Top exceptions:
  - Missing data
  - Conflicting systems of record
  - Policy edge cases
Decide now where humans must always be in the loop (e.g., refunds over $X, irreversible account actions).
3. Define tools before prompts
For this workflow, design tools like you’d design internal APIs:
- For each required action, define a tool, e.g.:
  - get_ticket(id)
  - update_ticket(id, fields)
  - create_task(...)
  - fetch_customer_profile(email)
  - send_email(template_id, params)
Each tool:
– Has a strict input schema.
– Does its own validation and error classification.
– Is safe to call multiple times (idempotency where possible).
Your LLM’s job: fill these tool arguments, not decide arbitrary JSON structures from scratch.
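A sketch of one such tool, using only the standard library; the tool name, allowed fields, and return shape are all illustrative design choices, not a prescribed interface:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UpdateTicketArgs:
    """Strict input schema for a hypothetical update_ticket tool."""
    ticket_id: str
    fields: dict

# The only fields this tool is ever allowed to touch.
ALLOWED_FIELDS = {"status", "assignee", "priority"}

def update_ticket(args: UpdateTicketArgs) -> dict:
    """Validates its own input and classifies errors. Setting the same
    fields twice yields the same state, so retries are safe (idempotent)."""
    bad = set(args.fields) - ALLOWED_FIELDS
    if bad:
        return {"ok": False, "error": "invalid_fields", "detail": sorted(bad)}
    # ... call the real ticketing API here ...
    return {"ok": True, "ticket_id": args.ticket_id, "applied": args.fields}
```

Because the tool validates itself, it can be unit-tested with plain fixtures, and the LLM’s output is constrained to “fill in `UpdateTicketArgs`” rather than “emit arbitrary JSON”.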
4. Stand up a minimal orchestrator
You don’t need a full-featured platform to start. For v1:
- A queue of incoming events.
- A simple worker service that:
- Pulls an event.
- Executes a hard-coded workflow graph for that event type.
- Calls the LLM only when:
- Interpreting unstructured input.
- Choosing a branch among predefined options.
- Logs all steps.
Aim for:
– One workflow definition file.
– One tools module.
– One LLM client module (swappable model config).
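The whole v1 can be as small as a queue and a dispatch table. A toy sketch under those assumptions — the event types, the keyword-based routing (standing in for the one LLM interpretation step), and the handler names are all hypothetical:

```python
import queue

events: queue.Queue = queue.Queue()

def handle_new_ticket(event: dict) -> str:
    """Hard-coded workflow for one event type. In a real system, only
    this classification step would call the LLM; the rest is plain code."""
    category = "billing" if "invoice" in event["body"].lower() else "tech"
    return f"routed:{category}"

# One dispatch table = the "workflow definition file" for v1.
WORKFLOWS = {"ticket.created": handle_new_ticket}

def worker_loop() -> list:
    """Drain the queue, running the matching workflow per event."""
    results = []
    while not events.empty():
        event = events.get()
        handler = WORKFLOWS.get(event["type"])
        results.append(handler(event) if handler else "unhandled")
    return results

events.put({"type": "ticket.created", "body": "Invoice mismatch on order 7"})
events.put({"type": "ticket.created", "body": "Cannot log in since update"})
results = worker_loop()
```

Swapping the in-process queue for SQS/Pub/Sub and the dict for a workflow file changes nothing about the shape; that is the point of starting this small.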
5. Build an offline evaluation set before going live
From your past month of data:
- Sample 200 real
