Your RPA Bots Are Glass: Shipping Real AI Automation Without Burning Down Ops

Why this matters this week
The conversation has finally shifted from “let’s add a chatbot” to “can we automate actual work?”
In the last two weeks alone, I’ve seen:
- A 40‑person operations team cut ticket handling time in half with an LLM-driven workflow, without touching their core app.
- A B2B SaaS vendor quietly ship an “autonomous” onboarding copilot that completes 70% of customer setup steps end‑to‑end.
- A mid-market bank kill a year-long RPA expansion and redirect that budget into AI agents orchestrated over APIs.
None of these teams care about “agents” as a buzzword. They care that:
- Their queue is growing faster than headcount.
- Their brittle RPA bots break every time a vendor moves a button.
- Their CFO is done funding experiments that don’t move SLAs, NPS, or gross margin.
This post is about AI automation in that context: replacing fragile, click-recorded RPA with workflow-centric, API-first automation that uses LLMs where they’re strong and boxes them in where they’re not.
If you’re responsible for production systems, your questions are probably:
- Where can we safely let AI touch operations?
- What do we automate first without creating a compliance or reliability nightmare?
- How do we avoid building another unmaintainable automation zoo?
That’s what we’ll unpack.
What’s actually changed (not the press release)
Three shifts matter for real businesses this quarter:
- LLMs are finally good enough for “messy middle” work.
  Not just classification or chat, but:
  - Interpreting semi-structured emails, PDFs, support logs.
  - Mapping that to concrete actions (“update these three systems”, “request missing data”, “escalate with context”).
  - Writing precise API calls or SQL with guardrails.
  This replaces a big chunk of what RPA did by pixel-scraping UIs.
- Tooling for orchestration has grown up (a bit).
  We now have decent patterns and libraries for:
  - Tool-calling / function-calling with schemas.
  - Multi-step workflows that mix deterministic code with LLM decisions.
  - Running agents as workers in queues, not as magical long-lived “brains”.
  They’re still immature, but you no longer have to invent everything from scratch.
- Vendors expose more real APIs and webhooks.
  This is underrated. Ten years ago, RPA was necessary because critical apps had no APIs. Today:
  - Most SaaS your teams use has REST/GraphQL APIs.
  - Identity, logging, and observability have standard patterns.
  That means you can build AI automation around stable interfaces instead of DOM selectors.
Result: the default stack for automation is shifting from:
Screen-scraping robots + brittle XPath + manual exception queues
to:
Event-driven workflows + API calls + LLMs for judgment / language / glue logic
Still fragile if done poorly, but failure modes are more visible and testable.
How it works (simple mental model)
Throw away the idea of a “super agent that does everything”. Use this mental model instead:
Events → Orchestrator → Tools → Humans (for edges)
- Events: Something happens in your business systems.
  Examples:
  - New customer signs a contract (CRM event).
  - Ticket opened with specific tags (support system).
  - Invoice rejected (ERP).
  - Daily “worklist” batch generated.
  These should be structured, not scraped off screens.
- Orchestrator: A workflow engine decides what to do.
  Think:
  - “For event type X, run workflow Y.”
  - Each workflow is a DAG of steps:
    - Deterministic logic: IF/ELSE, loops, retries, timeouts.
    - LLM-powered “decision” steps where rules are fuzzy.
  This can be a dedicated workflow engine, or just a well-structured service with a queue.
- Tools: The workflow calls specific capabilities (“tools”).
  Tools are where actual work happens:
  - Read/write business APIs (CRM, billing, ticketing).
  - Look up policies or reference data.
  - Generate content (email drafts, summaries, redlines).
  - Validate outputs.
  Tools are:
  - Typed and schema’d.
  - Logged and observable.
  - Testable without an LLM in the loop.
- LLM steps: Use the model for judgment or translation, not authority.
  Use models to:
  - Extract entities and intent from messy input.
  - Choose among allowed tools, with explicit schemas.
  - Draft content that a human or validator tool checks.
  Don’t use models to:
  - Directly issue irreversible actions with no constraints.
  - Encode important business rules that no one can see.
- Human-in-the-loop for edge cases and high-risk actions.
  - Route ambiguous or high-impact steps to humans with context.
  - Let humans correct and label; feed the results back into evaluation and training.
  - Track deflection rate: the percentage of tasks completed without human intervention.
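The “LLM proposes, deterministic code disposes” split above can be sketched in a few lines. This is a minimal illustration, not a real API: the action names and the `decide_route` helper are hypothetical, and in production the JSON string would come from a function-calling response.

```python
import json

# Hypothetical allow-list of actions the workflow may take.
ALLOWED_ACTIONS = {"route_to_billing", "route_to_tech", "escalate"}

def decide_route(llm_output: str) -> str:
    """Validate an LLM's proposed action against the allow-list.

    Anything malformed or outside the allowed set fails closed to a human:
    the model suggests, but deterministic code has the final say.
    """
    try:
        choice = json.loads(llm_output).get("action")
    except json.JSONDecodeError:
        return "escalate"  # unparseable output goes to a human
    return choice if choice in ALLOWED_ACTIONS else "escalate"
```

The important property is that an unexpected model output can never become an unexpected action; it can only become an escalation.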
If you like equations:
Total automation value ≈ (Volume × (Human time saved per item – New failure cost per item)) – Platform & ops overhead
This is why “just hook up an agent to our production database” rarely pencils out.
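To make the equation concrete, here is a worked example with purely illustrative numbers (none of them come from the cases above): 5,000 items/month, 6 minutes saved per item at $30/hour, a $40 cleanup cost on 1% of items, and $3,000/month of platform and ops overhead.

```python
# Illustrative numbers only.
volume = 5000                             # items per month
human_time_saved_per_item = 6 * 30 / 60   # 6 min at $30/hr = $3.00/item
new_failure_cost_per_item = 0.01 * 40     # 1% failure rate x $40 cleanup
platform_overhead = 3000                  # monthly platform + ops cost

value = volume * (human_time_saved_per_item - new_failure_cost_per_item) - platform_overhead
# value comes out to $10,000/month under these assumptions
```

Notice how sensitive the result is to the failure term: push the failure rate from 1% to 5% and the same workflow loses most of its value, which is exactly why unconstrained database access rarely pencils out.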
Where teams get burned (failure modes + anti-patterns)
1. Treating LLMs as business-logic engines
Symptoms:
– “We’ll prompt the model with our policy doc and trust it to decide refunds.”
– No explicit conditions in code; all encoded in natural language prompts.
Failure modes:
– Silent policy violations.
– Non-deterministic decisions that are impossible to debug.
– Compliance teams blocking rollout.
Better:
– Keep policy and thresholds in config or code.
– Use the LLM to describe how a case maps to policies, but have deterministic logic decide outcomes.
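A minimal sketch of that split, with hypothetical thresholds: the policy lives in plain config, and the outcome is decided by ordinary code. The LLM would only populate the inputs (amount, reason) from the messy case text.

```python
# Policy thresholds live in config or code, not in a prompt.
REFUND_POLICY = {
    "auto_approve_max": 50.0,    # refunds up to $50: automatic
    "human_review_max": 500.0,   # $50-$500: human approval required
}

def decide_refund(amount: float, extracted_reason: str) -> str:
    """Deterministic outcome. `amount` and `extracted_reason` are the
    structured facts an upstream LLM step pulled from the case text;
    the decision itself never touches the model."""
    if amount <= REFUND_POLICY["auto_approve_max"]:
        return "auto_approve"
    if amount <= REFUND_POLICY["human_review_max"]:
        return "human_review"
    return "reject_escalate"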
2. “Stealth RPA”: Using agents to drive UIs instead of APIs
Symptoms:
– An “agent” that logs into five web apps, clicks around, and copies values.
– Selectors and layout changes constantly break flows.
– Engineers become full-time bot babysitters.
Failure modes:
– Flaky automations that die on release day.
– Security risk from shared robot accounts and stored credentials.
Better:
– Default to APIs and webhooks.
– Only use UI automation where APIs truly don’t exist, and fence it off as legacy.
3. No evaluation harness or offline tests
Symptoms:
– “We gave it 10 sample tickets, looks good, let’s enable for everyone.”
– Prompt edits pushed directly to prod with no regression check.
Failure modes:
– Gradual quality decay as prompts and models change.
– Surprise behaviors when underlying model updates.
Better:
– Maintain a test set of real, de-identified examples per workflow.
– For each change:
– Run offline evaluation: exact-match, classification accuracy, or “action correctness”.
– Review diff on notable failures.
– Track per-workflow metrics over time.
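A regression harness does not need infrastructure to start. A sketch, with a stub `classify` standing in for your real LLM-backed step and a tiny frozen test set (both hypothetical):

```python
def classify(text: str) -> str:
    """Stub for the LLM-backed classification step under test."""
    return "billing" if "invoice" in text.lower() else "tech"

# Frozen, de-identified examples sampled from real traffic.
TEST_SET = [
    {"input": "My invoice is wrong", "expected": "billing"},
    {"input": "App crashes on login", "expected": "tech"},
    {"input": "Where is invoice #123?", "expected": "billing"},
]

def run_eval(cases):
    """Return accuracy plus the failing cases for manual diff review."""
    failures = [c for c in cases if classify(c["input"]) != c["expected"]]
    accuracy = 1 - len(failures) / len(cases)
    return accuracy, failures

accuracy, failures = run_eval(TEST_SET)
```

Run this on every prompt or model change; a drop in `accuracy` or a new entry in `failures` blocks the deploy, the same way a failing unit test would.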
4. Logging just text, not decisions and tools
Symptoms:
– You only log the conversation transcript, not which APIs were called, with what args, and what failed.
Failure modes:
– Painful incident investigations (“why did it cancel the subscription?”).
– Inability to answer compliance questions about specific actions.
Better:
– Log at the action level:
– Event ID
– Workflow name + version
– Each step, whether LLM or deterministic:
– Inputs (sanitized)
– Tool calls with arguments
– Outputs and errors
– Human approvals or overrides
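One structured record per step is enough to answer “why did it do that?”. A sketch of such a record (field names are illustrative, not a standard schema):

```python
import datetime
import json

def log_step(event_id, workflow, version, step, tool_call, outcome):
    """Emit one action-level log record as a JSON line."""
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "event_id": event_id,
        "workflow": workflow,
        "workflow_version": version,
        "step": step,
        "tool_call": tool_call,   # tool name + sanitized arguments
        "outcome": outcome,       # result, error, or human override
    }
    return json.dumps(record)

line = log_step(
    "evt-42", "refund_flow", "1.3.0", "issue_refund",
    {"name": "create_credit_note", "args": {"amount": 20.0}},
    {"status": "ok"},
)
```

With records like this, “why did it cancel the subscription?” becomes a log query rather than an archaeology project.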
5. Ignoring cost and latency until the bill arrives
Symptoms:
– Every step is “ask the LLM what to do”.
– Large prompts with entire ticket histories and documents each time.
Failure modes:
– 10–100x higher cost than necessary.
– Latency so high users disable the copilot.
Better:
– Use LLMs sparingly:
– Precompute embeddings for search instead of re-sending raw docs.
– Cache intermediate interpretations where possible.
– For high-volume workflows, consider smaller or specialized models for simple subtasks.
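Caching interpretations is the cheapest of these wins. A sketch keyed on a content hash, with `extract_fields` as a stand-in for the expensive model call (all names hypothetical):

```python
import hashlib

_interpretation_cache: dict[str, dict] = {}
calls = {"llm": 0}  # counter to show how often the model is actually hit

def extract_fields(document: str) -> dict:
    """Stub for the expensive LLM extraction call."""
    return {"summary": document[:40]}

def interpret(document: str) -> dict:
    """Return the cached interpretation for unchanged documents,
    so the same ticket or PDF never hits the model twice."""
    key = hashlib.sha256(document.encode()).hexdigest()
    if key not in _interpretation_cache:
        calls["llm"] += 1
        _interpretation_cache[key] = extract_fields(document)
    return _interpretation_cache[key]
```

For long-lived workflows you would add an eviction policy and invalidate on document edits, but even this naive version removes the “re-send the whole ticket history on every step” cost pattern.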
Practical playbook (what to do in the next 7 days)
Assume you’re an engineering leader who wants impact this quarter, not next year.
1. Pick one high-ROI workflow (not a department)
Criteria:
– Repeated 1000+ times/month.
– Clear definition of “success” (e.g., ticket resolved, invoice approved).
– Input is messy (emails, attachments) but output is structured actions.
– Today dominated by copy-paste and lookups.
Examples I see working:
– Support triage and resolution drafting
Auto-classify, route, and draft responses + ticket updates.
– Customer onboarding checklist execution
Given a signed contract, set up systems, permissions, and notifications.
– Invoice exception handling
Parse vendor invoices, match to POs, flag and propose resolution steps.
Pick one. Document it as a state machine on a whiteboard.
2. Map the “happy path” and 3 top exception types
For the chosen workflow, write:
- Happy path:
  - Start event
  - 3–8 steps
  - End condition
- Top exceptions:
  - Missing data
  - Conflicting systems of record
  - Policy edge cases
Decide now where humans must always be in the loop (e.g., refunds over $X, irreversible account actions).
3. Define tools before prompts
For this workflow, design tools like you’d design internal APIs:
- For each required action, define a tool, e.g.:
  - get_ticket(id)
  - update_ticket(id, fields)
  - create_task(...)
  - fetch_customer_profile(email)
  - send_email(template_id, params)
Each tool:
– Has a strict input schema.
– Does its own validation and error classification.
– Is safe to call multiple times (idempotency where possible).
Your LLM’s job: fill these tool arguments, not decide arbitrary JSON structures from scratch.
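A sketch of one such tool, using only the standard library; the tool name, allowed fields, and return shape are all illustrative design choices, not a prescribed interface:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UpdateTicketArgs:
    """Strict input schema for a hypothetical update_ticket tool."""
    ticket_id: str
    fields: dict

# The only fields this tool is ever allowed to touch.
ALLOWED_FIELDS = {"status", "assignee", "priority"}

def update_ticket(args: UpdateTicketArgs) -> dict:
    """Validates its own input and classifies errors. Setting the same
    fields twice yields the same state, so retries are safe (idempotent)."""
    bad = set(args.fields) - ALLOWED_FIELDS
    if bad:
        return {"ok": False, "error": "invalid_fields", "detail": sorted(bad)}
    # ... call the real ticketing API here ...
    return {"ok": True, "ticket_id": args.ticket_id, "applied": args.fields}
```

Because the tool validates itself, it can be unit-tested with plain fixtures, and the LLM’s output is constrained to “fill in `UpdateTicketArgs`” rather than “emit arbitrary JSON”.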
4. Stand up a minimal orchestrator
You don’t need a full-featured platform to start. For v1:
- A queue of incoming events.
- A simple worker service that:
- Pulls an event.
- Executes a hard-coded workflow graph for that event type.
- Calls the LLM only when:
- Interpreting unstructured input.
- Choosing a branch among predefined options.
- Logs all steps.
Aim for:
– One workflow definition file.
– One tools module.
– One LLM client module (swappable model config).
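The whole v1 can be as small as a queue and a dispatch table. A toy sketch under those assumptions — the event types, the keyword-based routing (standing in for the one LLM interpretation step), and the handler names are all hypothetical:

```python
import queue

events: queue.Queue = queue.Queue()

def handle_new_ticket(event: dict) -> str:
    """Hard-coded workflow for one event type. In a real system, only
    this classification step would call the LLM; the rest is plain code."""
    category = "billing" if "invoice" in event["body"].lower() else "tech"
    return f"routed:{category}"

# One dispatch table = the "workflow definition file" for v1.
WORKFLOWS = {"ticket.created": handle_new_ticket}

def worker_loop() -> list:
    """Drain the queue, running the matching workflow per event."""
    results = []
    while not events.empty():
        event = events.get()
        handler = WORKFLOWS.get(event["type"])
        results.append(handler(event) if handler else "unhandled")
    return results

events.put({"type": "ticket.created", "body": "Invoice mismatch on order 7"})
events.put({"type": "ticket.created", "body": "Cannot log in since update"})
results = worker_loop()
```

Swapping the in-process queue for SQS/Pub/Sub and the dict for a workflow file changes nothing about the shape; that is the point of starting this small.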
5. Build an offline evaluation set before going live
From your past month of data:
- Sample 200 real
