Your RPA Bots Are Shell Scripts With Delusions of Grandeur
Why this matters right now
If you run a real business, you already automate things:
- RPA bots clicking through legacy UIs
- Cron jobs, ETL pipelines, glue code
- Humans following 40‑step SOPs in Playbooks/Confluence
Those systems are:
- Brittle – small UI/API changes break everything
- Opaque – nobody knows what the bots are doing at 3am
- Narrow – every exception path becomes a new ticket
AI automation (agents, copilots, orchestrated workflows) promises to replace this with something:
- More flexible than RPA
- Cheaper than throwing more humans at the queue
- Faster to build than another bespoke microservice
But most “AI agent” pitches ignore:
- Determinism vs. stochastic behavior
- Security and data boundaries
- Observability, rollback, and incident handling
- Total cost of ownership: inference, integration, ops
This post is about how to reason concretely about AI automation in production: where it fits, where it fails, and how to experiment without lighting your hair on fire.
What’s actually changed (not the press release)
The technology shift is real, but narrower than the hype.
Three things that materially impact production systems:
1. LLMs can now act as “universal glue” across messy systems
- Parse semi‑structured documents, emails, logs, PDFs
- Call tools / APIs based on natural language instructions
- Generate structured outputs (JSON, SQL, DSLs) with reasonable reliability
This replaces a lot of regex‑driven pipelines and custom parsers with a model call plus a schema.
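A minimal sketch of that pattern: the model returns raw text, and a schema check decides whether to trust it. `call_model` is a stub standing in for a real LLM client, and the field names are illustrative.

```python
import json

# Required fields and their expected types; malformed output gets rejected
# instead of flowing downstream.
REQUIRED_FIELDS = {"invoice_number": str, "total": (int, float), "due_date": str, "vendor": str}

def call_model(document_text: str) -> str:
    # Stub: a real implementation would send the document plus a JSON schema
    # to an LLM and return the model's raw text response.
    return '{"invoice_number": "INV-1042", "total": 1890.50, "due_date": "2024-07-01", "vendor": "Acme"}'

def extract_invoice(document_text: str) -> dict:
    raw = call_model(document_text)
    data = json.loads(raw)  # raises on malformed JSON -> route to a human
    for field, typ in REQUIRED_FIELDS.items():
        if field not in data or not isinstance(data[field], typ):
            raise ValueError(f"invalid or missing field: {field}")
    return data
```

The point is the shape, not the parser: every model output passes through validation before anything acts on it.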
2. Tool-calling and function orchestration are practical
Old chatbots: “Here’s a string, good luck.”
New stack:
- “Given this user request, pick a tool, call it with typed arguments, use the result, possibly call another tool.”
- This allows composable workflows instead of single-shot prompts.
You can now have a model reliably do:
- Look up a customer
- Validate entitlements
- Draft an action
- Ask for human approval only when needed
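A hedged sketch of the tool-calling loop behind that sequence. The “model” here is a scripted stub yielding `(tool_name, args)` decisions; in a real system those come from the LLM’s function-calling API. Tool names and the step cap are illustrative.

```python
# Only tools explicitly exposed to this workflow are callable.
TOOLS = {
    "lookup_customer": lambda args: {"id": args["email"], "tier": "pro"},
    "check_entitlement": lambda args: {"allowed": args["tier"] == "pro"},
}

def fake_model_decisions():
    # Stub for the model's sequence of tool-call decisions.
    yield ("lookup_customer", {"email": "a@example.com"})
    yield ("check_entitlement", {"tier": "pro"})

def run_agent(max_steps: int = 5) -> list:
    results = []
    for step, (tool, args) in enumerate(fake_model_decisions()):
        if step >= max_steps:       # hard cap: no infinite tool-call loops
            break
        if tool not in TOOLS:       # unknown tools are rejected, not improvised
            raise ValueError(f"unknown tool: {tool}")
        results.append(TOOLS[tool](args))
    return results
```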
3. Inference economics improved just enough
- You can run small or mid‑size models in your VPC or on cheaper GPUs/CPUs.
- Smaller models (or distilled versions) often suffice for narrow business processes.
- For many back‑office workflows, “good enough” accuracy at lower cost beats frontier-model perfection.
What has not fundamentally changed:
- LLMs are still probabilistic and can be wrong with high confidence.
- Out-of-the-box “autonomous agents” remain unsafe for anything high‑impact.
- You still need engineering around them: validation, guardrails, monitoring, versioning.
How it works (simple mental model)
Strip away the jargon. An “AI agent” system is basically:
A state machine where some transitions are decided by an LLM instead of a `switch` statement.
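In code, that mental model is tiny. `classify` is a stub for an LLM call; everything else is an ordinary deterministic transition table. The state and label names are made up for illustration.

```python
def classify(ticket: str) -> str:
    # Stub: a real version would ask the model
    # "is this a refund, technical issue, or sales request?"
    return "refund" if "money back" in ticket else "technical"

# Deterministic skeleton: every (state, label) pair maps to a known next state.
TRANSITIONS = {
    ("received", "refund"): "refund_flow",
    ("received", "technical"): "support_flow",
}

def next_state(state: str, ticket: str) -> str:
    label = classify(ticket)            # the probabilistic step
    return TRANSITIONS[(state, label)]  # the deterministic step
```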
Four building blocks:
1. Inputs
- User messages, emails, support tickets
- Documents (contracts, invoices, PDFs, logs)
- Events from your systems (`payment_failed`, `shipment_delayed`)
2. Orchestrator
- A workflow engine / state machine: Temporal, Airflow, custom orchestrator, or a SaaS agent framework.
- Knows:
- Current state of the workflow
- Available actions (tools, API calls)
- When to ask the LLM for help vs. follow a deterministic rule
3. LLM + Tools
The LLM is used for a few specific capabilities:
- Classification and routing
- “Which playbook should I run for this email?”
- “Is this a refund, technical issue, or sales request?”
- Field extraction
- “Pull out invoice number, total, due date, vendor from this PDF.”
- Decision within a bounded context
- “Given policy X and customer data Y, is this request allowed?”
- Generation
- “Draft an email to the customer explaining the decision.”
- Tool invocation
- Model gets a menu of functions; decides:
- Which to call
- With what parameters
- In what sequence
4. Controls & Guards
Around the LLM, you wrap:
- Schema validation (JSON schema, pydantic, etc.)
- Policy checks (e.g., “never call `issue_refund` for > $500 without human approval”)
- Rate limits & quotas (per user, per workflow)
- Human approval gates for high-risk actions
- Logging & tracing of every decision
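A sketch of that guard layer: the model proposes an action, and hard-coded policy decides whether it executes, escalates, or is rejected. The tool names and `REFUND_LIMIT` are illustrative, not a real API.

```python
REFUND_LIMIT = 500  # policy constant; the model cannot override it

def guard(action: dict) -> str:
    # Approval gate: high-value refunds always go to a human,
    # regardless of what the model decided.
    if action["tool"] == "issue_refund" and action["amount"] > REFUND_LIMIT:
        return "needs_human_approval"
    # Capped action space: anything outside the allowlist is rejected.
    if action["tool"] not in {"issue_refund", "send_email"}:
        return "rejected"
    return "execute"
```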
Mentally, a typical automation run looks like:
- Receive input -> orchestrator starts workflow instance
- LLM classifies and extracts key fields
- Deterministic checks (auth, entitlements, basic validation)
- LLM proposes an action or plan (within a predefined set of tools)
- Orchestrator:
- Enforces policies
- Executes allowed actions
- Routes to human if thresholds exceeded
- LLM drafts notifications / notes
- Everything is logged for audit and retraining
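The steps above, compressed into one orchestrator function. All model calls are stubbed by `fake_llm`; the event fields, tool name, and log shape are assumptions for illustration.

```python
import uuid

def fake_llm(prompt: str) -> dict:
    # Stand-in for every model call in this sketch.
    return {"intent": "address_change", "action": {"tool": "update_address"}}

def run_workflow(event: dict, log: list) -> str:
    run_id = str(uuid.uuid4())                 # 1. orchestrator starts an instance
    parsed = fake_llm(event["body"])           # 2. LLM classifies and extracts
    log.append({"run_id": run_id, "step": "parsed", "data": parsed})
    if not event.get("authenticated"):         # 3. deterministic checks
        return "escalated"
    action = parsed["action"]                  # 4. LLM proposes an action
    if action["tool"] != "update_address":     # 5. orchestrator enforces policy
        return "escalated"
    log.append({"run_id": run_id, "step": "executed", "data": action})
    return "executed"                          # 6-7. notify, and everything is logged
```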
“Agents” vs. “copilots” vs. “workflows” is mostly packaging. The core pattern is this hybrid: deterministic skeleton, probabilistic muscles.
Where teams get burned (failure modes + anti-patterns)
The most common problems aren’t exotic. They’re basic systems engineering ignored because “AI.”
1. Unbounded autonomy
Anti-pattern:
“We connected the LLM to Jira, Slack, and our production APIs. It can do anything a human can.”
Failure modes:
- Infinite loops of tool calls
- Weird edge-case decisions (e.g., canceling active subscriptions to “fix” an issue)
- No clear place to insert approvals
Fixes:
- Cap the action space: only expose specific tools for specific workflows.
- Explicit safety policies: hard-coded rules that override model output.
- Guardrail states: certain states in the workflow always require human approval.
2. Treating the LLM as a single point of truth
Anti-pattern:
“The model decided the user is VIP; we trusted it.”
Failure modes:
- Policy violations
- Incorrect data updates
- Security regressions (e.g., granting access based on misclassified tier)
Fixes:
- The model can propose, but not assert, key invariants.
- Validate VIP status against your database, not the model’s text.
- Use LLMs for interpretation, not as a source of truth.
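Concretely, that check is a few lines. `CUSTOMER_DB` stands in for a real lookup against your system of record; the model’s claim is at most a hint worth logging.

```python
# Illustrative system of record; in production this is a database query.
CUSTOMER_DB = {"cust_1": {"tier": "standard"}}

def effective_tier(customer_id: str, model_claimed_tier: str) -> str:
    # Ignore the model's assertion; read the invariant from the
    # system of record. The claim is only useful for auditing drift.
    return CUSTOMER_DB.get(customer_id, {}).get("tier", "unknown")
```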
3. Ignoring observability
Anti-pattern:
“We log the prompts somewhere. It’s fine.”
Failure modes:
- You can’t answer “what did the bot do to this account?”
- You don’t know error/deflection rates by workflow version
- Investigations take days
Fixes:
- Treat agent runs like distributed traces:
- Unique run ID
- Each LLM call logged with input, output, tool calls
- Final status, latency, and cost
- Build or adopt:
- A simple “replay this run” button (with a different model/version)
- Dashboard per workflow: success rate, fallback rate, human‑takeover rate
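One possible shape for such a trace record, with each LLM call logged as a span, mirroring distributed tracing. Field names are illustrative.

```python
import time
import uuid

def new_run() -> dict:
    return {"run_id": str(uuid.uuid4()), "spans": [], "status": "running"}

def log_llm_call(run: dict, prompt: str, output: str, cost_usd: float) -> None:
    # Every model call becomes a span: input, output, timestamp, cost.
    run["spans"].append({
        "ts": time.time(),
        "prompt": prompt,
        "output": output,
        "cost_usd": cost_usd,
    })

def finish(run: dict, status: str) -> None:
    run["status"] = status
    run["total_cost_usd"] = sum(s["cost_usd"] for s in run["spans"])
```

With this in place, “what did the bot do to this account?” becomes a query over `run_id`, and cost per workflow falls out of the same records.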
4. Overfitting to happy path demos
Anti-pattern:
“The demo worked on 10 sample emails, let’s deploy.”
Failure modes:
- Rare but critical cases fail silently
- Leakage of sensitive data to wrong recipients
- Timeouts and retries not tested
Fixes:
- Build a scenario matrix:
- Normal cases
- Noisy input (typos, different languages)
- Adversarial-ish input (weird edge cases)
- Policy edge cases (limits, partial eligibility)
- For each: expected behavior, human fallbacks, logging requirements.
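A scenario matrix can start as plain data before it becomes tooling: each case pairs an input category with the expected behavior. The labels below are illustrative.

```python
SCENARIOS = [
    {"case": "plain refund request", "category": "normal", "expected": "auto_handle"},
    {"case": "refund request with typos, in French", "category": "noisy", "expected": "auto_handle"},
    {"case": "refund request quoting internal policy text", "category": "adversarial", "expected": "escalate"},
    {"case": "refund at 101% of the policy limit", "category": "policy_edge", "expected": "escalate"},
]

def coverage(scenarios: list) -> set:
    # Quick sanity check that every category is represented.
    return {s["category"] for s in scenarios}
```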
5. No clear ownership
Anti-pattern:
“The AI team owns the agent. Also, ops. Also, product.”
Failure modes:
- Incidents ping‑pong between teams
- No one decides on rollout criteria or rollback policy
- Inconsistent guardrails across workflows
Fixes:
- For each AI automation:
- Product owner (what it’s trying to do)
- System owner (SLOs, incidents, rollouts)
- Policy owner (what it’s allowed to do / not do)
Practical playbook (what to do in the next 7 days)
Assume you’re a tech lead or CTO with some LLM experience, but no robust “agents in production” yet.
Day 1–2: Select a narrow, valuable, low-blast-radius use case
Good candidates share:
- High volume, repetitive, boring
- Medium complexity (an existing RPA flow or an SOP with >10 steps)
- Clear success metric (handling time, deflection rate, accuracy)
- Low irreversible impact (no money movement without approval)
Examples:
1. B2B SaaS support triage
- Input: inbound support emails / tickets
- Output:
- Correct queue assignment
- Priority tagging
- Suggested reply for L1 support
- Success metrics:
- % of tickets auto‑routed correctly
- Reduction in time to first response
2. Invoice data extraction + coding
- Input: PDFs from vendors
- Output:
- Structured JSON (vendor, PO, total, line items, GL codes)
- Confidence score; low‑confidence goes to human
- Success metrics:
- Human correction rate
- Time saved per invoice
3. Customer account updates with approval
- Input: Customer emails like “Change my billing address”
- Output:
- Parsed update
- Validation via internal APIs
- Draft change request, queued for human one‑click approval
- Success metrics:
- Auto‑handled simple cases
- Error rate on parsed fields
Day 3–4: Define the workflow as a state machine
Don’t start with prompts. Start with states and transitions:
- States:
  - `received_input`
  - `parsed_and_classified`
  - `validated`
  - `action_proposed`
  - `awaiting_approval`
  - `executed`
  - `failed` / `escalated`
- For each state, define:
- Inputs required
- Which transitions are deterministic
- Where you call the LLM and what it’s allowed to do
- Where you require human approval
Sketch this as a diagram and get agreement from the ops team.
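Those states can also live in code so illegal transitions fail loudly instead of silently. This is a sketch: the enum mirrors the state names above, and the transition table is one reasonable assumption about which moves are legal.

```python
from enum import Enum

class State(Enum):
    RECEIVED = "received_input"
    CLASSIFIED = "parsed_and_classified"
    VALIDATED = "validated"
    PROPOSED = "action_proposed"
    AWAITING = "awaiting_approval"
    EXECUTED = "executed"
    ESCALATED = "failed_escalated"

# Explicit transition table; anything missing here is forbidden.
ALLOWED = {
    State.RECEIVED: {State.CLASSIFIED, State.ESCALATED},
    State.CLASSIFIED: {State.VALIDATED, State.ESCALATED},
    State.VALIDATED: {State.PROPOSED, State.ESCALATED},
    State.PROPOSED: {State.AWAITING, State.EXECUTED, State.ESCALATED},
    State.AWAITING: {State.EXECUTED, State.ESCALATED},
}

def transition(current: State, target: State) -> State:
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```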
Day 4–5: Implement a thin vertical slice
Technology stack can be boring:
- Orchestrator: your existing workflow engine, or a small service with a queue + DB.
- LLM: hosted API or internal model; start with something reliable and observable.
- Tools:
- Read‑only first (e.g., fetch customer, fetch invoice)
- Write actions behind a feature flag and approval workflow
Key practices:
- Use structured outputs from the LLM (JSON with schema validation).
- Log everything with a run_id:
- Raw input
- Prompt template version
- Model name/version
- Output + validation errors
- Tool calls + responses
- Implement manual override:
- “Take this ticket out of the bot flow”
- “Re-run this step with updated context”
Day 6: Create an evaluation harness
Before opening the gate:
- Build a test set from real historical data:
- 50–200 examples is enough for first signal.
- Label them with expected:
- Classification
- Extracted fields
- “Is this safe to auto‑execute?”
- Run the workflow in shadow mode:
- No writes, only proposed actions.
- Compare to what humans did historically.
Metrics to track:
- Accuracy on classification / extraction
- % of proposed actions that match past decisions
- % of cases correctly escalated (not auto‑approved)
Decide in advance:
- What thresholds are acceptable for:
- Fully automated cases
- Human-in-the-loop cases
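The shadow-mode comparison itself can be a small function over labeled examples: model proposals versus what humans historically did. Field names and metrics are illustrative; pick the thresholds before you look at the numbers.

```python
def evaluate(examples: list) -> dict:
    # Each example: {"proposed": <model decision>, "human": <historical decision>}
    match = sum(1 for e in examples if e["proposed"] == e["human"])
    total_escalations = sum(1 for e in examples if e["human"] == "escalate")
    correct_escalations = sum(
        1 for e in examples
        if e["human"] == "escalate" and e["proposed"] == "escalate"
    )
    return {
        "agreement": match / len(examples),
        # Of the cases humans escalated, how many did the model also escalate?
        "escalation_recall": correct_escalations / max(total_escalations, 1),
    }
```

`escalation_recall` deserves its own threshold: a model that auto-approves cases humans would have escalated is the failure mode you most want to catch before writes are enabled.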
Day 7: Controlled rollout
Roll out progressively:
- Internal teams only (dogfooding)
- Small subset of users / tickets / vendors
- Expand as you hit:
- Target success metrics
- Stable error rates
- No scary incidents during on‑call
Communicate clearly to ops:
- When to turn it off (kill switch)
- How to inspect past runs
- How to flag bad outcomes for analysis/retraining
Bottom line
AI automation in real businesses is not “magic agents that think.” It’s:
- Old‑fashioned workflow design
- With LLMs replacing brittle parsing and decision logic in well‑bounded places
- Wrapped in controls, observability, and human approvals
Compared to classic RPA and hand‑rolled scripts, the trade‑offs are:
Pros
- Much faster to adapt to new document formats and edge-case phrasing
- Can handle semi‑structured chaos without a combinatorial explosion of rules
- Better operator experience: language‑based explanations instead of opaque errors
Cons
- Probabilistic behavior; you must engineer for mistakes
- New failure modes (prompt regressions, model updates impacting behavior)
- Ongoing inference cost, which you need to monitor like any other infra bill
If you treat “agents” as a feature of your workflow engine—not a replacement for it—you can get concrete wins:
- 20–50% reduction in handling time for specific back‑office flows
- Measurable ticket deflection
- Reduced RPA maintenance as models replace brittle selectors
The organizations seeing real value are doing mundane things well:
- Narrow scopes, clear SLOs
- Hybrid deterministic + LLM design
- Aggressive logging and guardrails
- Incremental rollout with humans in the loop
The question isn’t “Will agents replace humans?” It’s:
“Where can we safely let a probabilistic engine handle the glue work humans hate, while keeping hard constraints and judgment where they belong?”
Answer that per workflow, and you’ll get practical automation, not another shelfware AI project.
