Your RPA Bots Are Shell Scripts With Delusions of Grandeur
Why this matters right now
If you run a real business, you already automate things:
- RPA bots clicking through legacy UIs
- Cron jobs, ETL pipelines, glue code
- Humans following 40‑step SOPs in Playbooks/Confluence
Those systems are:
- Brittle – small UI/API changes break everything
- Opaque – nobody knows what the bots are doing at 3am
- Narrow – every exception path becomes a new ticket
AI automation (agents, copilots, orchestrated workflows) promises to replace this with something:
- More flexible than RPA
- Cheaper than throwing more humans at the queue
- Faster to build than another bespoke microservice
But most “AI agent” pitches ignore:
- Determinism vs. stochastic behavior
- Security and data boundaries
- Observability, rollback, and incident handling
- Total cost of ownership: inference, integration, ops
This post is about how to reason concretely about AI automation in production: where it fits, where it fails, and how to experiment without lighting your hair on fire.
What’s actually changed (not the press release)
The technology shift is real, but narrower than the hype.
Three things that materially impact production systems:
1. LLMs can now act as “universal glue” across messy systems
- Parse semi‑structured documents, emails, logs, PDFs
- Call tools / APIs based on natural language instructions
- Generate structured outputs (JSON, SQL, DSLs) with reasonable reliability
This replaces a lot of regex‑driven pipelines and custom parsers with a model call plus a schema.
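A minimal sketch of that pattern: the model returns raw text, and a schema check decides whether to trust it. `call_model` is a stub standing in for a real LLM client, and the field names are illustrative.

```python
import json

# Required fields and their expected types; malformed output gets rejected
# instead of flowing downstream.
REQUIRED_FIELDS = {"invoice_number": str, "total": (int, float), "due_date": str, "vendor": str}

def call_model(document_text: str) -> str:
    # Stub: a real implementation would send the document plus a JSON schema
    # to an LLM and return the model's raw text response.
    return '{"invoice_number": "INV-1042", "total": 1890.50, "due_date": "2024-07-01", "vendor": "Acme"}'

def extract_invoice(document_text: str) -> dict:
    raw = call_model(document_text)
    data = json.loads(raw)  # raises on malformed JSON -> route to a human
    for field, typ in REQUIRED_FIELDS.items():
        if field not in data or not isinstance(data[field], typ):
            raise ValueError(f"invalid or missing field: {field}")
    return data
```

The point is the shape, not the parser: every model output passes through validation before anything acts on it.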
2. Tool-calling and function orchestration are practical
Old chatbots: “Here’s a string, good luck.”
New stack:
- “Given this user request, pick a tool, call it with typed arguments, use the result, possibly call another tool.”
- This allows composable workflows instead of single-shot prompts.
You can now have a model reliably do:
- Look up a customer
- Validate entitlements
- Draft an action
- Ask for human approval only when needed
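A hedged sketch of the tool-calling loop behind that sequence. The “model” here is a scripted stub yielding `(tool_name, args)` decisions; in a real system those come from the LLM’s function-calling API. Tool names and the step cap are illustrative.

```python
# Only tools explicitly exposed to this workflow are callable.
TOOLS = {
    "lookup_customer": lambda args: {"id": args["email"], "tier": "pro"},
    "check_entitlement": lambda args: {"allowed": args["tier"] == "pro"},
}

def fake_model_decisions():
    # Stub for the model's sequence of tool-call decisions.
    yield ("lookup_customer", {"email": "a@example.com"})
    yield ("check_entitlement", {"tier": "pro"})

def run_agent(max_steps: int = 5) -> list:
    results = []
    for step, (tool, args) in enumerate(fake_model_decisions()):
        if step >= max_steps:       # hard cap: no infinite tool-call loops
            break
        if tool not in TOOLS:       # unknown tools are rejected, not improvised
            raise ValueError(f"unknown tool: {tool}")
        results.append(TOOLS[tool](args))
    return results
```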
3. Inference economics improved just enough
- You can run small or mid‑size models in your VPC or on cheaper GPUs/CPUs.
- Smaller models (or distilled versions) often suffice for narrow business processes.
- For many back‑office workflows, “good enough” accuracy at lower cost beats frontier-model perfection.
What has not fundamentally changed:
- LLMs are still probabilistic and can be wrong with high confidence.
- Out-of-the-box “autonomous agents” remain unsafe for anything high‑impact.
- You still need engineering around them: validation, guardrails, monitoring, versioning.
How it works (simple mental model)
Strip away the jargon. An “AI agent” system is basically:
A state machine where some transitions are decided by an LLM instead of a `switch` statement.
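In code, that mental model is tiny. `classify` is a stub for an LLM call; everything else is an ordinary deterministic transition table. The state and label names are made up for illustration.

```python
def classify(ticket: str) -> str:
    # Stub: a real version would ask the model
    # "is this a refund, technical issue, or sales request?"
    return "refund" if "money back" in ticket else "technical"

# Deterministic skeleton: every (state, label) pair maps to a known next state.
TRANSITIONS = {
    ("received", "refund"): "refund_flow",
    ("received", "technical"): "support_flow",
}

def next_state(state: str, ticket: str) -> str:
    label = classify(ticket)            # the probabilistic step
    return TRANSITIONS[(state, label)]  # the deterministic step
```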
Four building blocks:
1. Inputs
- User messages, emails, support tickets
- Documents (contracts, invoices, PDFs, logs)
- Events from your systems (`payment_failed`, `shipment_delayed`)
2. Orchestrator
- A workflow engine / state machine: Temporal, Airflow, custom orchestrator, or a SaaS agent framework.
- Knows:
- Current state of the workflow
- Available actions (tools, API calls)
- When to ask the LLM for help vs. follow a deterministic rule
3. LLM + Tools
The LLM is used for a few specific capabilities:
- Classification and routing
- “Which playbook should I run for this email?”
- “Is this a refund, technical issue, or sales request?”
- Field extraction
- “Pull out invoice number, total, due date, vendor from this PDF.”
- Decision within a bounded context
- “Given policy X and customer data Y, is this request allowed?”
- Generation
- “Draft an email to the customer explaining the decision.”
- Tool invocation
- Model gets a menu of functions; decides:
- Which to call
- With what parameters
- In what sequence
4. Controls & Guards
Around the LLM, you wrap:
- Schema validation (JSON schema, pydantic, etc.)
- Policy checks (e.g., “never call `issue_refund` for > $500 without human approval”)
- Rate limits & quotas (per user, per workflow)
- Human approval gates for high-risk actions
- Logging & tracing of every decision
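A sketch of that guard layer: the model proposes an action, and hard-coded policy decides whether it executes, escalates, or is rejected. The tool names and `REFUND_LIMIT` are illustrative, not a real API.

```python
REFUND_LIMIT = 500  # policy constant; the model cannot override it

def guard(action: dict) -> str:
    # Approval gate: high-value refunds always go to a human,
    # regardless of what the model decided.
    if action["tool"] == "issue_refund" and action["amount"] > REFUND_LIMIT:
        return "needs_human_approval"
    # Capped action space: anything outside the allowlist is rejected.
    if action["tool"] not in {"issue_refund", "send_email"}:
        return "rejected"
    return "execute"
```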
Mentally, a typical automation run looks like:
- Receive input -> orchestrator starts workflow instance
- LLM classifies and extracts key fields
- Deterministic checks (auth, entitlements, basic validation)
- LLM proposes an action or plan (within a predefined set of tools)
- Orchestrator:
- Enforces policies
- Executes allowed actions
- Routes to human if thresholds exceeded
- LLM drafts notifications / notes
- Everything is logged for audit and retraining
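The steps above, compressed into one orchestrator function. All model calls are stubbed by `fake_llm`; the event fields, tool name, and log shape are assumptions for illustration.

```python
import uuid

def fake_llm(prompt: str) -> dict:
    # Stand-in for every model call in this sketch.
    return {"intent": "address_change", "action": {"tool": "update_address"}}

def run_workflow(event: dict, log: list) -> str:
    run_id = str(uuid.uuid4())                 # 1. orchestrator starts an instance
    parsed = fake_llm(event["body"])           # 2. LLM classifies and extracts
    log.append({"run_id": run_id, "step": "parsed", "data": parsed})
    if not event.get("authenticated"):         # 3. deterministic checks
        return "escalated"
    action = parsed["action"]                  # 4. LLM proposes an action
    if action["tool"] != "update_address":     # 5. orchestrator enforces policy
        return "escalated"
    log.append({"run_id": run_id, "step": "executed", "data": action})
    return "executed"                          # 6-7. notify, and everything is logged
```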
“Agents” vs. “copilots” vs. “workflows” is mostly packaging. The core pattern is this hybrid: deterministic skeleton, probabilistic muscles.
Where teams get burned (failure modes + anti-patterns)
The most common problems aren’t exotic. They’re basic systems engineering ignored because “AI.”
1. Unbounded autonomy
Anti-pattern:
“We connected the LLM to Jira, Slack, and our production APIs. It can do anything a human can.”
Failure modes:
- Infinite loops of tool calls
- Weird edge-case decisions (e.g., canceling active subscriptions to “fix” an issue)
- No clear place to insert approvals
Fixes:
- Cap the action space: only expose specific tools for specific workflows.
- Explicit safety policies: hard-coded rules that override model output.
- Guardrail states: certain states in the workflow always require human approval.
2. Treating the LLM as a single point of truth
Anti-pattern:
“The model decided the user is VIP; we trusted it.”
Failure modes:
- Policy violations
- Incorrect data updates
- Security regressions (e.g., granting access based on misclassified tier)
Fixes:
- The model can propose, but not assert, key invariants.
- Validate VIP status against your database, not the model’s text.
- Use LLMs for interpretation, not as a source of truth.
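Concretely, that check is a few lines. `CUSTOMER_DB` stands in for a real lookup against your system of record; the model’s claim is at most a hint worth logging.

```python
# Illustrative system of record; in production this is a database query.
CUSTOMER_DB = {"cust_1": {"tier": "standard"}}

def effective_tier(customer_id: str, model_claimed_tier: str) -> str:
    # Ignore the model's assertion; read the invariant from the
    # system of record. The claim is only useful for auditing drift.
    return CUSTOMER_DB.get(customer_id, {}).get("tier", "unknown")
```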
3. Ignoring observability
Anti-pattern:
“We log the prompts somewhere. It’s fine.”
Failure modes:
- You can’t answer “what did the bot do to this account?”
- You don’t know error/deflection rates by workflow version
- Investigations take days
Fixes:
- Treat agent runs like distributed traces:
- Unique run ID
- Each LLM call logged with input, output, tool calls
- Final status, latency, and cost
- Build or adopt:
- A simple “replay this run” button (with a different model/version)
- Dashboard per workflow: success rate, fallback rate, human‑takeover rate
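One possible shape for such a trace record, with each LLM call logged as a span, mirroring distributed tracing. Field names are illustrative.

```python
import time
import uuid

def new_run() -> dict:
    return {"run_id": str(uuid.uuid4()), "spans": [], "status": "running"}

def log_llm_call(run: dict, prompt: str, output: str, cost_usd: float) -> None:
    # Every model call becomes a span: input, output, timestamp, cost.
    run["spans"].append({
        "ts": time.time(),
        "prompt": prompt,
        "output": output,
        "cost_usd": cost_usd,
    })

def finish(run: dict, status: str) -> None:
    run["status"] = status
    run["total_cost_usd"] = sum(s["cost_usd"] for s in run["spans"])
```

With this in place, “what did the bot do to this account?” becomes a query over `run_id`, and cost per workflow falls out of the same records.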
4. Overfitting to happy path demos
Anti-pattern:
“The demo worked on 10 sample emails, let’s deploy.”
Failure modes:
- Rare but critical cases fail silently
- Leakage of sensitive data to wrong recipients
- Timeouts and retries not tested
Fixes:
- Build a scenario matrix:
- Normal cases
- Noisy input (typos, different languages)
- Adversarial-ish input (weird edge cases)
- Policy edge cases (limits, partial eligibility)
- For each: expected behavior, human fallbacks, logging requirements.
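A scenario matrix can start as plain data before it becomes tooling: each case pairs an input category with the expected behavior. The labels below are illustrative.

```python
SCENARIOS = [
    {"case": "plain refund request", "category": "normal", "expected": "auto_handle"},
    {"case": "refund request with typos, in French", "category": "noisy", "expected": "auto_handle"},
    {"case": "refund request quoting internal policy text", "category": "adversarial", "expected": "escalate"},
    {"case": "refund at 101% of the policy limit", "category": "policy_edge", "expected": "escalate"},
]

def coverage(scenarios: list) -> set:
    # Quick sanity check that every category is represented.
    return {s["category"] for s in scenarios}
```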
5. No clear ownership
Anti-pattern:
“The AI team owns the agent. Also, ops. Also, product.”
Failure modes:
- Incidents ping‑pong between teams
- No one decides on rollout criteria or rollback policy
- Inconsistent guardrails across workflows
Fixes:
- For each AI automation:
- Product owner (what it’s trying to do)
- System owner (SLOs, incidents, rollouts)
- Policy owner (what it’s allowed to do / not do)
Practical playbook (what to do in the next 7 days)
Assume you’re a tech lead or CTO with some LLM experience, but no robust “agents in production” yet.
Day 1–2: Select a narrow, valuable, low-blast-radius use case
Good candidates share:
- High volume, repetitive, boring
- Medium complexity (an existing RPA flow or an SOP with >10 steps)
- Clear success metric (handling time, deflection rate, accuracy)
- Low irreversible impact (no money movement without approval)
Examples:
1. B2B SaaS support triage
- Input: inbound support emails / tickets
- Output:
- Correct queue assignment
- Priority tagging
- Suggested reply for L1 support
- Success metrics:
- % of tickets auto‑routed correctly
- Reduction in time to first response
2. Invoice data extraction + coding
- Input: PDFs from vendors
- Output:
- Structured JSON (vendor, PO, total, line items, GL codes)
- Confidence score; low‑confidence goes to human
- Success metrics:
- Human correction rate
- Time saved per invoice
3. Customer account updates with approval
- Input: Customer emails like “Change my billing address”
- Output:
- Parsed update
- Validation via internal APIs
- Draft change request, queued for human one‑click approval
- Success metrics:
- Auto‑handled simple cases
- Error rate on parsed fields
Day 3–4: Define the workflow as a state machine
Don’t start with prompts. Start with states and transitions:
- States:
  - `received_input`
  - `parsed_and_classified`
  - `validated`
  - `action_proposed`
  - `awaiting_approval`
  - `executed`
  - `failed` / `escalated`
- For each state, define:
- Inputs required
- Which transitions are deterministic
- Where you call the LLM and what it’s allowed to do
- Where you require human approval
Sketch this as a diagram and get agreement from the ops team.
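Those states can also live in code so illegal transitions fail loudly instead of silently. This is a sketch: the enum mirrors the state names above, and the transition table is one reasonable assumption about which moves are legal.

```python
from enum import Enum

class State(Enum):
    RECEIVED = "received_input"
    CLASSIFIED = "parsed_and_classified"
    VALIDATED = "validated"
    PROPOSED = "action_proposed"
    AWAITING = "awaiting_approval"
    EXECUTED = "executed"
    ESCALATED = "failed_escalated"

# Explicit transition table; anything missing here is forbidden.
ALLOWED = {
    State.RECEIVED: {State.CLASSIFIED, State.ESCALATED},
    State.CLASSIFIED: {State.VALIDATED, State.ESCALATED},
    State.VALIDATED: {State.PROPOSED, State.ESCALATED},
    State.PROPOSED: {State.AWAITING, State.EXECUTED, State.ESCALATED},
    State.AWAITING: {State.EXECUTED, State.ESCALATED},
}

def transition(current: State, target: State) -> State:
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```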
Day 4–5: Implement a thin vertical slice
Technology stack can be boring:
- Orchestrator: your existing workflow engine, or a small service with a queue + DB.
- LLM: hosted API or internal model; start with something reliable and observable.
- Tools:
- Read‑only first (e.g., fetch customer, fetch invoice)
- Write actions behind a feature flag and approval workflow
Key practices:
- Use structured outputs from the LLM (JSON with schema validation).
- Log everything with a run_id:
- Raw input
- Prompt template version
- Model name/version
- Output + validation errors
- Tool calls + responses
- Implement manual override:
- “Take this ticket out of the bot flow”
- “Re-run this step with updated context”
Day 6: Create an evaluation harness
Before opening the gate:
- Build a test set from real historical data:
- 50–200 examples is enough for first signal.
- Label them with expected:
- Classification
- Extracted fields
- “Is this safe to auto‑execute?”
- Run the workflow in shadow mode:
- No writes, only proposed actions.
- Compare to what humans did historically.
Metrics to track:
- Accuracy on classification / extraction
- % of proposed actions that match past decisions
- % of cases correctly escalated (not auto‑approved)
Decide in advance:
- What thresholds are acceptable for:
- Fully automated cases
- Human-in-the-loop cases
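The shadow-mode comparison itself can be a small function over labeled examples: model proposals versus what humans historically did. Field names and metrics are illustrative; pick the thresholds before you look at the numbers.

```python
def evaluate(examples: list) -> dict:
    # Each example: {"proposed": <model decision>, "human": <historical decision>}
    match = sum(1 for e in examples if e["proposed"] == e["human"])
    total_escalations = sum(1 for e in examples if e["human"] == "escalate")
    correct_escalations = sum(
        1 for e in examples
        if e["human"] == "escalate" and e["proposed"] == "escalate"
    )
    return {
        "agreement": match / len(examples),
        # Of the cases humans escalated, how many did the model also escalate?
        "escalation_recall": correct_escalations / max(total_escalations, 1),
    }
```

`escalation_recall` deserves its own threshold: a model that auto-approves cases humans would have escalated is the failure mode you most want to catch before writes are enabled.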
Day 7: Controlled rollout
Roll out progressively:
- Internal teams only (dogfooding)
- Small subset of users / tickets / vendors
- Expand as you hit:
- Target success metrics
- Stable error rates
- No scary incidents during on‑call
Communicate clearly to ops:
- When to turn it off (kill switch)
- How to inspect past runs
- How to flag bad outcomes for analysis/retraining
Bottom line
AI automation in real businesses is not “magic agents that think.” It’s:
- Old‑fashioned workflow design
- With LLMs replacing brittle parsing and decision logic in well‑bounded places
- Wrapped in controls, observability, and human approvals
Compared to classic RPA and hand‑rolled scripts, the trade‑offs are:
Pros
- Much faster to adapt to new document formats and edge-case phrasing
- Can handle semi‑structured chaos without a combinatorial explosion of rules
- Better operator experience: language‑based explanations instead of opaque errors
Cons
- Probabilistic behavior; you must engineer for mistakes
- New failure modes (prompt regressions, model updates impacting behavior)
- Ongoing inference cost, which you need to monitor like any other infra bill
If you treat “agents” as a feature of your workflow engine—not a replacement for it—you can get concrete wins:
- 20–50% reduction in handling time for specific back‑office flows
- Measurable ticket deflection
- Reduced RPA maintenance as models replace brittle selectors
The organizations seeing real value are doing mundane things well:
- Narrow scopes, clear SLOs
- Hybrid deterministic + LLM design
- Aggressive logging and guardrails
- Incremental rollout with humans in the loop
The question isn’t “Will agents replace humans?” It’s:
“Where can we safely let a probabilistic engine handle the glue work humans hate, while keeping hard constraints and judgment where they belong?”
Answer that per workflow, and you’ll get practical automation, not another shelfware AI project.
