Your RPA Scripts Are Glue Code with a Lobbyist
Why this matters right now
Most “AI automation” pitches sound like this:
“Agents will autonomously run your business; humans just supervise.”
Reality on the ground today:
- You still have 200 brittle RPA scripts clicking through legacy UIs.
- Your “copilots” help people type faster but don’t close tickets faster.
- The finance team still copies CSVs between SaaS tools at month-end.
- Your security team is nervous about an LLM having access to prod.
So what actually changed?
- We can now make systems that read, decide, and act across multiple applications without pixel-perfect UIs or handwritten parsing for every format.
- But we also introduced probabilistic behavior into workflows that used to be deterministic.
This is not just an architecture shift. It’s a governance shift: we’re moving from “scripts you can audit line-by-line” to “systems that learn patterns and need guardrails, observability, and risk acceptance.”
If you care about uptime, data protection, and unit economics, you can’t treat “AI agents” as a product category. Treat them as:
A new way to implement business processes where some steps are probabilistic and some are deterministic, and the interfaces are much looser.
The rest of this post is about how to exploit that without lighting your ops on fire.
What’s actually changed (not the press release)
Three non-marketing shifts matter:
1. We can reliably extract and normalize semi-structured data
Before:
- You wrote fragile regexes, DOM scraping, or ETL jobs for every source.
- “Invoice OCR project” was a 6–12 month slog.
Now:
- Modern models + a small schema give you:
- “Pull vendor name, invoice date, line items, total, and payment terms out of whatever this is” with high recall.
- The unit cost per field extracted is low enough to run at scale.
Implication:
A lot of RPA scripts whose main purpose was “get data out of X to put in Y” are now economically obsolete.
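Part of why this got cheap is that the deterministic work shrinks to validation: the model extracts, your code checks. A sketch of that schema gate (the model call itself is assumed, and the field names are illustrative):

```python
# The deterministic half of "model extracts, code validates".
REQUIRED_FIELDS = {
    "vendor_name": str,
    "invoice_date": str,
    "total": float,
    "line_items": list,
}

def validate_extraction(payload: dict) -> list:
    """Return a list of schema violations; an empty list means the
    extracted JSON is safe to pass downstream."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            errors.append("missing: " + field)
        elif not isinstance(payload[field], expected_type):
            errors.append("wrong type: " + field)
    return errors
```

Anything that fails the check goes to a human queue instead of your accounting system.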
2. Reasonable multi-step decision making is cheap enough
Before:
- Any non-trivial branching logic across tools had to be:
- Hard-coded (BPMN, workflow engines), or
- Manual (human triage queues).
Now:
- Models are good at:
- Mapping messy real-world states to one of N next actions.
- Generating an API call / SQL query / email / ticket update for that action.
They’re not “autonomous,” but they’re cheap policy engines that can absorb domain knowledge faster than your workflow tooling can.
3. Tooling for orchestration is catching up (barely)
We now have:
- Workflow orchestrators that:
- Call models,
- Call tools (APIs, DBs, internal services),
- Track state and retries,
- Log decisions and intermediate artifacts.
Most are immature, but the pattern is stable:
- The LLM is not the workflow engine.
- The workflow engine calls the LLM as one of many tools.
How it works (simple mental model)
Forget “agents” for a minute. Use this mental model instead:
LLM-powered worker = Policy + Interface + Auditor
Where:

- Policy:
  - A function: (context) -> (decision, rationale)
  - Implemented by:
    - A prompt,
    - Optional fine-tuning or RAG,
    - Guardrails (allowed actions, constraints).
- Interface (the “hands”):
  - The worker never “clicks screens” in your mental model. Instead, it calls HTTP APIs, internal services, and workflow steps through a narrow, typed interface you control.
  - This might still be implemented via RPA under the hood in places, but the AI layer talks to stable, named operations like create_invoice(...), not “click here, then tab twice, then paste.”
- Auditor (the “adult in the room”):
  - Observes:
    - Inputs,
    - Model outputs,
    - Resulting side effects (DB writes, emails, etc.).
  - Applies rules:
    - Block obviously wrong / risky actions.
    - Sample for human review.
    - Escalate ambiguous or high-value cases.
  - The auditor can be:
    - Partially automated (another LLM with stricter rules).
    - Human-in-the-loop for sampled or high-risk flows.
Execution pattern
For a given business workflow (e.g., “process vendor invoice”):

1. Trigger:
   - Email arrives / file lands in S3 / ticket created.
2. Data normalization:
   - LLM/vision model turns unstructured input → structured JSON.
3. Policy step:
   - “Given this structured input and current state, what is the next action?”
   - Returns: action_type, parameters, confidence, rationale.
4. Interface call:
   - Orchestrator calls the relevant tool or service.
5. Audit:
   - Validate response against hard constraints.
   - Log machine-readable trace.
   - Optionally route to human for approval.
6. Loop:
   - Repeat steps 3–5 until the workflow reaches a terminal state.
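The loop above can be sketched as one plain function, with the policy (the LLM call), the tools, and the audit passed in as callables. Everything named here is a placeholder, not a specific framework:

```python
from dataclasses import dataclass

@dataclass
class Decision:
    """What the policy step returns, matching the fields described above."""
    action_type: str
    parameters: dict
    confidence: float
    rationale: str

def run_workflow(state: dict, policy, tools: dict, audit, max_steps: int = 10) -> dict:
    """Loop: policy -> audit -> tool call, until a terminal state or step budget."""
    for _ in range(max_steps):
        decision = policy(state)          # probabilistic step (the LLM)
        if decision.action_type == "done":
            break
        if not audit(state, decision):    # deterministic guardrail
            state["escalated"] = True     # hand off to a human instead of acting
            break
        state = tools[decision.action_type](state, decision.parameters)
    return state
```

Note that the orchestrator owns the loop and the step budget; the model only fills in one `Decision` at a time.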
Two concrete patterns
1. Support ticket deflection + resolution
   - Policy: “Can I fully resolve this ticket with current knowledge and permissions?”
   - Interface: get_kb_articles, get_customer_contract, update_ticket, create_bug.
   - Auditor: if the ticket mentions “outage,” “security,” or customer value > X, require human review before closing.
2. Customer onboarding workflow
   - Policy: “Is this customer ready to move to the next KYC stage?”
   - Interface: fetch_documents, run_kyc_check, update_crm, send_email_template.
   - Auditor:
     - Hard rule: Cannot mark KYC as “complete” without verified ID doc.
     - Sample 10% of accounts for random human review.
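That KYC hard rule is exactly the kind of constraint that belongs in code rather than in a prompt. A minimal sketch (the field and action names are assumptions, not a real API):

```python
def kyc_gate(account: dict, proposed_action: str) -> str:
    """Hard rule enforced in code, not in the prompt: an account cannot
    be marked KYC-complete without a verified ID document on file."""
    if proposed_action == "mark_kyc_complete" and not account.get("id_doc_verified", False):
        return "block"
    return "allow"
```

The model can propose `mark_kyc_complete` all it likes; the gate fails closed on a missing or unverified document.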
Where teams get burned (failure modes + anti-patterns)
You’ll see the same patterns in postmortems.
1. Treating LLMs as deterministic CPUs
Failure pattern:
- “If the model answered it once, it will always answer it the same way.”
- No consideration for:
- Version changes,
- Temperature/config tweaks,
- Prompt drift.
Consequences:
- Flaky workflows,
- Non-reproducible bugs,
- Silent behavior changes after a model upgrade.
Mitigation:
- Log inputs, model version, and outputs.
- Fix temperature, top_p, etc. for automation flows.
- Treat model upgrades as schema migrations, with:
- Staging,
- Canaries,
- Rollback.
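A minimal shape for those logs, enough to reproduce or diff any decision after a model upgrade (the storage backend and field names are illustrative):

```python
import datetime
import hashlib
import json

def log_decision(store, *, workflow_id, model, params, prompt, output):
    """Persist everything needed to reproduce a model decision: the exact
    model version, the sampling params, and a fingerprint of the prompt."""
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "workflow_id": workflow_id,
        "model": model,                  # pin an exact version, never "latest"
        "params": params,                # temperature, top_p, ...
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "output": output,
    }
    store.append(json.dumps(record))
    return record
```

With this in place, "canary the new model" becomes "replay logged inputs and diff the outputs."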
2. Over-trusting language understanding for hard constraints
Failure pattern:
- “The model knows not to transfer more than $10k without approval; we told it in the prompt.”
Consequences:
- Edge cases leak:
- Different currency formats,
- Model misreads,
- Jailbreak-style input.
Mitigation:
- Express hard constraints in code, not prompts:
if amount > 10000: require_approval()
- Use the LLM to propose, not enforce, for safety boundaries.
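Spelled out slightly: the model may propose a transfer, but the boundary is a deterministic check that raises rather than trusts (the threshold and names are illustrative):

```python
APPROVAL_THRESHOLD_USD = 10_000  # lives in code and config, never only in a prompt

def execute_transfer(amount_usd: float, human_approved: bool) -> str:
    """The LLM proposes; this function enforces. Over-threshold transfers
    without explicit approval fail closed."""
    if amount_usd > APPROVAL_THRESHOLD_USD and not human_approved:
        raise PermissionError("transfer exceeds threshold; human approval required")
    return "executed"
```

No currency-format quirk or jailbreak in the prompt can route around a comparison on a parsed number.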
3. Over-indexing on “agents” instead of decomposition
Failure pattern:
- One massive “generalist” agent that:
- Reads emails,
- Opens JIRA tickets,
- Updates CRM,
- Sends quotes,
- Responds to customers.
Consequences:
- Hard to debug.
- Hard to test.
- Vague ownership.
- Surprises in production.
Mitigation:
- Decompose by business capability, not by “one agent per user”:
- “Invoice classifier”
- “Ticket resolver”
- “Escalation router”
- Use simple workflows + multiple specialized policies.
4. Ignoring cost dynamics
Failure pattern:
- Assume cost ~ $X per 1k tokens and call it a day.
Reality:
- Token usage and call volume explode with:
- Chained tools,
- Long prompts (audit trails, logs, RAG context),
- “Try again” retries.
Consequences:
- Margins killed by a helper process you thought was cheap.
- Shadow infra & usage patterns invisible to finance.
Mitigation:
- Treat token cost as a first-class metric:
- per-workflow,
- per-task,
- per-customer.
- Put quotas and rate limits in the orchestrator.
- Make “unit economics per automated task” explicit.
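A first-class metric can start this small before graduating to your real metrics pipeline (the rate and workflow names are placeholders; real pricing varies by model and by input vs. output tokens):

```python
from collections import defaultdict

class TokenMeter:
    """Track token spend per workflow so unit economics stay visible."""
    def __init__(self, usd_per_1k_tokens: float):
        self.rate = usd_per_1k_tokens
        self.tokens = defaultdict(int)

    def record(self, workflow: str, n_tokens: int) -> None:
        self.tokens[workflow] += n_tokens

    def cost_usd(self, workflow: str) -> float:
        return self.tokens[workflow] / 1000 * self.rate
```

The same counters are where you hang quotas: refuse the call when a workflow instance exceeds its budget.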
5. Data and access sprawl
Failure pattern:
- Give the automation layer “broad” prod access:
- All tickets,
- All docs,
- All customers.
Consequences:
- Hard-to-bound blast radius.
- Data exfil risk (especially if vendors are involved).
- Compliance headaches.
Mitigation:
- Apply the same least privilege you use for humans:
- Separate service accounts per workflow.
- Narrow scopes.
- Explicit audit logs.
- Segment automation into environments:
- Dev, staging, limited-prod, full-prod.
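The same least-privilege idea, as a per-workflow scope check (names are illustrative; in practice this maps to IAM policies or per-service-account scopes):

```python
# One narrow scope set per workflow's service account, not one broad grant.
WORKFLOW_SCOPES = {
    "invoice_processor": {"read:inbox", "write:accounting"},
    "ticket_resolver": {"read:tickets", "write:tickets", "read:kb"},
}

def authorize(workflow: str, scope: str) -> bool:
    """Deny by default: unknown workflows and unlisted scopes get nothing."""
    return scope in WORKFLOW_SCOPES.get(workflow, set())
```

The blast radius of a misbehaving invoice workflow is then bounded to invoices, not all customer data.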
Practical playbook (what to do in the next 7 days)
Assume you’re a tech lead / CTO with real systems, not a greenfield lab.
Day 1–2: Identify one narrow, painful workflow
Criteria:
- Repeats daily.
- Has clear success criteria.
- Today it is:
  - Manual and boring, or
  - RPA-heavy and brittle.
- Risk if it misbehaves is:
  - Annoying, not catastrophic.
Examples:
- Classify incoming support emails and attach relevant KB articles.
- Extract invoice data into your accounting system.
- Triage bug reports into “frontend/backend/infra” and tag services.
Document:
- Current steps (5–12 steps, not 100).
- Systems touched.
- Who owns it today.
- Definition of “done” and what “harm” looks like.
Day 3: Draw the policy–interface–auditor diagram
For that one workflow:

- Policy:
  - What decision(s) are we asking the model to make?
  - Enumerate possible actions, not English.
- Interface:
  - List the minimum tools needed: read_email, create_ticket, add_tag, comment_ticket.
  - Decide where you’ll call them from:
    - Existing workflow engine,
    - New simple orchestrator,
    - A custom service.
- Auditor:
  - Define:
    - Hard rules in code.
    - Sampling rules (e.g., sample 20% of automated resolutions).
    - Escalation triggers.
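The three auditor pieces (hard rules, sampling, escalation) compose into a single routing function. A sketch, with illustrative trigger terms and the 20% sample rate as an example default:

```python
import random

ESCALATION_TERMS = ("outage", "security")  # example hard escalation triggers

def route(ticket_text: str, sample_rate: float = 0.20,
          rng: random.Random = random) -> str:
    """Auditor sketch: hard rules first, then random sampling, else automate."""
    text = ticket_text.lower()
    if any(term in text for term in ESCALATION_TERMS):
        return "escalate"
    if rng.random() < sample_rate:
        return "human_sample"
    return "auto"
```

Passing the RNG in keeps the sampling testable and lets you pin a seed when replaying incidents.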
Day 4–5: Build a thin vertical slice
Constraints:
- Don’t integrate with everything at once.
- Implement:
- 1–2 policy prompts,
- 2–4 tools,
- 2–3 audit checks.
Focus on:
- Logging:
  - Persist raw inputs, decisions, tool calls, and outputs.
  - Tag all with workflow instance ID.
- Control:
  - Ability to:
    - Disable automation with a flag,
    - Route 100% to human review,
    - Quickly tweak prompts and thresholds.
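The control requirements reduce to a single choke point the orchestrator consults before acting (a plain dict here; a real feature-flag service in production):

```python
# Minimal control surface: every automated action checks this first.
def next_mode(flags: dict) -> str:
    """Resolve the two flags into one of: off, human_review, automated."""
    if not flags["automation_enabled"]:
        return "off"                 # global kill switch
    if flags["force_human_review"]:
        return "human_review"        # route 100% to humans without redeploying
    return "automated"
```

One flag flip, not a redeploy, is what you want at 2 a.m. when the workflow misbehaves.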
Day 6: Run in shadow mode
Shadow mode definition:
- LLM makes decisions and proposes actions.
- System logs what it would have done.
- Humans continue normal operation.
You compare:
- For each instance:
- Did the model’s proposed action match the human?
- Where it differed, was the model acceptable, better, or dangerous?
Collect:
- Sample size: at least 50–100 real cases (more is better).
- Metrics:
- Accuracy (vs. human),
- False-positive / false-negative breakdown for risky actions,
- Average tokens per workflow instance.
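The accuracy and false-positive/false-negative metrics fall out of one comparison over the logged (model, human) pairs; the action names below are illustrative:

```python
def shadow_report(pairs, risky_actions):
    """Compare model proposals vs. what humans actually did in shadow mode.
    pairs: list of (model_action, human_action) tuples."""
    total = len(pairs)
    matches = sum(1 for m, h in pairs if m == h)
    # False positive: model proposed a risky action the human avoided.
    false_pos = sum(1 for m, h in pairs if m in risky_actions and h not in risky_actions)
    # False negative: model missed a risky action the human took.
    false_neg = sum(1 for m, h in pairs if m not in risky_actions and h in risky_actions)
    return {"accuracy": matches / total, "false_pos": false_pos, "false_neg": false_neg}
```

The disagreements, not the aggregate accuracy, are where you spend review time: each one is either a model bug, a human shortcut, or an ambiguous case needing a new rule.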
Day 7: Decide your next move
If:
- Accuracy is high enough for low-risk actions,
- No catastrophic failure patterns observed,
- Cost per instance is reasonable,
Then:
- Move from shadow mode → assisted mode:
- Model makes the call for low-risk paths.
- High-risk paths still require explicit human approval.
Else:
- Don’t scale.
- Document failure modes.
- Adjust:
- Policy design,
- Inputs (maybe you need more structure / context),
- Guardrails.
The key is to get one narrow case to “boring reliability,” then expand.
Bottom line
AI automation in real businesses is not about “agents that run the company.”
It’s about:
- Replacing fragile UI-clicking and regex spaghetti with:
- Structured interfaces,
- Probabilistic policies,
- Strong auditors.
- Accepting that you are:
- Trading some determinism for coverage and speed,
- Trading some explicitness for learned behavior.
The teams that will quietly win here are not the ones with the flashiest “autonomous agent” demos. They will be the ones who:
- Treat LLMs as components inside workflows, not magic brains.
- Maintain observability, version control, and change management for model behavior.
- Make unit economics and risk boundaries explicit per use case.
Your RPA scripts aren’t going away tomorrow. But the economic and operational logic that justified them is. The task now is not to rip everything out, but to:
- Wrap what you have in sane interfaces.
- Slot in LLM-based policies where they materially reduce toil.
- Put auditable guardrails around the whole thing.
If you can’t explain how your AI-powered workflow makes decisions, what it’s allowed to touch, and what it costs per successful task, you’re not running automation—you’re running a demo.
