Stop Gluing LLMs to Forms: A Pragmatic Path from RPA to Real AI Automation

Why this matters this week
The last 12–18 months were about “getting an LLM into production.”
The next 12–18 will be about “removing humans from the middle of boring workflows without blowing up risk, compliance, or uptime.”
The pattern that’s now repeating across real businesses:
- RPA bots, integration scripts, and shared inboxes are the current “glue”
- LLMs + tools are good enough to decide and act in the middle of those flows
- Teams that treat this as systems work (not “chatbot projects”) are seeing:
- 20–60% reduction in manual ticket handling
- Measurable cycle-time cuts on operational workflows
- Lower maintenance vs brittle RPA scripts
But the failure cases are real:
- Silent errors when a “smart agent” does the wrong thing
- Unbounded context windows leading to cost blowups
- Shadow automations built by ops teams without engineering oversight
This week’s reality: the underlying tech for agents, AI workflows, and copilots has crossed a reliability threshold for narrow, well-instrumented tasks. If you’re still thinking in terms of “replace the UI with a bot,” you’re leaving most of the value (and control) on the table.
What’s actually changed (not the press release)
Three concrete shifts that matter to anyone running production systems:
1. Reasonably reliable tool use and orchestration
   - Modern LLMs can:
     - Call tools with typed arguments
     - Interpret structured results
     - Decide when to stop
   - This means you can now ask a model to:
     - Look up data from multiple systems (CRM, billing, logs)
     - Infer a decision (e.g., “issue refund or escalate?”)
     - Execute the decision via APIs
   - Not “AGI agents”; more like “state machines whose branching logic is learned instead of hard-coded.”
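That “learned state machine” framing can be sketched as a bounded loop where the model picks the next tool call. This is a minimal illustration, assuming stub tools and a hand-written `decide()` standing in for the LLM's next-step choice; none of these names come from a real API.

```python
def lookup_customer(customer_id):
    # Stub: in practice this would hit the CRM API.
    return {"id": customer_id, "plan": "pro", "overdue": False}

def check_billing(customer_id):
    # Stub: in practice this would hit the billing system.
    return {"customer_id": customer_id, "open_invoices": 0}

TOOLS = {"lookup_customer": lookup_customer, "check_billing": check_billing}

def decide(state):
    """Stand-in for the LLM: choose the next tool call, or stop."""
    if "customer" not in state:
        return ("lookup_customer", {"customer_id": state["ticket"]["customer_id"]})
    if "billing" not in state:
        return ("check_billing", {"customer_id": state["customer"]["id"]})
    return None  # enough information gathered; stop

def run_workflow(ticket, max_steps=4):
    state = {"ticket": ticket}
    for _ in range(max_steps):  # hard step cap, not open-ended autonomy
        call = decide(state)
        if call is None:
            break
        name, args = call
        result = TOOLS[name](**args)  # typed arguments in, structured result out
        state["customer" if name == "lookup_customer" else "billing"] = result
    return state

state = run_workflow({"customer_id": "c-42", "subject": "refund request"})
```

The branching lives in `decide()`; in production that choice is the model's, but the loop, the tool registry, and the step cap stay deterministic code you own.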
2. Cheaper, fast-enough models with acceptable quality
   - You no longer need the largest frontier model for:
     - Email triage
     - Routing and classification
     - Data extraction and transformation
   - Latency and cost are now compatible with:
     - Inline ticket processing
     - Human-in-the-loop approvals
     - Per-event LLM calls on backend workflows
   - This materially changes the unit economics vs RPA:
     - RPA: high upfront scripting, fragile maintenance, low run cost
     - LLM automation: moderate upfront design, lower maintenance, slightly higher per-run cost, but more adaptable
3. Slightly better guardrails, observability, and policy control
   - JSON-mode, constrained decoding, and tool-schema validation reduce hallucinated actions.
   - Basic observability patterns (structured logs, traces, prompt/version tagging) are emerging.
   - RBAC and policy layers around tools are becoming standard:
     - E.g., “models can draft payments but never execute > $X without a human”
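The “draft but never execute above $X” rule is just policy as code. A minimal sketch, assuming illustrative action names and a $500 threshold (neither comes from any standard library):

```python
PAYMENT_APPROVAL_THRESHOLD = 500  # dollars; above this a human must approve

def can_auto_execute(action, amount):
    """Return True only for actions the model may commit on its own."""
    if action == "draft_payment":
        return True  # drafting is always allowed; nothing irreversible happens
    if action == "execute_payment":
        return amount <= PAYMENT_APPROVAL_THRESHOLD
    return False  # unknown actions are denied by default

assert can_auto_execute("draft_payment", 10_000) is True
assert can_auto_execute("execute_payment", 100) is True
assert can_auto_execute("execute_payment", 9_000) is False
```

The key property: the check runs outside the model, so no prompt injection or bad completion can raise the threshold.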
The net: you can now treat AI agents and workflows as first-class services in your stack, not “weird chatbots someone snuck into the help center.”
How it works (simple mental model)
Use this mental model for AI automation in real businesses:
1. Trigger → 2. Understanding → 3. Decision → 4. Action → 5. Feedback
1. Trigger (event, not conversation)
   Examples:
   - New support ticket arrives
   - Invoice processed
   - Error logged in production
   - Form submitted by a customer
   Implementation: message bus, webhooks, cron jobs.
   Avoid “user must start a chat” as your primary trigger.
2. Understanding (LLM as parser/classifier)
   - Parse unstructured input into a structured representation:
     - Intent, entities, category, priority, risk flags
   - Use small/cheap models for:
     - Classification (e.g., “billing vs bug vs feature request”)
     - Entity extraction (“customer ID, product, amount”)
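Whatever the model returns for this step should be validated against a fixed schema before it enters the workflow. A sketch, assuming the `raw_output` dict stands in for the model's parsed JSON and the category/priority values are illustrative:

```python
ALLOWED_CATEGORIES = {"billing", "bug", "feature_request"}
ALLOWED_PRIORITIES = {"low", "medium", "high"}

def validate_understanding(raw_output):
    """Reject model output that doesn't match the expected structure."""
    category = raw_output.get("category")
    priority = raw_output.get("priority")
    if category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    if priority not in ALLOWED_PRIORITIES:
        raise ValueError(f"unknown priority: {priority!r}")
    return {
        "category": category,
        "priority": priority,
        "entities": dict(raw_output.get("entities", {})),
    }

parsed = validate_understanding(
    {"category": "billing", "priority": "high",
     "entities": {"customer_id": "c-42", "amount": 19.99}}
)
```

Anything that fails validation should fall back to human triage rather than flow downstream.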
3. Decision (policy + model)
   - Combine:
     - Hard rules: compliance, limits, SLAs, “never do X”
     - Model judgment: ambiguous edge cases, prioritization
   - Think of it as: if rules_cover(case) then use_rules() else use_model()
   - The LLM here is a policy adapter, not the sole authority.
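The rules-first split can be made concrete in a few lines. This is a sketch under assumptions: the thresholds, the case fields, and `use_model()` (a stub for an LLM call) are all illustrative.

```python
def use_rules(case):
    """Deterministic policy for cases the hard rules cover."""
    if case["amount"] > 500:
        return "escalate"   # hard limit: never auto-refund above $500
    if case["customer_age_days"] < 7:
        return "escalate"   # new accounts always go to a human
    return None             # rules don't decide this case

def use_model(case):
    """Stub for model judgment on the ambiguous remainder."""
    return "refund" if case["amount"] <= 50 else "escalate"

def decide(case):
    verdict = use_rules(case)
    return verdict if verdict is not None else use_model(case)

assert decide({"amount": 900, "customer_age_days": 400}) == "escalate"
assert decide({"amount": 20, "customer_age_days": 400}) == "refund"
```

Note the asymmetry: rules can veto unconditionally; the model only ever decides inside the space the rules leave open.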
4. Action (tools, not magic)
   - Actions are explicit tools:
     - “create_ticket”
     - “apply_refund”
     - “send_email”
     - “update_crm_record”
   - Each tool:
     - Has a strict schema
     - Enforces permissions
     - Validates arguments
   - The agent’s job: choose the right sequence of tools under constraints.
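One such tool, sketched end to end: `apply_refund` from the list above, with argument validation and a permission check. The role name and the $500 cap are assumptions for illustration, not a prescribed scheme.

```python
REFUND_CAP = 500.0

def apply_refund(caller_role, customer_id, amount):
    """Issue a refund only if the caller is allowed and the args are valid."""
    # Argument validation (the "strict schema" part)
    if not isinstance(customer_id, str) or not customer_id:
        raise ValueError("customer_id must be a non-empty string")
    if not isinstance(amount, (int, float)) or amount <= 0:
        raise ValueError("amount must be a positive number")
    # Permission enforcement
    if caller_role != "refund_agent":
        raise PermissionError(f"role {caller_role!r} may not refund")
    if amount > REFUND_CAP:
        raise PermissionError(f"amount {amount} exceeds cap {REFUND_CAP}")
    return {"status": "refunded", "customer_id": customer_id, "amount": amount}

result = apply_refund("refund_agent", "c-42", 25.0)
```

Every check sits inside the tool, not in the prompt, so a confused model physically cannot exceed the cap.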
5. Feedback (supervision + metrics)
   - Outcomes are logged:
     - Was the refund reversed?
     - Did the user reopen the ticket?
     - Did on-call override the agent’s decision?
   - These signals feed:
     - KPI dashboards (error rates, manual intervention rates)
     - Fine-tuning or prompt revisions
     - Policy changes (“stop auto-closing this class of ticket”)
If you remember nothing else: treat the LLM as a probabilistic decision module inside a deterministic, observable workflow.
Where teams get burned (failure modes + anti-patterns)
1. “Let the agent figure it out” (unbounded autonomy)
Anti-pattern:
– Single “do_everything_agent” with:
– Access to dozens of tools
– No clear task boundary
– No confidence thresholds
Symptoms:
– Tool thrashing (many calls per request)
– Weird edge-case behaviors that are hard to reproduce
– Skyrocketing cost
Fix:
– Narrow, named workflows:
– “refund_eligibility_checker”
– “churn_risk_scorer”
– “invoice_dispute_router”
– Limit tools per workflow to what’s actually needed.
– Enforce max steps and hard timeouts.
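Max steps and hard timeouts are worth enforcing as a run budget the loop checks on every iteration. A minimal sketch, assuming `step()` stands in for one model/tool round-trip; the limits shown are arbitrary:

```python
import time

class BudgetExceeded(Exception):
    pass

def run_with_budget(step, max_steps=3, timeout_s=5.0):
    """Run step() repeatedly until it returns a result, within hard limits."""
    deadline = time.monotonic() + timeout_s
    for i in range(max_steps):
        if time.monotonic() > deadline:
            raise BudgetExceeded(f"timeout after {i} steps")
        result = step(i)
        if result is not None:
            return result
    raise BudgetExceeded(f"no result within {max_steps} steps")

# Finishes on the second step, well within budget.
outcome = run_with_budget(lambda i: "routed" if i == 1 else None)
assert outcome == "routed"
```

When the budget trips, the right fallback is usually “hand to a human with whatever context was gathered,” not a retry loop.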
2. RPA mindset: “screen scraping + LLM = done”
Anti-pattern:
– Trying to replace brittle UI automation with:
– LLM that “reads the page”
– More HTML scraping
– No APIs, no contracts, everything breaks with minor UI changes.
Symptoms:
– High breakage when vendors update layouts
– “It worked yesterday” tickets every week
Fix:
– Use this prioritization:
1. First: APIs / webhooks (official or internal)
2. Second: headless browser + stable selectors with a minimal abstraction layer
3. Last resort: full RPA-style nav; wrap it behind a narrow, typed tool
The LLM should operate on stable abstractions, not raw DOM or PDFs where possible.
3. No safety rails on write actions
Anti-pattern:
– Model can both:
– Draft messages
– Execute irreversible actions (e.g., sending them, issuing payments)
Symptoms:
– Silent damage (e.g., customers get incorrect information)
– Incident reviews that say “we can’t reconstruct what the agent was thinking”
Fix:
– Separate proposal from commit:
– Proposal: model drafts suggested actions
– Commit: rules or humans approve
– Techniques:
– Dual-model check: small validator model checks high-stakes outputs
– Policy as code: e.g., “agent_cannot(issue_refund > $500)”
– Soft launch: shadow mode where actions are logged only for weeks
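The proposal/commit separation can be as small as two functions with an audit trail between them. A sketch under assumptions: the action names, the $500 threshold, and the ticket shape are illustrative.

```python
def propose(ticket):
    """Model side: draft an action. Never executes anything."""
    return {"action": "issue_refund", "amount": ticket["amount"],
            "ticket_id": ticket["id"]}

def commit(proposal, executed_log):
    """Deterministic side: auto-commit only low-stakes proposals."""
    if proposal["action"] == "issue_refund" and proposal["amount"] > 500:
        return "needs_human_approval"
    executed_log.append(proposal)  # audit trail for every committed action
    return "committed"

log = []
assert commit(propose({"id": "t-1", "amount": 40}), log) == "committed"
assert commit(propose({"id": "t-2", "amount": 900}), log) == "needs_human_approval"
assert len(log) == 1
```

Shadow mode falls out naturally: run `propose()` in production but route every proposal to `needs_human_approval` for a few weeks before enabling auto-commit.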
4. Treating prompts as code, but not versioning them
Anti-pattern:
– Prompts edited in the UI or scattered in configs.
– No change history, no rollback.
Symptoms:
– Behavioral regression after “small prompt tweaks”
– Impossible to correlate changes with incidents
Fix:
– Put prompts in version control:
– One file per major workflow/policy
– Semantic version tags (v1.2.3) referenced by the app
– Log: prompt version + model version + tool versions on every run.
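That per-run logging can be one small helper. The version strings and record shape here are illustrative assumptions, not a standard format:

```python
import json
import time

def log_run(logger, workflow, outcome,
            prompt_version="v1.2.3", model_version="model-2024-06",
            tool_versions=None):
    """Emit one structured record per run, tagged with every moving part."""
    record = {
        "ts": time.time(),
        "workflow": workflow,
        "outcome": outcome,
        "prompt_version": prompt_version,  # the exact prompt that was live
        "model_version": model_version,    # the exact model that answered
        "tool_versions": tool_versions or {},
    }
    logger(json.dumps(record, sort_keys=True))
    return record

records = []
rec = log_run(records.append, "refund_eligibility_checker", "auto_closed",
              tool_versions={"apply_refund": "1.4.0"})
```

With this in place, correlating a behavioral regression with a “small prompt tweak” becomes a log query instead of an archaeology project.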
5. No owner, no SLOs
Anti-pattern:
– “The AI team” owns everything and nothing.
– No clear expectation for correctness or latency.
Symptoms:
– Surprise outages when models change or rate-limits hit
– Unresolved arguments over whether “90% correct” is good enough
Fix:
– For each automation:
– Single service owner (like any microservice)
– SLOs:
– Accuracy proxy (e.g., “<1% of actions require reversal”)
– Latency (p95 end-to-end)
– Escalation path (who gets paged, when to auto-fallback to humans)
Practical playbook (what to do in the next 7 days)
Assuming you already have basic LLM capabilities in-house.
Day 1–2: Find one boring, structured workflow
Criteria:
– High volume (hundreds+ events/week)
– Text-heavy, semi-structured inputs
– Clear allowed actions
– Low to medium blast radius if wrong, or easy rollback
Examples:
– Support: classify tickets, propose responses, auto-close duplicates
– Finance ops: triage invoice discrepancies, route to correct owner
– DevOps: enrich incidents with likely service and on-call team
Deliverables:
– A single, well-defined workflow with:
– Inputs
– Current manual steps
– Allowed automated actions
– Success metric (e.g., “% tickets not touched by humans”)
Day 3–4: Wrap your systems in tools
For that one workflow:
- Expose 3–7 tools:
  - Read-only: get_ticket_details, lookup_customer, fetch_recent_errors
  - Write: add_ticket_note, apply_credit (limit=$X), set_ticket_status (with guardrails)
- For each tool:
  - Define a strict schema (types, allowed values)
  - Implement authorization:
    - Limit by tenant, environment, monetary caps
  - Add logging:
    - Who/what called it (agent id, model, prompt version)
    - Inputs/outputs (with redaction for PII as needed)
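Putting those three requirements together for one write tool, `apply_credit` from the list above. A sketch under assumptions: the cap, the crude masking rule, and the log fields are illustrative choices, not a prescribed design.

```python
CREDIT_CAP = 100.0
audit_log = []

def redact(value):
    """Crude PII redaction for logs: mask everything but the last 4 chars."""
    s = str(value)
    return "*" * max(len(s) - 4, 0) + s[-4:]

def apply_credit(agent_id, prompt_version, customer_email, amount):
    if amount > CREDIT_CAP:  # monetary cap enforced in code, not in the prompt
        raise PermissionError(f"credit {amount} exceeds cap {CREDIT_CAP}")
    audit_log.append({
        "tool": "apply_credit",
        "agent_id": agent_id,              # who/what called it
        "prompt_version": prompt_version,  # which prompt was live
        "customer_email": redact(customer_email),  # PII never logged raw
        "amount": amount,
    })
    return {"status": "credited", "amount": amount}

result = apply_credit("triage-agent-1", "v1.2.3", "jane@example.com", 25.0)
```

In practice you would also persist `audit_log` somewhere durable; the point is that logging and caps are inseparable from the tool itself.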
Day 4–5: Implement the narrow agent workflow
Architecture:
- Trigger: event from your existing system (e.g., new ticket)
- Step 1: cheap model for classification & extraction
- Step 2: rules engine:
- If “risky” or “high value” → route to human with model-generated suggestions only
- Else → call the agent for auto-handle
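The Step 2 routing can be a handful of lines. The risk flag, the $200 threshold, and the route names are illustrative assumptions:

```python
def route(ticket):
    """Rules engine: risky or high-value → human with suggestions only."""
    risky = ticket.get("risk_flag", False)
    high_value = ticket.get("amount", 0) > 200
    if risky or high_value:
        return "human_with_suggestions"
    return "auto_handle"

assert route({"amount": 950}) == "human_with_suggestions"
assert route({"amount": 10, "risk_flag": True}) == "human_with_suggestions"
assert route({"amount": 10}) == "auto_handle"
```

Because this gate is deterministic, you can tighten or loosen it instantly during an incident without touching the model or the prompts.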
Agent constraints:
- Max tools per run: e.g., 4
- Max steps: e.g., 3
- Hard latency budget: e.g., 5 seconds total
- No
