Your RPA Scripts Are a Liability: A Pragmatic Guide to AI Automation That Actually Works
Why this matters right now
Most companies already automated the obvious stuff:
- RPA clicking through legacy UIs
- Cron-driven ETL jobs
- CRUD-heavy internal tools
Those helped, but they’re fragile, hard to change, and blind to context. They also don’t touch a large class of “semi-structured” work that humans still grind through:
- Reading and routing inbound emails/tickets
- Reconciling documents and systems with slightly different schemas
- Drafting responses, updates, and reports based on scattered data
AI automation—agents, workflows, copilots, orchestration—is being sold as the magic layer on top. The reality is less magical and more interesting:
- You can now automate work that used to require a human reading/thinking step.
- The cost of orchestrating “micro-decisions” across systems is dropping.
- You can replace some brittle RPA with higher-level, schema-aware logic.
But:
- LLMs are nondeterministic.
- Tool APIs fail.
- Guardrails are immature.
- Ops teams are not used to debugging probabilistic systems.
If you’re running production systems with real SLAs and compliance constraints, the question is not “Can we have AI agents?” but:
- Where do they actually add leverage?
- How do we keep them from turning into an un-auditable blob in the critical path?
This post is about that: mechanisms, not magic.
What’s actually changed (not the press release)
A few real shifts make AI automation useful beyond demos. Not “AGI any day now”—concrete changes you can exploit now.
1. LLMs are good enough at structured reasoning with constraints
They’re not “smart” in the human sense, but:
- They can follow structured instructions reliably when:
  - You bound the task narrowly.
  - You give them the schema and examples.
  - You validate outputs mechanically.
This enables use-cases like:
- Classifying inbound tickets into 15+ categories and extracting 5–10 fields.
- Mapping semi-structured docs into internal schemas (invoices, contracts, reports).
- Generating SQL queries from natural language and validating against a schema.
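As a concrete sketch of “validate outputs mechanically”: the categories, fields, and priority codes below are hypothetical, and in production the JSON string would come from a model call rather than a literal.

```python
import json

# Hypothetical task definition: categories, fields, and priorities are
# illustrative, not from any specific product.
ALLOWED_CATEGORIES = {"billing", "bug_report", "feature_request", "account_access"}
REQUIRED_FIELDS = {"category": str, "priority": str, "customer_id": str, "summary": str}
ALLOWED_PRIORITIES = {"P0", "P1", "P2", "P3"}

def validate_ticket_extraction(raw_model_output: str) -> dict:
    """Mechanically validate the model's JSON before anything downstream sees it."""
    data = json.loads(raw_model_output)  # raises on malformed JSON
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            errors.append(f"wrong type for {field}")
    if data.get("category") not in ALLOWED_CATEGORIES:
        errors.append(f"unknown category: {data.get('category')}")
    if data.get("priority") not in ALLOWED_PRIORITIES:
        errors.append(f"unknown priority: {data.get('priority')}")
    if errors:
        raise ValueError("; ".join(errors))
    return data

# A well-formed model response passes; anything off-schema is rejected.
ok = validate_ticket_extraction(
    '{"category": "billing", "priority": "P2", '
    '"customer_id": "C-1042", "summary": "Duplicate charge on invoice"}'
)
```

The point is that the model never gets the benefit of the doubt: anything outside the declared enums and types is rejected before it reaches a downstream system.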
2. Tool use + function calling is real, not a toy
Models can:
- Decide which tool (API) to call.
- Construct arguments.
- Interpret responses.
- Iterate.
This is the core of “agents” and AI workflows:
- The model orchestrates a sequence of API calls instead of just emitting text.
- You can keep your existing microservices and backends and let the model glue them together.
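A minimal sketch of that loop, with the model stubbed out as a canned decision function (real systems would call a function-calling LLM API; the tool names and arguments here are invented):

```python
# Tool registry: in production these would be real API clients.
TOOLS = {
    "lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"},
    "send_reply": lambda text: {"sent": True, "text": text},
}

def stub_model(conversation):
    """Stand-in for an LLM deciding which tool to call next."""
    if not any(m["role"] == "tool" for m in conversation):
        return {"tool": "lookup_order", "args": {"order_id": "A-7"}}
    status = conversation[-1]["content"]["status"]
    return {"tool": "send_reply", "args": {"text": f"Your order is {status}."}}

def run_agent(user_message, max_steps=5):
    conversation = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):  # hard cap: never loop forever
        decision = stub_model(conversation)
        result = TOOLS[decision["tool"]](**decision["args"])
        conversation.append({"role": "tool", "content": result})
        if decision["tool"] == "send_reply":
            return result
    raise RuntimeError("step budget exhausted")

reply = run_agent("Where is my order A-7?")
```

Note the shape: decide, call, interpret, iterate, with a step cap so a confused model cannot loop indefinitely.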
3. Infra for orchestration exists (or is easy to build)
Workflow engines and orchestrators (Temporal-style, Airflow, in-house DAG runners):
- Can coordinate multi-step, long-running processes.
- Can track state, retries, and compensating actions.
- Can host “AI steps” as just another activity.
You no longer need a special-purpose RPA platform to automate across systems.
4. Cost/performance trade-offs are now tunable
You can:
- Mix small, cheap models for classification/triage with large models for complex reasoning.
- Cache prompt/response pairs (and even tool call plans).
- Fine-tune small models on your domain to replace generic LLM calls for narrow tasks.
This matters for real workloads where 10M+ calls/month is not hypothetical.
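A sketch of two of those levers, cheap-model-first routing plus response caching, with both “models” stubbed and the routing convention invented for illustration:

```python
from functools import lru_cache

def cheap_model(prompt: str) -> str:
    # Stand-in for a small, fast classifier model.
    return "triage:billing"

def large_model(prompt: str) -> str:
    # Stand-in for an expensive reasoning model.
    return "detailed-analysis"

@lru_cache(maxsize=10_000)
def cached_call(model_name: str, prompt: str) -> str:
    # Cache keyed on (model, prompt): repeated identical requests are free.
    model = cheap_model if model_name == "cheap" else large_model
    return model(prompt)

def answer(prompt: str) -> str:
    # Cheap model first; escalate only when the task needs real reasoning.
    triage = cached_call("cheap", prompt)
    if triage.startswith("triage:"):
        return triage
    return cached_call("large", prompt)

first = answer("Classify: duplicate charge on invoice")
second = answer("Classify: duplicate charge on invoice")  # served from cache
```

At 10M+ calls/month, the cache hit rate and the cheap/large routing split dominate the cost curve far more than per-token pricing does.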
How it works (simple mental model)
Think less “autonomous agent” and more “workflow with decision slots.” A practical mental model:
1. The three-layer stack
- Deterministic layer (bottom)
  - Your existing services, APIs, and data stores.
  - Traditional automation: workflows, batch jobs, microservices.
  - Properties: deterministic, testable, observable.
- Decision layer (middle – AI-powered)
  - LLMs and smaller models making bounded decisions:
    - Classification, extraction
    - Routing, prioritization
    - Plan generation (what steps to run)
  - Outputs are validated by code or schemas.
- Interaction layer (top – humans & UIs)
  - Copilots embedded in tools (CRM, ticketing, IDEs).
  - Review/approval flows where humans supervise AI actions.
  - Feedback loops (corrections, overrides).
Key principle: AI lives in the decision layer, not the execution layer.
It decides what to do; deterministic code does the doing.
2. A generic AI automation pipeline
Any serious AI “agent” system is some flavor of:
- Intake
  - Raw input: email, PDF, chat, event stream.
  - Preprocessing: OCR, parsing, splitting, enrichment with metadata.
- Interpretation
  - Model call(s) to:
    - Classify the request.
    - Extract structured fields.
    - Generate an execution plan (sequence of steps + tool calls).
- Orchestration
  - Workflow engine executes the plan:
    - Calls tools/APIs.
    - Stores intermediate state.
    - Handles retries/backoff/compensation.
- Verification
  - Schema validation, business rules.
  - Heuristics/secondary models for anomaly detection.
  - Optional human-in-the-loop for risky paths.
- Act / respond
  - Perform the side-effect (update system, send email, post to Slack).
  - Log everything: inputs, outputs, decisions, tool calls.
Where “agents” differ is mostly:
- How dynamic the plan generation step is.
- How much autonomy you give the model to iterate/tool-call without human approval.
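The key split above, where the model emits a plan as plain data and deterministic code executes it, can be sketched like this (the step names and handlers are hypothetical, and the plan is hard-coded where an interpretation step would generate it):

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str
    params: dict

# Deterministic handlers: the only code allowed to change state.
HANDLERS = {
    "set_queue": lambda params, state: {**state, "queue": params["queue"]},
    "set_priority": lambda params, state: {**state, "priority": params["priority"]},
}

def execute_plan(plan, state=None):
    state = state or {}
    for step in plan:
        if step.action not in HANDLERS:
            # The model proposed something outside the allowlist: refuse.
            raise ValueError(f"plan contains unknown action: {step.action}")
        state = HANDLERS[step.action](step.params, state)
    return state

# The interpretation stage would produce this plan; here it is hard-coded.
plan = [Step("set_queue", {"queue": "billing"}), Step("set_priority", {"priority": "P2"})]
final_state = execute_plan(plan)
```

Because the plan is data, you can log it, diff it, replay it, and reject it before any side effect happens, none of which is possible when the model executes actions directly.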
Where teams get burned (failure modes + anti-patterns)
1. “Let the agent figure it out” in the critical path
Anti-pattern:
- Give an LLM broad goals (“Handle billing tickets”), all your tools, and no tight constraints.
- Put it directly in the production flow.
Failure modes:
- Cost blowups: the model loops through tools.
- Latency spikes: seconds → minutes.
- Weird corner cases: tool misuse, partial updates, inconsistent states.
Mitigation:
- Predefine allowed plans per use case; let the model choose among them or fill in parameters.
- Cap tool invocations per task.
- Use timeouts and fallback paths.
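The tool-cap and fallback mitigations can be sketched as a per-task budget wrapper; the limits and the escalation target are illustrative:

```python
import time

class ToolBudgetExceeded(Exception):
    pass

class BudgetedToolRunner:
    """Every tool call for one task goes through this wrapper."""

    def __init__(self, max_calls=10, deadline_seconds=30.0):
        self.max_calls = max_calls
        self.deadline = time.monotonic() + deadline_seconds
        self.calls = 0

    def call(self, tool_fn, *args, **kwargs):
        if self.calls >= self.max_calls or time.monotonic() > self.deadline:
            raise ToolBudgetExceeded()
        self.calls += 1
        return tool_fn(*args, **kwargs)

def handle_task(runner, noisy_tool):
    try:
        # A looping model would keep requesting tool calls here.
        while True:
            runner.call(noisy_tool)
    except ToolBudgetExceeded:
        # Fallback path: route to a human queue instead of failing silently.
        return {"status": "escalated_to_human"}

result = handle_task(BudgetedToolRunner(max_calls=3), noisy_tool=lambda: None)
```

The budget turns the worst case from “unbounded cost and latency” into “a bounded number of calls, then a deterministic fallback.”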
2. Over-trusting “structured outputs”
Anti-pattern:
- “The model returns JSON, so we’re safe.”
Reality:
- The JSON can be syntactically valid but semantically wrong.
- Models hallucinate IDs, dates, or codes that look plausible.
Mitigation:
- Strict schema validation (types, enums, ranges).
- Cross-checks against your DB or source of truth.
- For high-risk flows (payments, compliance): require corroboration from at least one deterministic rule or separate model.
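A sketch of the cross-check idea: even schema-valid output is verified against a source of truth before anything acts on it. The in-memory ID set stands in for a real database lookup:

```python
# Stand-in for a lookup against the system of record.
KNOWN_CUSTOMER_IDS = {"C-1042", "C-2077", "C-3001"}

def cross_check(extraction: dict) -> dict:
    """Reject extractions whose IDs don't exist, however plausible they look."""
    customer_id = extraction["customer_id"]
    if customer_id not in KNOWN_CUSTOMER_IDS:
        # Syntactically fine, semantically hallucinated: refuse to act on it.
        raise ValueError(f"customer_id {customer_id!r} not found in source of truth")
    return extraction

checked = cross_check({"customer_id": "C-1042", "amount": 120.0})
```

Schema validation catches malformed output; this step catches well-formed fiction, which is the more dangerous failure mode.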
3. Treating AI like RPA (pixel- and DOM-driven)
Anti-pattern:
- Use AI to drive headless browsers clicking through legacy UIs as your primary integration.
Problems:
- Brittle, slow, hard to observe.
- Debugging is painful.
- Changes in UI break everything.
Mitigation:
- Use RPA/DOM automation only as a last resort.
- Build thin adapters over legacy apps (even if that means scraping beneath the UI once and exposing a stable internal API).
- Let AI work with APIs and schemas, not pixels.
4. No ownership for “AI behavior”
Anti-pattern:
- Data science teams own models.
- Platform team owns workflows.
- Business teams own processes.
- Nobody owns the emergent behavior of “the AI system.”
Result:
- Incidents where “the bot did X” and nobody can explain why.
- No clear on-call or runbook for AI-specific failures.
Mitigation:
- Treat AI flows as first-class services:
  - Clear SLOs (latency, accuracy, cost).
  - Owners and on-call rotation.
  - Runbooks that include model/path-specific diagnostics.
- Log the decision trace in a way humans can inspect.
5. Ignoring dataset and prompt drift
Anti-pattern:
- “It worked in the pilot, ship it.”
- Never retrain/fine-tune, never re-evaluate.
Drift sources:
- Business rules change.
- Input distribution changes (new product lines, markets).
- Prompt changes or model upgrades.
Mitigation:
- Keep labeled examples for key flows.
- Regular evaluation (weekly/monthly) on a stable test set.
- Change management for prompts and models (versioning, rollout, rollback).
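One way to make the “stable test set” concrete: a tiny regression harness that re-scores a (stubbed) classifier against frozen labeled examples and gates rollout on accuracy. All names and thresholds here are illustrative:

```python
# Frozen labeled examples: the same set is re-used on every evaluation run.
LABELED_EXAMPLES = [
    ("Invoice 123 was charged twice", "billing"),
    ("App crashes when I open settings", "bug_report"),
    ("Please add dark mode", "feature_request"),
]

def classify_stub(text: str) -> str:
    """Stand-in for a model call under a specific prompt/model version."""
    lowered = text.lower()
    if "charged" in lowered or "invoice" in lowered:
        return "billing"
    if "crash" in lowered:
        return "bug_report"
    return "feature_request"

def evaluate(classifier, examples, min_accuracy=0.9):
    correct = sum(classifier(text) == label for text, label in examples)
    accuracy = correct / len(examples)
    # Rollout gate: a prompt or model change that regresses accuracy fails here.
    return {"accuracy": accuracy, "passed": accuracy >= min_accuracy}

report = evaluate(classify_stub, LABELED_EXAMPLES)
```

Run this on every prompt edit and model upgrade, exactly as you would run a test suite on a code change.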
Practical playbook (what to do in the next 7 days)
Assuming you’re a tech lead / architect trying to move beyond demos.
Day 1–2: Inventory and triage
- List candidate workflows where humans are doing repetitive, semi-structured tasks:
  - Reading → Deciding → Updating a system.
  - Reading → Drafting → Sending a communication.
  - Reading → Matching / reconciling between systems.
- Score them on:
  - Business impact (time/$ saved, latency reduced, error reduction).
  - Tolerance for mistakes (low/medium/high).
  - Integration complexity (do stable APIs exist?).
  - Data sensitivity (PII, financial, regulated).
- Pick 1–2 workflows:
  - Medium to high impact.
  - Medium risk (not life-or-death, not trivial).
  - With at least one system having a usable API.
Example candidates:
- Triage inbound customer emails into:
  - Category, priority, owner team, and suggested response template.
- Classify invoices and extract line items into a finance system.
- Summarize and route internal incident reports.
Day 3–4: Build a constrained prototype
Target: an end-to-end vertical slice that runs in a sandbox or shadow mode.
Architecture:
- Intake: Capture sample inputs (emails, docs, tickets).
- Interpretation step (LLM):
  - Prompt the model to output structured JSON.
  - Explicitly define:
    - Allowed categories/statuses.
    - Required fields with examples.
- Validation:
  - JSON schema validation.
  - Deterministic business rules (e.g., “priority must be one of P0–P3”).
- Orchestration:
  - Use your existing workflow engine or a simple job runner.
  - Wire up API calls to your internal systems (in sandbox).
- Output:
  - For now, log the proposed actions instead of performing them.
  - Or run in “assistant” mode where a human confirms.
Keep the “agent” shallow:
- No open-ended tool selection; hard-code which APIs to call.
- No iterative loops; one-pass decisions.
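The whole shallow slice might look like this, with the model call stubbed and shadow mode logging proposed actions instead of performing them (all names are hypothetical):

```python
def interpret(email_text: str) -> dict:
    """One model call in production; stubbed here with a fixed extraction."""
    return {"category": "billing", "priority": "P2"}

def propose_actions(email_text: str) -> list:
    fields = interpret(email_text)
    # Deterministic business rule, enforced in code, not in the prompt.
    if fields["priority"] not in {"P0", "P1", "P2", "P3"}:
        raise ValueError(f"invalid priority: {fields['priority']}")
    # Fixed plan: no dynamic tool selection, no iteration, one pass.
    return [
        {"action": "set_queue", "queue": fields["category"]},
        {"action": "set_priority", "priority": fields["priority"]},
    ]

shadow_log = []
for email in ["Duplicate charge on invoice 123"]:
    # Shadow mode: record what we *would* do; perform nothing.
    shadow_log.append({"input": email, "proposed": propose_actions(email)})
```

Because nothing is executed, you can run this against live traffic on day one and compare proposals to what humans actually did.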
Day 5: Evaluate with real data
- Run the pipeline on real historical data (or live data in shadow mode).
- Collect metrics:
  - Accuracy of classification/extraction vs. ground truth.
  - Time and cost per task.
  - Failure rates at each stage (model errors, validation rejects, API failures).
- Identify:
  - Obvious prompt bugs.
  - Systemic misclassifications.
  - Integration bottlenecks.
Day 6: Add guardrails and human-in-the-loop
- Define policy:
  - For high-risk outcomes (e.g., refunds > $X, certain categories): require human approval.
  - For low-risk outcomes: auto-apply if confidence is high.
- Implement:
  - Confidence scoring:
    - Simple initial proxy: secondary “validator” model or heuristic rules.
  - UI for human review:
    - Show input, AI proposal, reasons (if available), and easy approve/edit/reject.
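The policy above reduces to a small routing function; the threshold and risk categories are placeholders you would tune:

```python
# Illustrative policy constants: tune these per workflow.
AUTO_APPLY_THRESHOLD = 0.9
HIGH_RISK_CATEGORIES = {"refund", "compliance"}

def route(proposal: dict) -> str:
    """Decide whether an AI proposal is applied automatically or reviewed."""
    if proposal["category"] in HIGH_RISK_CATEGORIES:
        return "human_review"  # policy: always reviewed, regardless of confidence
    if proposal["confidence"] >= AUTO_APPLY_THRESHOLD:
        return "auto_apply"
    return "human_review"

decisions = [
    route({"category": "billing", "confidence": 0.95}),
    route({"category": "refund", "confidence": 0.99}),
    route({"category": "billing", "confidence": 0.6}),
]
```

Keeping the policy in one auditable function, rather than scattered through prompts, makes it reviewable by compliance and trivially testable.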
Day 7: Decide path to production
Based on your findings, decide:
- Green light with scope:
  - Which subsets of tasks can be fully automated.
  - Which remain human-reviewed.
- Owners and SLOs:
  - Who owns this flow operationally.
  - Target accuracy, latency, and budget.
- Next 30 days:
  - Observability: logging, dashboards (volume, error types, human overrides).
  - Hardening: more tests, more validation, limited rollout to certain teams or customers.
Real-world patterns (anonymised)
- B2B SaaS support triage
  - Problem: Manual triage of 5–7k inbound tickets/day into 20+ queues.
  - Solution: LLM-based classifier + extraction of 6 fields (product area, urgency, entitlement, sentiment, etc.), wired into the existing ticketing system.
  - Results:
    - ~60% of tickets fully auto-triaged with a <2% override rate.
    - Median time-to-first-action dropped from 2h to <5 min.
    - All “act” steps are deterministic; the model only decides routing and metadata.
- Mid-market finance team invoice processing
  - Problem: OCR + rules couldn’t handle vendor variation; humans spent ~15 min/invoice.
  - Solution: AI workflow:
    - Extract structured data from each PDF.
    - Validate fields and cross-check against the vendor master and PO system.
    - Flag mismatches for human review.
  - Results:
    - ~70% straight-through processing.
    - Exceptions are now higher-quality (real edge cases), improving trust.
- Internal cybersecurity alert triage
  - Problem: Overwhelmed SOC analysts; many low-signal alerts.
  - Solution:
    - An LLM summarizes logs and context.
    - Classifies alerts into risk bands.
    - For low-risk alerts, suggests auto-close with an explanation; an analyst approves.
  - Results:
    - Analyst time per alert down ~40%.
    - Alerts receiving meaningful human attention increased.
Bottom line
- AI automation is not about “autonomous agents” replacing teams. It’s about inserting a probabilistic decision layer into existing deterministic workflows.
- The biggest wins come from:
  - Making semi-structured work machine-actionable.
  - Reducing human time spent on reading, routing, and drafting.
  - Replacing brittle RPA flows with schema-aware orchestration where possible.
- The main risks are:
  - Letting unbounded agents into the critical path.
  - Under-investing in validation, observability, and ownership.
  - Treating AI automation as a special “lab project” instead of production software.
If you already ship reliable systems, you’re ahead: apply the same rigor—SLOs, testing, observability, change management—to AI workflows, and the “agent” story becomes less mystical and more like what it should be: another tool in your automation stack that earns its place by measurable impact.
