Stop Gluing LLMs to UIs: A Pragmatic Path from RPA to Real AI Automation

Why this matters this week
Most teams experimenting with “AI agents” are rediscovering an old lesson: if you bolt an LLM onto a brittle process, you just get a more expensive, more chaotic RPA bot.
In real production systems, companies are quietly replacing:
- Screen-scraping RPA with API-native, tool-using AI workflows.
- One-off copilots with orchestrated task flows (approval, verification, logging).
- Ad-hoc “chat with your data” tools with embedded, domain-specific assistants inside existing apps.
And they’re not doing it because the demos are cool. They’re doing it because:
- The unit economics look better than hiring another 20 offshore operators.
- The failure modes are easier to contain than with legacy RPA.
- They can measure throughput, error rates, and latency with the same rigor as any other production service.
This post is about that layer: AI automation in real businesses—agents, workflows, copilots, orchestration—specifically as a replacement for brittle RPA, not as a shiny UX toy.
What’s actually changed (not the press release)
Three concrete shifts in the last 6–12 months have made AI automation less of a research project and more of an engineering problem.
1. Tool-using LLMs are finally usable
We now have LLMs that can reliably:
- Call tools via structured schemas.
- Respect function signatures reasonably well.
- Chain multiple tool invocations in a single “thought.”
This means you can have an “agent” that:
- Pulls data from your CRM,
- Queries internal APIs,
- Writes to a ticketing system,
- Summarizes the result for a human approver.
…without hacky prompt regex and DOM scraping.
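To make “structured schemas” concrete, here is a minimal sketch of the pattern in Python. The tool name, the stubbed CRM lookup, and the shape of the model’s proposed call are illustrative assumptions, not any specific vendor’s API:

```python
import json

from jsonschema import validate  # pip install jsonschema

def get_customer(customer_id: str) -> dict:
    # Stub standing in for your real CRM client.
    return {"id": customer_id, "name": "Acme Corp", "tier": "enterprise"}

TOOLS = {
    "get_customer": {
        "fn": get_customer,
        "schema": {
            "type": "object",
            "properties": {"customer_id": {"type": "string"}},
            "required": ["customer_id"],
            "additionalProperties": False,
        },
    },
}

def dispatch(tool_call: dict) -> dict:
    """Validate a model-proposed tool call against its schema, then run it."""
    name = tool_call["name"]
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")           # fail closed
    args = json.loads(tool_call["arguments"])               # model emits args as JSON text
    validate(instance=args, schema=TOOLS[name]["schema"])   # reject malformed calls
    return TOOLS[name]["fn"](**args)

# The model proposed this call; we validate it and only then execute it.
print(dispatch({"name": "get_customer", "arguments": '{"customer_id": "c_123"}'}))
```

The important part is the dispatch step: the model can only propose calls, and nothing runs until the name and arguments pass validation.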
2. Orchestration frameworks got boring (in a good way)
Instead of bespoke “agent loop” code in each app, we’re seeing reusable patterns around:
- Task decomposition.
- Multi-step workflows with checkpoints.
- Human-in-the-loop review gates.
- Idempotency and retries.
These are increasingly aligning with familiar concepts from workflow engines (DAGs, state machines), just with LLMs embedded as decision and transformation nodes.
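Here is a toy version of that idea in Python: a state machine where the orchestration is plain code and the LLM shows up as just one node. The states, retry limit, and the stubbed extract_fields step are assumptions; in practice this role belongs to your workflow engine of choice.

```python
from dataclasses import dataclass, field

MAX_RETRIES = 2

@dataclass
class Job:
    payload: dict
    state: str = "PENDING"   # PENDING -> EXTRACTED -> COMPLETED / IN_REVIEW / FAILED
    result: dict = field(default_factory=dict)
    attempts: int = 0

def extract_fields(payload: dict) -> dict:
    # Stub for an LLM step: turn unstructured input into structured fields.
    return {"vendor": "Acme", "amount": 1240.0, "confidence": 0.72}

def run(job: Job) -> Job:
    # Control flow lives here, in plain code; the LLM is embedded as a node.
    while job.state not in ("COMPLETED", "FAILED", "IN_REVIEW"):
        if job.state == "PENDING":
            try:
                job.result = extract_fields(job.payload)
                job.state = "EXTRACTED"
            except Exception:
                job.attempts += 1
                job.state = "PENDING" if job.attempts <= MAX_RETRIES else "FAILED"
        elif job.state == "EXTRACTED":
            # Human-in-the-loop gate: low confidence goes to a review queue.
            job.state = "IN_REVIEW" if job.result["confidence"] < 0.8 else "COMPLETED"
    return job

print(run(Job(payload={"email_body": "Invoice attached..."})).state)  # IN_REVIEW
```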
3. Data access is no longer “dump everything into a vector DB”
Teams are building thin semantic layers on top of existing systems:
- “Customer 360” queries answered via your warehouse, not via scraped PDFs.
- Policy and compliance rules maintained as code, not hidden in prompts.
And they’re using vector search as one tool in the stack, not as the whole data architecture.
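A small sketch of what that looks like in practice: vector search registered as just another tool next to a governed warehouse query, rather than acting as the data layer itself. The function names and the pre-approved query_id convention are assumptions for illustration.

```python
def run_sql(query_id: str, params: dict) -> list[dict]:
    # Parameterized, pre-approved warehouse query (e.g., a "customer 360" view).
    return [{"customer_id": params["customer_id"], "lifetime_value": 48200}]

def vector_search(query: str, top_k: int = 5) -> list[dict]:
    # Semantic lookup over documents the warehouse can't answer (contracts, emails).
    return [{"doc_id": "contract-77", "score": 0.91}][:top_k]

# Both are just entries in the tool registry; neither is "the" data layer.
TOOLS = {"run_sql": run_sql, "vector_search": vector_search}
```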
How it works (simple mental model)
Forget “autonomous agents roaming your systems.” Think of AI automation as:
A workflow engine where some steps are LLM calls instead of hard-coded logic.
You still design the workflow. The LLM fills in judgment calls and unstructured text handling.
Core building blocks
Tasks, not “agents”
- Task = a defined unit of work with:
  - Inputs (structured + unstructured).
  - Preconditions.
  - Success criteria.
- Examples:
  - “Validate this invoice against PO and contract.”
  - “Summarize this support ticket and suggest 3 response options.”
  - “Extract key fields from this email and create a CRM lead.”
Tools
- Narrow, deterministic capabilities the LLM can invoke: get_customer(id), create_ticket(data), run_sql(query_id, params), generate_invoice_pdf(data).
- Tools are your safety rails:
  - Least privilege by default.
  - Strong typing.
  - Auditable logs.
Controllers
- Non-LLM logic that:
  - Orchestrates which tasks to run.
  - Sets budgets (tokens, time, retries).
  - Decides when to escalate to humans.
- Often implemented as:
  - Workflow DAGs (e.g., Step → Step → Parallel Steps → Join).
  - State machines (Pending → In Review → Completed/Failed).
Guards
- Hard constraints around the LLM:
  - Schema validators.
  - Policy checkers (e.g., PII, forbidden actions).
  - Cost/latency limits.
- Think of them as middleware around every model call.
Mental model in one diagram (textual)
- External trigger (event, API, cron)
→ Controller decides which Task to run
→ Task uses LLM + Tools to produce structured result
→ Guards validate / sanitize
→ Controller persists state + may trigger next Task or human review.
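Here is the same diagram as a minimal Python sketch, with the LLM-backed task stubbed out. Every name in it (the task, the guards, the confidence threshold) is illustrative; the point is the shape: the controller owns the flow, and guards sit between the model and your state.

```python
import json

def triage_invoice_task(payload: dict) -> dict:
    # The LLM + tools would run inside this task; stubbed with a structured result.
    return {"action": "approve", "amount": 1240.0, "confidence": 0.65}

GUARDS = [
    lambda r: r["action"] in {"approve", "reject", "escalate"},  # enum check
    lambda r: r["amount"] >= 0,                                  # basic sanity check
]

def controller(event: dict, store: dict) -> None:
    # Controller owns the flow: run the task, apply guards, persist or escalate.
    result = triage_invoice_task(event)
    if not all(guard(result) for guard in GUARDS):
        store[event["id"]] = {"state": "FAILED_VALIDATION", "raw": result}
    elif result["confidence"] < 0.8:
        store[event["id"]] = {"state": "HUMAN_REVIEW", "result": result}
    else:
        store[event["id"]] = {"state": "DONE", "result": result}

state_store: dict = {}
controller({"id": "inv-001", "source": "email"}, state_store)
print(json.dumps(state_store, indent=2))  # inv-001 lands in HUMAN_REVIEW
```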
This is fundamentally different from the old RPA bot that pretends to be a user:
- You operate at the API and data level, not the UI DOM level.
- The AI handles ambiguity and text synthesis, not navigation.
Where teams get burned (failure modes + anti-patterns)
1. Letting the LLM own control flow
Anti-pattern:
- One giant prompt: “You are an agent. Decide what to do next, call tools as needed, keep going until done.”
Issues:
- Unbounded loops, unpredictable latency.
- Hard to test and reason about.
- Cost blowups when prompts accrete context.
Fix:
- Keep control flow in your code / workflow engine.
- Use LLMs inside steps, not as the orchestrator of steps (at least initially).
2. Over-trusting free-form output
Anti-pattern:
- Letting free-form LLM output go straight into production as:
- SQL queries executed against prod.
- API calls composed from natural language.
- Emails sent to customers.
Fix:
- Structured outputs with strict validation:
- JSON schemas with required/optional fields.
- Enums for allowed actions.
- Post-LLM guards:
- Query analyzers.
- Dry-runs.
- Staging or sandbox environments for high-risk operations.
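A minimal sketch of that fix, using pydantic for schema enforcement. The action enum, the refund cap, and the example payload are assumptions; the pattern is what matters: the model’s output either parses into a constrained type or it goes to a human.

```python
from enum import Enum

from pydantic import BaseModel, Field, ValidationError  # pip install pydantic

class Action(str, Enum):
    refund = "refund"
    escalate = "escalate"
    close = "close"

class TicketDecision(BaseModel):
    action: Action                                          # only enumerated actions are legal
    refund_amount: float = Field(default=0, ge=0, le=500)   # hard cap lives here, not in the prompt
    rationale: str

raw_model_output = '{"action": "refund", "refund_amount": 42.0, "rationale": "Duplicate charge."}'
try:
    decision = TicketDecision.model_validate_json(raw_model_output)  # pydantic v2
except ValidationError:
    decision = None  # invalid output: route to human review instead of acting on it
```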
3. Ignoring observability
Anti-pattern:
- No metrics beyond “we called the model X times.”
Symptoms:
- You don’t know:
- Which workflows are saving time vs. just moving work around.
- Where hallucinations or data errors are happening.
- Token cost per transaction.
Fix:
- Treat each AI workflow as a service:
- Capture: input type, model used, tools called, latency, token cost, error classification.
- Build dashboards by workflow and task, not by model.
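One way to get there is to instrument every model call at the point where it happens. A rough sketch, assuming your call function returns a dict with token counts (adjust the field names to whatever your SDK actually reports):

```python
import json
import logging
import time

logger = logging.getLogger("ai_workflows")

def instrumented_llm_call(workflow: str, task: str, model: str, call_fn, **kwargs):
    """Run one model call and emit a structured log record for it."""
    start = time.monotonic()
    record = {"workflow": workflow, "task": task, "model": model}
    try:
        response = call_fn(**kwargs)
        record.update(
            latency_ms=round((time.monotonic() - start) * 1000),
            prompt_tokens=response.get("prompt_tokens", 0),
            completion_tokens=response.get("completion_tokens", 0),
            error=None,
        )
        return response
    except Exception as exc:
        record.update(
            latency_ms=round((time.monotonic() - start) * 1000),
            error=type(exc).__name__,  # error classification, not just a count
        )
        raise
    finally:
        logger.info(json.dumps(record))  # aggregate dashboards by workflow/task, not by model
```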
4. “RPA with extra steps” deployments
Anti-pattern:
- Pointing LLMs at the UI, scraping HTML, and driving headless browsers: same brittleness as RPA, with extra latency and opacity.
Sometimes you have no API access. Even then:
- Use LLMs for:
- Parsing complex documents.
- Normalizing messy inputs.
- Keep the UI automation piece as thin and deterministic as possible.
5. No explicit “fail safe” modes
Anti-pattern:
- When the AI can’t complete a task, it silently does something approximate.
Fix:
- Design intentional fallbacks:
- Clear “I don’t know” paths.
- Human escalation queues.
- Partial completion with clear flags (e.g., confidence_score, requires_review: true).
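A small sketch of what an intentional fallback can look like in code, using the same flags. The threshold and field names are illustrative:

```python
from typing import Optional

def finalize(result: Optional[dict], confidence: float) -> dict:
    # Three explicit outcomes: confident result, flagged partial, or escalation.
    if result is None:
        return {"status": "escalated", "requires_review": True, "reason": "no_answer"}
    if confidence < 0.8:
        return {**result, "confidence_score": confidence, "requires_review": True}
    return {**result, "confidence_score": confidence, "requires_review": False}
```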
Practical playbook (what to do in the next 7 days)
Assuming you’re a tech lead / architect with some mandate but not infinite time.
Day 1–2: Identify one candidate workflow
Criteria for a good first target:
- High volume, medium complexity, low direct blast radius.
- Current process is:
- Text-heavy or unstructured.
- Done by humans following relatively clear patterns.
- Systems already expose APIs or at least stable data access.
Examples that work well:
- Invoice triage in a mid-size B2B company
  - Input: PDFs + emails.
  - Output: Validated line items, matched to POs, flags for exceptions.
  - Impact: Finance team hours, faster payment cycles.
- Support ticket summarization + suggestion
  - Input: Long threads from various channels.
  - Output: Structured summary + 2–3 suggested replies.
  - Impact: Handle time reduction without changing your helpdesk stack.
- Lead enrichment + routing
  - Input: Raw form submissions/emails.
  - Output: Normalized company, segment, enrichment data, routing to the correct owner.
  - Impact: Sales ops efficiency, faster SLA.
Pick one. Write down:
- Current average handle time.
- Error rate / rework rate (even a rough estimate).
- Volume per day.
These become your baseline metrics.
Day 3: Model the workflow explicitly
Draw a simple flow (on paper / whiteboard is fine):
- Trigger → Step 1 → Decision → Step 2A/2B → Possible human review → Done.
Classify steps:
- Deterministic (API call, lookup, simple rule).
- Judgment / text manipulation (summarize, interpret, classify, extract).
Plan to use LLMs only for the second category.
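It can help to write that classification down as configuration rather than keeping it in your head. A tiny illustrative example (the step names are made up):

```python
# Deterministic steps stay as plain code; only "judgment" steps get an LLM call.
WORKFLOW_STEPS = {
    "fetch_po": "deterministic",                # API lookup
    "match_line_items": "judgment",             # LLM: interpret messy invoice text
    "check_amount_tolerance": "deterministic",  # simple rule
    "draft_exception_note": "judgment",         # LLM: summarize for the reviewer
}
```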
Day 4–5: Build a narrow “AI inside workflow” prototype
Key constraints:
- One model family.
- One or two tools only.
- Full logging from day one.
Implementation sketch:
- Wrap the LLM behind a small service with:
  - Fixed prompt templates per task.
  - JSON schema enforcement (either via built-in tool calling or post-parse validation).
- Wire it into your workflow engine / job queue:
  - Accept a job.
  - Call deterministic steps.
  - For judgment steps:
    - Build a minimal context (only what’s needed).
    - Call the LLM.
    - Validate and return structured data.
Keep a hard ceiling:
- Max tokens per call.
- Max number of LLM calls per job.
- Timeouts that fall back to human handling.
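A sketch of those ceilings as a per-job budget object that every judgment step has to pass through. The limits and the budget_exceeded fallback are illustrative defaults, not recommendations:

```python
import time
from dataclasses import dataclass

@dataclass
class Budget:
    max_llm_calls: int = 3
    max_total_tokens: int = 12_000
    deadline_s: float = 30.0
    started_at: float = 0.0
    calls: int = 0
    tokens: int = 0

    def start(self) -> None:
        self.started_at = time.monotonic()

    def allow(self, estimated_tokens: int) -> bool:
        over_time = time.monotonic() - self.started_at > self.deadline_s
        over_calls = self.calls + 1 > self.max_llm_calls
        over_tokens = self.tokens + estimated_tokens > self.max_total_tokens
        return not (over_time or over_calls or over_tokens)

def judgment_step(budget: Budget, estimated_tokens: int) -> dict:
    if not budget.allow(estimated_tokens):
        return {"requires_review": True, "reason": "budget_exceeded"}  # fall back to a human
    budget.calls += 1
    budget.tokens += estimated_tokens
    # ...the real LLM call would happen here...
    return {"requires_review": False}

budget = Budget()
budget.start()
print(judgment_step(budget, estimated_tokens=2_500))
```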
Day 6: Run in shadow mode
Run the workflow alongside existing processes for a small sample:
- 50–200 real items, depending on volume.
- Log:
- AI output vs. human outcome.
- Cases where:
- AI is wrong.
- AI is right but unhelpful.
- AI is right and saves time.
Have operators tag:
- “Would you trust this with no review?”
- “Needs partial review.”
- “Completely wrong.”
You want a rough measure of precision, not perfection.
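A minimal sketch of what a shadow-mode record can look like, so that rough precision comes out of a log rather than being eyeballed. The field names and tags mirror the list above; everything else is an assumption:

```python
from dataclasses import dataclass

@dataclass
class ShadowRecord:
    item_id: str
    ai_output: dict
    human_outcome: dict
    operator_tag: str  # "trust_no_review" | "needs_partial_review" | "completely_wrong"

def rough_precision(records: list) -> float:
    # Rough proxy: share of items not tagged "completely_wrong".
    usable = [r for r in records if r.operator_tag != "completely_wrong"]
    return len(usable) / len(records) if records else 0.0

records = [
    ShadowRecord("t-1", {"action": "close"}, {"action": "close"}, "trust_no_review"),
    ShadowRecord("t-2", {"action": "refund"}, {"action": "escalate"}, "completely_wrong"),
]
print(rough_precision(records))  # 0.5
```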
Day 7: Decide deployment mode
Based on shadow results, choose:
- Copilot mode (lowest risk)
  - AI prepares suggestions; humans remain primary actors.
  - Good when:
    - Error cost is high.
    - You need trust/acceptance from operators.
- Assisted automation
  - AI does the full workflow, but 10–20% of cases
