Stop Calling It an “Agent” If It’s Just a Script with LLM Glue

Why this matters this week
The “AI agent” pitch has shifted from slides to purchase orders. In the last 60 days, I’ve seen:
- A B2B SaaS company ship an LLM-based ticket triage copilot that quietly replaced 40% of their manual routing and tagging (with better SLA adherence than their old rules engine).
- A logistics firm rip out a brittle RPA chain (7 tools, weekly breakage) and swap in an LLM+workflow system that survived a full ERP UI redesign with zero code changes.
- A fintech back-office team move from “copy-paste between 3 systems” to a supervised automation that handles 60% of KYC remediation tasks end-to-end.
This is not about “AGI agents.” This is about:
- Using language models as tolerant glue between messy systems.
- Replacing XPath- and CSS-selector-fragile RPA with semantic navigation.
- Embedding decision support and task orchestration into existing workflows, not building a sentient butler.
Money and time are now being spent on automation that looks mundane but moves real metrics:
- Ticket handle time
- Time-to-close for ops workflows
- SLA breaches
- Headcount growth vs. volume growth
If you own a system with repetitive, semi-structured knowledge work (support, ops, finance, supply chain), you’re either evaluating this or your CFO will ask why you’re not.
What’s actually changed (not the press release)
Three practical shifts, not eleven-page whitepapers, explain why AI automation is now viable in production.
1. LLMs are “good enough” at ugly enterprise text
Not perfect, not magical—good enough:
- They can robustly extract structured fields from:
  - Messy emails
  - Poorly formatted PDFs
  - Screens with inconsistent labels
- They can normalize to your schema with high recall and decent precision, especially when:
  - You show 10–20 in-domain examples
  - You validate on a small labeled set
This kills a lot of custom regex, hand-tuned NLP, and brittle parsing scripts.
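As a minimal sketch of what "good enough extraction plus code-side validation" looks like: the schema, field names, and the stubbed `call_llm` below are all hypothetical, but the shape — let the model produce JSON, then enforce your schema in code — is the point.

```python
import json

# Hypothetical target schema: field name -> required?
SCHEMA = {"vendor": True, "invoice_number": True, "total": True, "currency": False}

def call_llm(text: str) -> str:
    """Stub standing in for a real model call that returns JSON."""
    return json.dumps({"vendor": " Acme GmbH ", "invoice_number": "INV-931", "total": "1,204.50"})

def extract_fields(text: str) -> dict:
    """Parse the model's output, normalize strings, and fail loudly on missing fields."""
    raw = json.loads(call_llm(text))
    out, missing = {}, []
    for field, required in SCHEMA.items():
        value = raw.get(field)
        if value is None:
            if required:
                missing.append(field)
            continue
        out[field] = value.strip() if isinstance(value, str) else value
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return out
```

The validation loop, not the model call, is what replaces the regex pile: the same code works whether the input was an email, a PDF dump, or a screen scrape.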
2. Tool-using models are becoming standard, not exotic
Models can now reliably:
- Call out to tools (APIs, databases, internal services) with:
  - Reasonable argument formatting
  - Multi-step plans (simple ones) across tools
- Stay within a constrained action set:
  - “You may only call these 5 tools”
  - “You may only write to this subset of fields”
This is the real “agent” story: LLMs as decision engines inside well-defined toolboxes, not free-roaming browsers.
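A toy illustration of the constrained-toolbox idea. The `get_customer`/`post_comment` stand-ins are hypothetical; what matters is that the model can only name tools, and anything off the allowlist is rejected in code before it runs.

```python
# A constrained action set: the model may only call names in this registry.
def get_customer(id: str) -> dict:
    return {"id": id, "tier": "pro"}  # stand-in for a real lookup

def post_comment(thread_id: str, text: str) -> bool:
    return True  # stand-in for a real write

TOOLS = {"get_customer": get_customer, "post_comment": post_comment}

def dispatch(tool_call: dict):
    """Execute one model-proposed tool call, rejecting anything off the allowlist."""
    name = tool_call.get("name")
    if name not in TOOLS:
        raise PermissionError(f"tool not allowed: {name}")
    return TOOLS[name](**tool_call.get("arguments", {}))
```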
3. Orchestration platforms are no longer just RPA++
You don’t have to bolt everything onto screen-scraping robots. Modern stacks give you:
- Event-driven orchestration (queue messages, webhooks, cron) instead of UI triggers.
- Native support for:
  - Retries with backoff
  - Idempotency keys
  - Workflow versioning and migration
  - Human-in-the-loop steps
That makes AI automation feel closer to building backend services than “teaching a robot to click.”
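A rough sketch of two of those primitives — retries with exponential backoff and idempotency keys — using an in-memory dict as a stand-in for the durable store a real workflow engine would provide:

```python
import hashlib
import time

_done = {}  # idempotency store: key -> result (in production, a durable table)

def idempotency_key(workflow_id: str, step: str, payload: str) -> str:
    """Derive a stable key so the same step on the same input runs at most once."""
    return hashlib.sha256(f"{workflow_id}:{step}:{payload}".encode()).hexdigest()

def run_step(key: str, fn, *, retries: int = 3, base_delay: float = 0.01):
    """Run a step once per key, retrying transient failures with exponential backoff."""
    if key in _done:  # already executed: return the recorded result, don't re-run
        return _done[key]
    for attempt in range(retries):
        try:
            result = fn()
            _done[key] = result
            return result
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```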
How it works (simple mental model)
Drop the “agent” mystique; think in terms of four building blocks:
1. Triggers (When to start)
Events that kick off the automation:
- New ticket created
- Invoice received
- Daily batch job
- Alert from monitoring
2. LLM Decisions (What to do)
The model’s job is classification + planning:
- Classify input into:
  - Category / intent
  - Priority
  - Risk flags
- Plan a short sequence of tool calls:
  - “Check customer status”
  - “Fetch related orders”
  - “Propose resolution”
  - “Update CRM”
3. Tools (How to do it)
Normal code/services with carefully defined contracts:
- get_customer(id)
- create_ticket(payload)
- update_invoice_status(id, status)
- post_comment(thread_id, text)
The LLM selects tools and fills arguments; your code executes them and returns results.
4. Control Plane (Who’s in charge)
This is not the LLM. It’s your workflow engine:
- Tracks state
- Applies guardrails:
  - Max steps
  - Timeouts
  - Permissions
- Routes to human review when:
  - Confidence < threshold
  - Consistency checks fail
  - Risky actions are requested
Everything interesting happens at the interfaces:
- Prompt design + tool schemas = how reliably the LLM picks legal actions.
- Workflow orchestration = how safely those actions are sequenced and monitored.
- Human review UI = how fast you escalate and learn from edge cases.
If you think of it as: “LLM as a fuzzy adapter between humans and APIs”, you’re in the right mental frame.
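That frame fits in a few lines of code. In this sketch, the `propose_step`/`execute` callables and the thresholds are placeholders, not a real engine; the point is that the loop, the step limit, and the escalation all live in your code, never in the model.

```python
def run_workflow(task, propose_step, execute, *, max_steps=5, confidence_floor=0.8):
    """Control plane: the loop, limits, and escalation live here, not in the model."""
    for _ in range(max_steps):
        step = propose_step(task)  # LLM suggests the next action
        if step["action"] == "done":
            return {"status": "resolved", "task": task}
        if step.get("confidence", 0.0) < confidence_floor:
            return {"status": "needs_human", "task": task, "step": step}
        task = execute(step, task)  # your code performs the action
    return {"status": "needs_human", "reason": "max_steps", "task": task}
```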
Where teams get burned (failure modes + anti-patterns)
1. Confusing “works in demo” with “works in production”
Anti-pattern:
- Build a slick demo that:
  - Works on cherry-picked tickets/invoices/emails
  - Is measured by “wow” factor in the room
Then you deploy and discover:
- Edge cases swamp your success rate.
- Latency is 10–20 seconds per step.
- Observability is nearly nonexistent.
Mitigation:
- Start with constrained scopes: one queue, one doc type, one geography.
- Set a target like: “Automation handles 40% of cases at >98% accuracy on these 3 metrics.”
- Run in shadow mode for at least 1–2 weeks with real traffic.
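Shadow-mode scoring can be as simple as comparing AI suggestions against what humans actually did. A minimal sketch, assuming you log (ai_suggestion, human_final) pairs during the shadow period:

```python
def shadow_accuracy(records):
    """records: list of (ai_suggestion, human_final) pairs from live traffic.
    Returns the fraction of cases where the AI agreed with the human decision."""
    if not records:
        return 0.0
    agree = sum(1 for ai, human in records if ai == human)
    return agree / len(records)
```

If this number sits below your target after 1–2 weeks of real traffic, you have your answer before anything was automated.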
2. Treating the LLM as autonomous instead of as a library
Anti-pattern:
- Letting the model:
  - Freely write SQL
  - Decide which APIs exist
  - Construct arbitrary HTTP calls
This leads to:
- Security holes
- Dangerous writes
- Latent bugs that only appear under rare input combinations
Mitigation:
- Expose tools as a small, typed API surface.
- Validate all tool arguments in code:
  - Schema validation
  - Permission checks (“can this actor modify this resource?”)
- Use allowlists, not denylists.
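One way to sketch those checks, with a hypothetical field allowlist and a crude role check; a real system would resolve permissions properly, but the shape — validate everything in code before any write — is the same:

```python
WRITABLE_FIELDS = {"tags", "priority", "internal_note"}  # allowlist, not denylist

def validate_update(actor_roles: set, resource_owner_team: str, changes: dict) -> dict:
    """Code-side checks on a model-proposed update before anything is written."""
    illegal = set(changes) - WRITABLE_FIELDS
    if illegal:
        raise ValueError(f"fields not writable by automation: {sorted(illegal)}")
    if resource_owner_team not in actor_roles and "ops_admin" not in actor_roles:
        raise PermissionError("actor may not modify this resource")
    return changes
```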
3. Ignoring monitoring and drift
Anti-pattern:
- No metrics beyond “usage.”
- No feedback loop:
  - When humans correct the AI, it’s not captured.
  - Model updates are “pull latest version” decisions.
Mitigation:
- Track:
  - Automation coverage (% of tasks touched)
  - Resolution rate without human changes
  - Rework rate where humans override the AI
  - Latency per step and per workflow
- Build a simple “feedback capture” flow:
  - Every correction is logged with:
    - Input
    - AI suggestion
    - Human final answer
- Version your prompts and workflows; don’t hot-swap in prod without A/B or canaries.
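The first three rates fall out of a single pass over logged events. A sketch, assuming each event records whether the AI touched the task and whether a human overrode it (the field names are illustrative):

```python
def automation_metrics(events):
    """events: dicts with 'touched' (AI attempted the task) and
    'overridden' (a human changed the AI's output)."""
    total = len(events)
    touched = [e for e in events if e["touched"]]
    coverage = len(touched) / total if total else 0.0
    rework = (sum(1 for e in touched if e["overridden"]) / len(touched)) if touched else 0.0
    return {
        "coverage": coverage,              # % of tasks the automation touched
        "rework_rate": rework,             # % of touched tasks humans corrected
        "resolution_rate": coverage * (1 - rework),  # % resolved with no human change
    }
```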
4. Over-relying on UI automation (RPA nostalgia)
Anti-pattern:
- Using AI agents to drive browser UIs for systems that already have APIs.
- Building full workflows on top of screen scraping because “we already have RPA.”
Result:
- Fragility due to UI changes
- Slow execution
- Hard-to-debug failure modes
Mitigation:
- Treat UI automation as last resort, not default.
- Policy: “If it has an API, we use the API.”
- If no API:
  - Limit UI automation to the few steps that truly require it.
  - Encapsulate those in a separate, well-monitored component.
5. No explicit risk boundaries
Anti-pattern:
- Same agent allowed to:
  - Draft emails
  - Issue refunds
  - Change KYC status
Mitigation:
- Define risk tiers:
  - Tier 0: Read-only, suggestions only
  - Tier 1: Low-impact writes (tags, comments)
  - Tier 2: Reversible operations (ticket routing, non-monetary status changes)
  - Tier 3: Monetary or compliance-sensitive actions (refunds, approvals)
- Agents should start at Tier 0–1.
- Tier 2–3 actions must be gated by:
  - Policy rules
  - Human approvals
  - Additional checksums/validations
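A tier gate can be a small lookup plus one rule. The action names below are illustrative, but the policy mirrors the tiers above: Tier 0–1 may auto-execute, Tier 2–3 block on an explicit human approval.

```python
RISK_TIERS = {  # action name -> tier, mirroring the tiers above
    "draft_email": 0,
    "add_tag": 1,
    "route_ticket": 2,
    "issue_refund": 3,
}

def gate(action: str, human_approved: bool = False) -> dict:
    """Tier 0-1 may auto-execute; Tier 2-3 require an explicit human approval."""
    tier = RISK_TIERS.get(action)
    if tier is None:
        raise PermissionError(f"unknown action: {action}")
    if tier >= 2 and not human_approved:
        return {"action": action, "status": "pending_approval", "tier": tier}
    return {"action": action, "status": "approved", "tier": tier}
```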
Practical playbook (what to do in the next 7 days)
Assuming you have authority over at least one production workflow.
Day 1–2: Identify one “thin slice” candidate
Look for a process that is:
- High volume, low to medium risk.
- Textual and semi-structured.
- Already instrumented (you can measure throughput/error).
Common good targets:
- Support ticket triage and enrichment (not final responses).
- Invoice classification and field extraction.
- Internal request routing (IT, HR, finance).
Define success in numbers:
- Example: “Auto-categorize 60% of new tickets with <1% misroute rate.”
Day 3: Map the current workflow
With the team that actually does the work:
- Write down:
  - Inputs (systems, formats)
  - Outputs (what counts as “done”)
  - Systems touched
  - Decisions made by humans
- Separate steps into:
  - Mechanical (copy/paste, lookups)
  - Judgment (is this high priority? is this fraud?)
You’ll likely find 60–80% mechanical, 20–40% judgment.
Day 4: Design a minimal AI-assisted flow
Goal this week is assisted mode, not full autonomy.
Example architecture:
- Trigger: New ticket in queue.
- LLM call:
  - Summarize the issue
  - Propose category/priority
  - Extract 3–5 key fields
- Tools:
  - Fetch customer plan/tier
  - Check for known incidents
- Output:
  - Pre-filled ticket fields
  - Suggested tags
  - Short internal summary
Humans still:
- Approve or correct fields.
- Own final routing and response.
This gives you:
- Immediate productivity gain.
- High-signal training data from corrections.
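The assisted flow above reduces to: take a ticket, get suggestions, and emit a review object rather than performing any writes. A sketch with a canned stand-in for the LLM call (the field names are hypothetical):

```python
def assist_ticket(ticket: dict, llm=None) -> dict:
    """Assisted mode: produce suggestions only; humans approve before any write."""
    # Canned stand-in for a real model call returning classification + summary.
    llm = llm or (lambda t: {"category": "billing", "priority": "high",
                             "summary": "Customer double-charged on renewal."})
    suggestion = llm(ticket)
    return {
        "ticket_id": ticket["id"],
        "suggested_fields": {"category": suggestion["category"],
                             "priority": suggestion["priority"]},
        "summary": suggestion["summary"],
        "status": "awaiting_review",  # no write happens until a human accepts
    }
```

Every human correction to this object is exactly the training data the next iteration needs.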
Day 5–6: Implement with guardrails
Concretely:
- Choose a capable model (don’t over-optimize this yet).
- Define tools with explicit schemas.
- Implement:
  - Timeouts per LLM call.
  - Max steps if you’re using tool-calling.
  - Logging of inputs/outputs (with redaction where needed).
- Add a simple UI for:
  - Showing suggestions
  - One-click “accept all / revert to manual”
  - Flagging bad suggestions
Ensure:
- No write operations bypass human review at this stage.
- All writes are performed by your backend, not by arbitrary model decisions.
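For the logging-with-redaction piece, even a single regex pass before storage beats logging raw text. A minimal sketch that only handles email addresses; a real redactor would cover more PII classes:

```python
import json
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def log_llm_call(prompt: str, output: str) -> str:
    """Serialize an LLM input/output pair with emails redacted before storage."""
    record = {
        "prompt": EMAIL.sub("<email>", prompt),
        "output": EMAIL.sub("<email>", output),
    }
    return json.dumps(record)
```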
Day 7: Ship in shadow mode and instrument
Roll out to a small subset (10–20% of traffic).
