Stop Calling It an “Agent” If It’s Just a Script with LLM Glue


Why this matters this week

The “AI agent” pitch has shifted from slides to purchase orders. In the last 60 days, I’ve seen:

  • A B2B SaaS company ship an LLM-based ticket triage copilot that quietly replaced 40% of their manual routing and tagging (with better SLA adherence than their old rules engine).
  • A logistics firm rip out a brittle RPA chain (7 tools, weekly breakage) and swap in an LLM+workflow system that survived a full ERP UI redesign with zero code changes.
  • A fintech back-office team move from “copy-paste between 3 systems” to a supervised automation that handles 60% of KYC remediation tasks end-to-end.

This is not about “AGI agents.” This is about:

  • Using language models as tolerant glue between messy systems.
  • Replacing XPath- and CSS-selector-fragile RPA with semantic navigation.
  • Embedding decision support and task orchestration into existing workflows, not building a sentient butler.

Money and time are now being spent on automation that looks mundane but moves real metrics:

  • Ticket handle time
  • Time-to-close for ops workflows
  • SLA breaches
  • Headcount growth vs. volume growth

If you own a system with repetitive, semi-structured knowledge work (support, ops, finance, supply chain), you’re either evaluating this or your CFO will ask why you’re not.

What’s actually changed (not the press release)

Three practical shifts, not eleven-page whitepapers, explain why AI automation is now viable in production.

1. LLMs are “good enough” at ugly enterprise text

Not perfect, not magical—good enough:

  • They can robustly extract structured fields from:
    • Messy emails
    • Poorly formatted PDFs
    • Screens with inconsistent labels
  • They can normalize to your schema with high recall and decent precision, especially when:
    • You show 10–20 in-domain examples
    • You validate on a small labeled set

This kills a lot of custom regex, hand-tuned NLP, and brittle parsing scripts.
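To make "normalize to your schema" concrete, here is a minimal sketch of validating a model's extracted fields in code before trusting them. The `InvoiceFields` schema, field names, and `parse_llm_extraction` helper are illustrative assumptions, not any particular system's API; anything that fails validation falls back to human review.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical target schema for invoice extraction; field names are illustrative.
@dataclass
class InvoiceFields:
    vendor: str
    invoice_number: str
    total_cents: int  # normalize money to integer cents, never floats

def parse_llm_extraction(raw: str) -> Optional[InvoiceFields]:
    """Validate a model's JSON output against the schema; reject on any doubt."""
    try:
        data = json.loads(raw)
        total = round(float(data["total"]) * 100)
        if total < 0:
            return None
        return InvoiceFields(
            vendor=str(data["vendor"]).strip(),
            invoice_number=str(data["invoice_number"]).strip(),
            total_cents=total,
        )
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None  # route to human review instead of guessing

# A well-formed model response parses; chatty or malformed output is rejected.
ok = parse_llm_extraction('{"vendor": "Acme Corp", "invoice_number": "INV-42", "total": "129.99"}')
bad = parse_llm_extraction("Sorry, I could not find an invoice here.")
```

The point of the guard is the second case: a model that answers in prose instead of JSON should degrade to the manual path, not crash or silently write garbage.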

2. Tool-using models are becoming standard, not exotic

Models can now reliably:

  • Call out to tools (APIs, databases, internal services) with:
    • Reasonable argument formatting
    • Multi-step plans (simple ones) across tools
  • Stay within a constrained action set:
    • “You may only call these 5 tools”
    • “You may only write to this subset of fields”

This is the real “agent” story: LLMs as decision engines inside well-defined toolboxes, not free-roaming browsers.
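A constrained action set can be enforced in a few lines: the model proposes a tool name and arguments, and your code refuses anything outside the registry. The tool names and stub implementations below are hypothetical placeholders.

```python
# Minimal sketch of a constrained action set: the model may only invoke
# tools registered here. Names and stub bodies are illustrative.
TOOLS = {
    "get_customer": lambda args: {"id": args["id"], "tier": "gold"},  # stub
    "fetch_orders": lambda args: [{"order_id": 1}],                   # stub
}

def execute_model_action(action: dict):
    """Run one model-proposed action, refusing anything outside the allowlist."""
    name = action.get("tool")
    if name not in TOOLS:
        raise PermissionError(f"tool {name!r} is not in the allowed set")
    return TOOLS[name](action.get("args", {}))

# A registered call succeeds; an unlisted tool is rejected, not improvised.
result = execute_model_action({"tool": "get_customer", "args": {"id": "c-7"}})
```

The registry, not the model, defines what exists; hallucinated tool names become explicit errors you can count and alert on.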

3. Orchestration platforms are no longer just RPA++

You don’t have to bolt everything onto screen-scraping robots. Modern stacks give you:

  • Event-driven orchestration (queue messages, webhooks, cron) instead of UI triggers.
  • Native support for:
    • Retries with backoff
    • Idempotency keys
    • Workflow versioning and migration
    • Human-in-the-loop steps

That makes AI automation feel closer to building backend services than “teaching a robot to click.”
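In practice your workflow engine provides these primitives, but two of them, retries with backoff and idempotency keys, are worth understanding at the code level. A minimal sketch (delays shortened for demonstration; a real system would persist seen keys, not hold them in memory):

```python
import time

def with_retries(fn, attempts=3, base_delay=0.01):
    """Retry a flaky step with exponential backoff (delays shortened for demo)."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))

# Idempotency: remember keys already processed so a redelivered message
# doesn't apply the same write twice.
_seen_keys = set()

def idempotent_write(key: str, write_fn) -> str:
    if key in _seen_keys:
        return "skipped"
    _seen_keys.add(key)
    write_fn()
    return "applied"
```

Together these make a step safe to re-run: transient failures get retried, and a retry that already succeeded becomes a no-op instead of a duplicate write.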

How it works (simple mental model)

Drop the “agent” mystique; think in terms of four building blocks:

  1. Triggers (When to start)
    Events that kick off the automation:

    • New ticket created
    • Invoice received
    • Daily batch job
    • Alert from monitoring
  2. LLM Decisions (What to do)
    The model’s job is classification + planning:

    • Classify input into:
      • Category / intent
      • Priority
      • Risk flags
    • Plan a short sequence of tool calls:
      • “Check customer status”
      • “Fetch related orders”
      • “Propose resolution”
      • “Update CRM”
  3. Tools (How to do it)
    Normal code/services with carefully defined contracts:

    • get_customer(id)
    • create_ticket(payload)
    • update_invoice_status(id, status)
    • post_comment(thread_id, text)

    The LLM selects tools and fills arguments; your code executes them and returns results.

  4. Control Plane (Who’s in charge)
    This is not the LLM. It’s your workflow engine:

    • Tracks state
    • Applies guardrails:
      • Max steps
      • Timeouts
      • Permissions
    • Routes to human review when:
      • Confidence < threshold
      • Consistency checks fail
      • Risky actions are requested

Everything interesting happens at the interfaces:

  • Prompt design + tool schemas = how reliably the LLM picks legal actions.
  • Workflow orchestration = how safely those actions are sequenced and monitored.
  • Human review UI = how fast you escalate and learn from edge cases.

If you think of it as: “LLM as a fuzzy adapter between humans and APIs”, you’re in the right mental frame.
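The control plane described above can be sketched as a plain loop that your workflow engine owns, not the LLM. Everything here is illustrative: the `confidence` field, threshold, and step budget are assumptions you would tune per workflow.

```python
def run_workflow(plan_step, execute, max_steps=5, confidence_threshold=0.8):
    """The control plane owns the loop: it enforces a step budget and
    escalates to a human whenever the model is unsure."""
    state = {"history": []}
    for _ in range(max_steps):
        step = plan_step(state)  # one LLM call proposing the next action
        if step is None:
            return {"status": "done", "state": state}
        if step.get("confidence", 0.0) < confidence_threshold:
            return {"status": "needs_human", "state": state, "step": step}
        state["history"].append(execute(step))
    return {"status": "needs_human", "reason": "max_steps", "state": state}
```

Note that every exit path is explicit: the loop can finish, hit its budget, or hand off to a human, but it can never run unbounded.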

Where teams get burned (failure modes + anti-patterns)

1. Confusing “works in demo” with “works in production”

Anti-pattern:

  • Build a slick demo that:
    • Works on cherry-picked tickets/invoices/emails
    • Is measured by “wow” factor in the room

Then you deploy and discover:

  • Edge cases swamp your success rate.
  • Latency is 10–20 seconds per step.
  • Observability is nearly nonexistent.

Mitigation:

  • Start with constrained scopes: one queue, one doc type, one geography.
  • Set a target like: “Automation handles 40% of cases at >98% accuracy on these 3 metrics.”
  • Run in shadow mode for at least 1–2 weeks with real traffic.

2. Treating the LLM as autonomous instead of as a library

Anti-pattern:

  • Letting the model:
    • Freely write SQL
    • Decide which APIs exist
    • Construct arbitrary HTTP calls

This leads to:

  • Security holes
  • Dangerous writes
  • Latent bugs that only appear under rare input combinations

Mitigation:

  • Expose tools as a small, typed API surface.
  • Validate all tool arguments in code:
    • Schema validation
    • Permission checks (“can this actor modify this resource?”)
  • Use allowlists, not denylists.
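Here is what that validation might look like for one write tool. The role model, field names, and allowed statuses are illustrative assumptions; the shape, schema check plus permission check plus allowlist, is the point.

```python
# Sketch of validating model-supplied tool arguments in code before execution.
ALLOWED_STATUSES = {"open", "pending", "closed"}

def validate_update_invoice_status(actor_roles: set, args: dict) -> dict:
    """Schema-check arguments and confirm the actor may perform the write."""
    if "billing" not in actor_roles:
        raise PermissionError("actor may not modify invoices")
    invoice_id = args.get("id")
    status = args.get("status")
    if not isinstance(invoice_id, str) or not invoice_id:
        raise ValueError("id must be a non-empty string")
    if status not in ALLOWED_STATUSES:  # allowlist, not denylist
        raise ValueError(f"status must be one of {sorted(ALLOWED_STATUSES)}")
    return {"id": invoice_id, "status": status}
```

The model never touches the database; it only proposes arguments, and only arguments that survive this gate reach your backend.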

3. Ignoring monitoring and drift

Anti-pattern:

  • No metrics beyond “usage.”
  • No feedback loop:
    • When humans correct the AI, it’s not captured.
  • Model updates are “pull latest version” decisions.

Mitigation:

  • Track:
    • Automation coverage (% of tasks touched)
    • Resolution rate without human changes
    • Rework rate where humans override the AI
    • Latency per step and per workflow
  • Build a simple “feedback capture” flow:
    • Every correction is logged with:
      • Input
      • AI suggestion
      • Human final answer
  • Version your prompts and workflows; don’t hot-swap in prod without A/B or canaries.
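The feedback-capture flow can be as simple as appending one structured record per human correction. The record shape below is an assumption (including the `prompt_version` field, which ties each correction to the prompt that produced it); `StringIO` stands in for a real log sink.

```python
import json
from dataclasses import dataclass, asdict
from io import StringIO

@dataclass
class CorrectionRecord:
    input_text: str
    ai_suggestion: str
    human_final: str
    prompt_version: str  # version everything so drift is attributable

def log_correction(sink, record: CorrectionRecord) -> None:
    """Append one JSON line per human correction; sink is any writable file."""
    sink.write(json.dumps(asdict(record)) + "\n")

buf = StringIO()  # stands in for an append-only log file or event stream
log_correction(buf, CorrectionRecord(
    input_text="Invoice from Acme, urgent",
    ai_suggestion="priority=low",
    human_final="priority=high",
    prompt_version="triage-v3",
))
```

Months later, this JSONL file is your evaluation set and your fine-tuning data, which is why capturing it from day one matters.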

4. Over-relying on UI automation (RPA nostalgia)

Anti-pattern:

  • Using AI agents to drive browser UIs for systems that already have APIs.
  • Building full workflows on top of screen scraping because “we already have RPA.”

Result:

  • Fragility due to UI changes
  • Slow execution
  • Hard-to-debug failure modes

Mitigation:

  • Treat UI automation as last resort, not default.
  • Policy: “If it has an API, we use the API.”
  • If no API:
    • Limit UI automation to the few steps that truly require it.
    • Encapsulate those in a separate, well-monitored component.

5. No explicit risk boundaries

Anti-pattern:

  • Same agent allowed to:
    • Draft emails
    • Issue refunds
    • Change KYC status

Mitigation:

  • Define risk tiers:
    • Tier 0: Read-only, suggestions only
    • Tier 1: Low-impact writes (tags, comments)
    • Tier 2: Reversible operations (ticket routing, non-monetary status changes)
    • Tier 3: Monetary or compliance-sensitive actions (refunds, approvals)
  • Agents should start at Tier 0–1.
  • Tier 2–3 actions must be gated by:
    • Policy rules
    • Human approvals
    • Additional checksums/validations
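The tier gate itself is a few lines of policy code. The tool-to-tier mapping below is illustrative; the invariant is that an unknown tool or an out-of-tier action is denied by default, and Tier 2+ additionally requires explicit human approval.

```python
from enum import IntEnum

class Tier(IntEnum):
    READ_ONLY = 0
    LOW_IMPACT = 1
    REVERSIBLE = 2
    SENSITIVE = 3

# Illustrative mapping of tool names to risk tiers.
TOOL_TIERS = {
    "post_comment": Tier.LOW_IMPACT,
    "route_ticket": Tier.REVERSIBLE,
    "issue_refund": Tier.SENSITIVE,
}

def gate_action(tool: str, agent_max_tier: Tier, human_approved: bool = False) -> bool:
    """Deny by default: unknown tools fail, out-of-tier actions fail,
    and Tier 2+ actions also require a human approval."""
    tier = TOOL_TIERS.get(tool)
    if tier is None or tier > agent_max_tier:
        return False
    if tier >= Tier.REVERSIBLE and not human_approved:
        return False
    return True
```

Raising an agent's tier then becomes a deliberate config change with an audit trail, not an emergent property of a prompt.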

Practical playbook (what to do in the next 7 days)

Assuming you have authority over at least one production workflow.

Day 1–2: Identify one “thin slice” candidate

Look for a process that is:

  • High volume, low to medium risk.
  • Textual and semi-structured.
  • Already instrumented (you can measure throughput/error).

Common good targets:

  • Support ticket triage and enrichment (not final responses).
  • Invoice classification and field extraction.
  • Internal request routing (IT, HR, finance).

Define success in numbers:

  • Example: “Auto-categorize 60% of new tickets with <1% misroute rate.”

Day 3: Map the current workflow

With the team that actually does the work:

  • Write down:
    • Inputs (systems, formats)
    • Outputs (what counts as “done”)
    • Systems touched
    • Decisions made by humans
  • Separate steps into:
    • Mechanical (copy/paste, lookups)
    • Judgment (is this high priority? is this fraud?)

You’ll likely find 60–80% mechanical, 20–40% judgment.

Day 4: Design a minimal AI-assisted flow

Goal this week is assisted mode, not full autonomy.

Example architecture:

  • Trigger: New ticket in queue.
  • LLM call:
    • Summarize the issue
    • Propose category/priority
    • Extract 3–5 key fields
  • Tools:
    • Fetch customer plan/tier
    • Check for known incidents
  • Output:
    • Pre-filled ticket fields
    • Suggested tags
    • Short internal summary

Humans still:

  • Approve or correct fields.
  • Own final routing and response.

This gives you:

  • Immediate productivity gain.
  • High-signal training data from corrections.

Day 5–6: Implement with guardrails

Concretely:

  • Choose a capable model (don’t over-optimize this yet).
  • Define tools with explicit schemas.
  • Implement:
    • Timeouts per LLM call.
    • Max steps if you’re using tool-calling.
    • Logging of inputs/outputs (with redaction where needed).
  • Add a simple UI for:
    • Showing suggestions
    • One-click “accept all / revert to manual”
    • Flagging bad suggestions

Ensure:

  • No write operations bypass human review at this stage.
  • All writes are performed by your backend, not by arbitrary model decisions.
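For the "logging with redaction" item, a minimal sketch is a pass over log payloads that masks obvious PII before anything is written. The two patterns below (emails and long digit runs) are illustrative, not an exhaustive PII policy.

```python
import re

# Minimal redaction before logging: masks email addresses and long digit
# runs (card/account numbers). Patterns are illustrative, not exhaustive.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
DIGITS_RE = re.compile(r"\b\d{9,}\b")

def redact(text: str) -> str:
    text = EMAIL_RE.sub("[EMAIL]", text)
    return DIGITS_RE.sub("[NUMBER]", text)

redacted = redact("Contact jane.doe@example.com, account 1234567890123")
```

Run redaction at the logging boundary so raw inputs never reach your log store, rather than trying to scrub them afterwards.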

Day 7: Ship in shadow mode and instrument

Roll out to a small subset (10–20% of traffic) in shadow mode: the system generates and logs its suggestions while humans continue doing the work. Compare its output against the humans' decisions and instrument the success metrics you defined on Day 1–2.
