Stop Calling It an “Agent” If It’s Just a Cron Job with an LLM


Why this matters this week

Two things are colliding in production environments right now:

  • LLMs are finally good enough (and cheap enough) to sit in the middle of real workflows.
  • The RPA/“screen scraping” band-aids you put in place in 2017 are starting to fail in ways that are expensive, opaque, and hard to fix.

If you run a real business system (contact center, back-office ops, logistics, finance, internal tools), you’re probably hearing:

  • “Let’s build an AI agent to handle this workflow end-to-end.”
  • “We can replace this brittle RPA bot with a copilot that just uses the APIs.”
  • “We’ll just orchestrate multiple tools; the AI will figure it out.”

The risk isn’t that this is impossible. The risk is that you build:

  • A demo that looks magical but fails in production variance.
  • An automation that quietly corrupts data.
  • A system that’s impossible to debug or get past security review.

This post is about the actual mechanics and trade-offs of AI automation in production: agents, copilots, orchestration, and replacing RPA in real businesses, not slideware.


What’s actually changed (not the press release)

Three non-hype shifts matter for people who ship systems:

  1. LLMs are now “good enough glue” between tools

    • You no longer need 100% deterministic rule trees for every branch.
    • LLMs can handle:
      • Parsing semi-structured inputs (emails, PDFs, chats).
      • Mapping them onto internal schemas.
      • Selecting which internal APIs to call in what order.
    • This doesn’t remove the need for structure; it moves it:
      • From “if-else trees in code” to “tool schemas + guardrails + state machines.”
  2. Cost and latency are acceptable for inner-loop automation

    • With modern models + streaming:
      • Sub-500ms “decision steps” are achievable.
      • Per-step cost can be in the low cents or fractions of a cent, depending on the model.
    • That makes AI automation viable inside:
      • Customer support flows.
      • Finance ops (invoice processing, approvals).
      • Supply chain / logistics exception handling.
  3. Tooling for orchestration is emerging (but immature)

    • There are now libraries/platforms for:
      • Tool calling (functions, actions, tools).
      • Long-running workflows (stateful agents).
      • Retrieval-augmented generation (RAG) for context.
    • The real change: you can compose LLM calls, tools, and human approvals without building a workflow engine from scratch.
    • The catch: you can also create distributed, probabilistic spaghetti very quickly.

What did not magically change:

  • Reliability is still not “set and forget.” You need monitoring, drift detection, and rollback.
  • Compliance is still your problem: data egress, PII, audit trails.
  • Integration cost is still high. AI doesn’t eliminate your systems’ weirdness; it just papers over some of it.

How it works (simple mental model)

A workable mental model for AI automation in production:

1. Think “state machine + stochastic steps”, not “autonomous agent”

Forget the marketing diagrams of agents “wandering around” solving problems.

Instead:

  • Define a finite set of states for your workflow.
  • At each state, define:
    • Inputs (structured + unstructured).
    • Permitted actions/tools.
    • Exit conditions to move to the next state (or fail/escalate).
  • Use the LLM to:
    • Interpret unstructured inputs into state-relevant fields.
    • Choose among allowed tools/actions.
    • Generate outputs visible to humans (emails, tickets, explanations).

The LLM is a bounded policy inside a state, not your entire orchestration engine.
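The “state machine + stochastic steps” idea can be sketched in a few lines. This is an illustrative sketch, not a framework: `llm_pick_action` is a stand-in for a real model call, and the state and action names are hypothetical.

```python
# Sketch: the LLM is a bounded policy inside a state. It may only choose
# among that state's allowed actions; anything else becomes an escalation.
from dataclasses import dataclass

@dataclass
class State:
    name: str
    allowed_actions: set[str]      # actions the LLM may pick in this state
    next_state: dict[str, str]     # action -> name of the next state

def llm_pick_action(state: State, context: dict) -> str:
    """Stand-in for an LLM call. In production this would prompt a model
    and parse a structured answer."""
    return "escalate" if context.get("unclear") else "proceed"

def step(state: State, context: dict) -> str:
    action = llm_pick_action(state, context)
    if action not in state.allowed_actions:
        action = "escalate"        # never let the model leave the contract
    return state.next_state[action]

intake = State("Intake", {"proceed", "escalate"},
               {"proceed": "EligibilityCheck", "escalate": "HumanReview"})
```

Note that the orchestration itself (which states exist, which transitions are legal) stays deterministic; only the choice within a state is stochastic.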

2. Tools are the real “API surface” of your business logic

Instead of letting the LLM “do anything,” you expose:

  • Deterministic tools:
    • create_ticket, update_order_status, fetch_invoice, calculate_refund.
  • Idempotent reads whenever possible:
    • Separate “read” and “write” tools.
    • Avoid tools that both query and mutate in one call.
  • Validation tools:
    • validate_address, check_policy_compliance, check_budget_limit.

The agent doesn’t “understand your business”; it just sequences tools you define.
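One minimal way to make that concrete is a tool registry that separates reads from writes, so the agent can only sequence what you explicitly expose. The tool bodies below are stubs; real implementations would call your internal systems.

```python
# A tiny tool registry: reads and writes live in separate tables,
# and anything not registered simply cannot be called.
READ_TOOLS: dict = {}
WRITE_TOOLS: dict = {}

def read_tool(fn):
    READ_TOOLS[fn.__name__] = fn
    return fn

def write_tool(fn):
    WRITE_TOOLS[fn.__name__] = fn
    return fn

@read_tool
def fetch_invoice(invoice_id: str) -> dict:
    return {"id": invoice_id, "amount": 120.0}   # stubbed lookup

@write_tool
def update_order_status(order_id: str, status: str) -> dict:
    return {"order_id": order_id, "status": status}  # stubbed mutation

def call(tool_name: str, **kwargs):
    tool = READ_TOOLS.get(tool_name) or WRITE_TOOLS.get(tool_name)
    if tool is None:
        raise ValueError(f"tool not exposed: {tool_name}")
    return tool(**kwargs)
```

Keeping reads and writes in separate tables also makes it easy to apply stricter policies (approvals, rate limits) to mutating tools only.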

3. Guardrails = contracts, not vibes

Guardrails that actually work look like:

  • JSON schemas for tool inputs/outputs.
  • Hard business constraints in the tool implementations:
    • “Refund cannot exceed last 90 days of charges.”
    • “Cannot change account owner without verified identity token.”
  • Policy checks outside the LLM:
    • Approvals for high-risk steps.
    • Rate limits and anomaly detection on certain tools.

Prompting is not your primary safety mechanism. Code and contracts are.
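A guardrail-as-contract looks like the refund limit above living in the tool implementation itself, where no prompt can override it. The field names and the 90-day window are taken from the example constraint; everything else is an illustrative assumption.

```python
# The "refund cannot exceed last 90 days of charges" rule enforced in code.
# No matter what the model proposes, the tool rejects out-of-policy amounts.
from datetime import date, timedelta

def create_refund(amount: float, charges: list[dict]) -> dict:
    cutoff = date.today() - timedelta(days=90)
    recent_total = sum(c["amount"] for c in charges if c["date"] >= cutoff)
    if amount > recent_total:
        raise ValueError("refund exceeds last 90 days of charges")
    return {"refunded": amount}
```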

4. Humans remain part of the workflow

Turn “autonomy” into a tunable spectrum:

  • Low-risk, low-value actions → fully automated.
  • Medium-risk → “AI drafts, human confirms”.
  • High-risk → “AI preps context, human decides”.

You change the automation level per state based on observed performance and business risk, not on a single “on/off” switch.
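The spectrum above is easy to express as per-state data rather than a global switch. State names and level labels here are illustrative.

```python
# Autonomy as per-state configuration, tuned over time based on
# observed performance and business risk.
AUTONOMY = {
    "Intake": "auto",              # low-risk: fully automated
    "Proposal": "draft",           # medium-risk: AI drafts, human confirms
    "Execution": "human_decides",  # high-risk: AI preps, human decides
}

def needs_human(state: str) -> bool:
    # Unknown states default to the most conservative level.
    return AUTONOMY.get(state, "human_decides") != "auto"
```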


Where teams get burned (failure modes + anti-patterns)

1. Treating LLM steps as if they were deterministic

Symptoms:

  • No explicit retry logic because “it worked in staging.”
  • Downstream code assumes fields will always be present and correctly formatted.
  • Silent failures: bad outputs that look plausible.

Mitigations:

  • Require schemas + validation on all LLM outputs.
  • Log both prompts and structured outputs for sampling.
  • Add statistical checks: volume, distribution shifts, anomaly alerts.
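The first two mitigations can be combined into a small validate-and-retry wrapper around every LLM call. The required fields are hypothetical; the pattern is what matters: never let an unvalidated output flow downstream.

```python
# Validate every LLM output against a schema and retry on failure,
# rather than trusting a plausible-looking string.
import json

REQUIRED = {"intent": str, "confidence": float}

def validate(raw: str) -> dict:
    data = json.loads(raw)
    for field, typ in REQUIRED.items():
        if not isinstance(data.get(field), typ):
            raise ValueError(f"bad field: {field}")
    return data

def call_with_retry(llm, retries: int = 2) -> dict:
    last_err = None
    for _ in range(retries + 1):
        try:
            return validate(llm())   # llm() is any callable returning a string
        except (ValueError, json.JSONDecodeError) as e:
            last_err = e
    raise RuntimeError("LLM output never validated") from last_err
```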

2. “One mega-agent” trying to do everything

Pattern:

  • Single agent with 20+ tools.
  • Ambiguous responsibilities (“it handles support, billing, and account changes”).
  • Debugging is impossible because behavior depends on sprawling context.

Mitigations:

  • Decompose into narrow agents or states:
    • “Refund flow agent”
    • “Address change agent”
    • “Invoice classification agent”
  • Have a simple, mostly deterministic router up front:
    • Could be rules + an intent classifier LLM call.
    • Router outputs: which flow to enter, not what exact steps to take.
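A "rules first, classifier fallback" router can be this simple. `classify_intent` is a stand-in for the small intent-classifier LLM call; the keywords and flow names are illustrative.

```python
# A mostly deterministic router: cheap rules handle the obvious cases,
# a classifier handles the rest. It picks a flow, never individual steps.
def classify_intent(text: str) -> str:
    """Stand-in for a small-model intent classifier call."""
    return "general_support_flow"

RULES = [
    ("refund", "refund_flow"),
    ("address", "address_change_flow"),
    ("invoice", "invoice_classification_flow"),
]

def route(text: str) -> str:
    lowered = text.lower()
    for keyword, flow in RULES:
        if keyword in lowered:
            return flow
    return classify_intent(text)
```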

3. Rebuilding RPA with LLMs instead of using APIs

Teams often:

  • Keep screen-scraping legacy UIs.
  • Use LLMs to “understand” pages and click buttons via a browser automation bot.

This is marginally less brittle than classic RPA but still fragile and opaque.

Better options (if you can):

  • Wrap internal systems with stable, minimal APIs.
  • Let the LLM call those APIs, not click screens.
  • Where APIs don’t exist and can’t be built, limit these flows to low-risk cases and keep humans in the loop.

4. Ignoring organizational constraints

Automation fails not only technically but socially:

  • Finance refuses to sign off because there’s no clear audit trail.
  • Security blocks rollout due to uncontrolled data flows.
  • Ops teams don’t trust black-box changes to critical systems.

Mitigations:

  • Design for auditability: log which tools were called, with what params, and why.
  • Maintain a policy document: what the agent is and is not allowed to do.
  • Involve security/compliance early; treat the agent as a privileged service, not a toy.

5. Over-indexing on “top model, big context window”

Swapping models and maxing context is not a strategy.

Risks:

  • Unpredictable cost growth.
  • Latency spikes.
  • Having to re-validate everything after model provider changes behavior.

Mitigations:

  • Use tiered model selection:
    • Small/cheap models for classification, routing, extraction.
    • Larger models for complex reasoning or high-value actions.
  • Minimize context via:
    • Retrieval (RAG) with tight filters.
    • Pre-computed summaries / embeddings.
  • Design your automation so you can change models without rewriting flows.
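Tiered model selection is easiest to keep swappable as a lookup table. The model names below are placeholders, not recommendations.

```python
# Model tiers as data: changing providers or models is a config edit,
# not a rewrite of the flows that use them.
MODEL_TIERS = {
    "classify": "small-cheap-model",
    "extract": "small-cheap-model",
    "route": "small-cheap-model",
    "reason": "large-model",
}

def model_for(task: str) -> str:
    # Unknown task types default to the more capable (and costly) tier.
    return MODEL_TIERS.get(task, "large-model")
```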

Practical playbook (what to do in the next 7 days)

Assuming you’re a tech lead or CTO with at least one candidate workflow:

Day 1–2: Pick a narrow, painful workflow

Look for:

  • High manual repetition.
  • Clear inputs/outputs.
  • Tolerable risk if errors happen (e.g., drafts, suggestions, internal tasks).

Examples:

  • Classifying and routing inbound emails/tickets.
  • Drafting responses for common support categories.
  • Extracting and validating fields from invoices or contracts.
  • Preparing refund recommendations within strict bounds.

Define:

  • Current process steps.
  • Failure modes and their cost.
  • “Safe maximum autonomy” for the first version.

Day 2–3: Define the state machine and tools

  1. Write a simple state diagram (can be a text doc):

    Example: Support refund flow

    • State: Intake → extract ticket data + classify intent.
    • State: EligibilityCheck → confirm user, purchase, timeframe.
    • State: Proposal → decide refund amount within business rules.
    • State: Approval → auto-approve low amounts, send higher ones for review.
    • State: Execution → call refund API, notify user, log action.
  2. For each state, define tools:

    • Read-only: get_customer_history, get_purchase_details.
    • Mutating: create_refund, send_email_template.
    • Validation: check_refund_policy, check_max_refund.
  3. Define what LLM decides in each state:

    • Maps free-text → structured fields.
    • Chooses which tool(s) to call, with what parameters.
    • Generates human-readable explanations.
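The state diagram and tool lists above translate directly into data. This sketch mirrors the refund flow's states and example tool names; the linear transitions are a simplification (a real flow would also have failure/escalation edges).

```python
# The refund flow as data: each state lists its tools and its successor.
REFUND_FLOW = {
    "Intake":           {"tools": ["get_customer_history"],              "next": "EligibilityCheck"},
    "EligibilityCheck": {"tools": ["get_purchase_details"],              "next": "Proposal"},
    "Proposal":         {"tools": ["check_refund_policy", "check_max_refund"], "next": "Approval"},
    "Approval":         {"tools": [],                                    "next": "Execution"},
    "Execution":        {"tools": ["create_refund", "send_email_template"], "next": None},
}

def path(start: str = "Intake") -> list[str]:
    states, cur = [], start
    while cur is not None:
        states.append(cur)
        cur = REFUND_FLOW[cur]["next"]
    return states
```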

Day 3–4: Build a thin vertical slice

  • Hard-code one or two main paths (e.g., common refund scenarios).
  • Implement:
    • Tool layer with strong validation.
    • LLM prompts for each state, with strict expected JSON outputs.
    • Logging of every step (prompts, tool calls, outputs, decisions).
  • Run synthetic test cases:
    • “Happy path”
    • Known edge cases
    • Adversarial-ish inputs (empty fields, conflicting info, partial IDs).
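The "log every step" requirement is worth building as structured records from day one, so failed runs can be replayed and sampled. The record fields here are a minimal suggestion.

```python
# Structured step logging: every state transition records the prompt,
# tool calls, and output so runs are auditable and replayable.
import time

LOG: list[dict] = []

def log_step(state: str, prompt: str, tool_calls: list, output: dict) -> None:
    LOG.append({
        "ts": time.time(),
        "state": state,
        "prompt": prompt,
        "tool_calls": tool_calls,
        "output": output,
    })
```

In production you would write these records to durable storage rather than an in-memory list.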

Day 4–5: Put a human in the loop

Deploy in a shadow mode or “AI suggestion only” mode:

  • The agent proposes:
    • Classification.
    • Refund amount or action.
    • Outgoing messages.
  • Humans:
    • Approve, edit, or reject.
  • Capture:
    • Acceptance rate.
    • Types of corrections.
    • Systematic failures.

Aim for 20–50 real cases before changing the autonomy level.
