Stop Calling It an “Agent” If It’s Just a Cron Job with an LLM


Why this matters this week

Two things are colliding in production environments right now:

  • LLMs are finally good enough (and cheap enough) to sit in the middle of real workflows.
  • The RPA/“screen scraping” band-aids you put in place in 2017 are starting to fail in ways that are expensive, opaque, and hard to fix.

If you run a real business system (contact center, back-office ops, logistics, finance, internal tools), you’re probably hearing:

  • “Let’s build an AI agent to handle this workflow end-to-end.”
  • “We can replace this brittle RPA bot with a copilot that just uses the APIs.”
  • “We’ll just orchestrate multiple tools; the AI will figure it out.”

The risk isn’t that this is impossible. The risk is that you build:

  • A demo that looks magical but fails in production variance.
  • An automation that quietly corrupts data.
  • A system that’s impossible to debug or get past security review.

This post is about the actual mechanics and trade-offs of AI automation in production: agents, copilots, orchestration, and replacing RPA in real businesses, not slideware.


What’s actually changed (not the press release)

Three non-hype shifts matter for people who ship systems:

  1. LLMs are now “good enough glue” between tools

    • You no longer need 100% deterministic rule trees for every branch.
    • LLMs can handle:
      • Parsing semi-structured inputs (emails, PDFs, chats).
      • Mapping them onto internal schemas.
      • Selecting which internal APIs to call in what order.
    • This doesn’t remove the need for structure; it moves it:
      • From “if-else trees in code” to “tool schemas + guardrails + state machines.”
  2. Cost and latency are acceptable for inner-loop automation

    • With modern models + streaming:
      • Sub-500ms “decision steps” are achievable.
      • Per-step cost can be in the low cents or fractions of a cent, depending on the model.
    • That makes AI automation viable inside:
      • Customer support flows.
      • Finance ops (invoice processing, approvals).
      • Supply chain / logistics exception handling.
  3. Tooling for orchestration is emerging (but immature)

    • There are now libraries/platforms for:
      • Tool calling (functions, actions, tools).
      • Long-running workflows (stateful agents).
      • Retrieval-augmented generation (RAG) for context.
    • The real change: you can compose LLM calls, tools, and human approvals without building a workflow engine from scratch.
    • The catch: you can also create distributed, probabilistic spaghetti very quickly.

What did not magically change:

  • Reliability is still not “set and forget.” You need monitoring, drift detection, and rollback.
  • Compliance is still your problem: data egress, PII, audit trails.
  • Integration cost is still high. AI doesn’t eliminate your systems’ weirdness; it just papers over some of it.

How it works (simple mental model)

A workable mental model for AI automation in production:

1. Think “state machine + stochastic steps”, not “autonomous agent”

Forget the marketing diagrams of agents “wandering around” solving problems.

Instead:

  • Define a finite set of states for your workflow.
  • At each state, define:
    • Inputs (structured + unstructured).
    • Permitted actions/tools.
    • Exit conditions to move to the next state (or fail/escalate).
  • Use the LLM to:
    • Interpret unstructured inputs into state-relevant fields.
    • Choose among allowed tools/actions.
    • Generate outputs visible to humans (emails, tickets, explanations).

The LLM is a bounded policy inside a state, not your entire orchestration engine.
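The “state machine + stochastic steps” idea can be sketched in a few lines. This is an illustrative sketch, not a framework: `llm_pick_action` is a stand-in for a real model call, and the state and action names are hypothetical.

```python
# Sketch: the LLM is a bounded policy inside a state. It may only choose
# among that state's allowed actions; anything else becomes an escalation.
from dataclasses import dataclass

@dataclass
class State:
    name: str
    allowed_actions: set[str]      # actions the LLM may pick in this state
    next_state: dict[str, str]     # action -> name of the next state

def llm_pick_action(state: State, context: dict) -> str:
    """Stand-in for an LLM call. In production this would prompt a model
    and parse a structured answer."""
    return "escalate" if context.get("unclear") else "proceed"

def step(state: State, context: dict) -> str:
    action = llm_pick_action(state, context)
    if action not in state.allowed_actions:
        action = "escalate"        # never let the model leave the contract
    return state.next_state[action]

intake = State("Intake", {"proceed", "escalate"},
               {"proceed": "EligibilityCheck", "escalate": "HumanReview"})
```

Note that the orchestration itself (which states exist, which transitions are legal) stays deterministic; only the choice within a state is stochastic.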

2. Tools are the real “API surface” of your business logic

Instead of letting the LLM “do anything,” you expose:

  • Deterministic tools:
    • create_ticket, update_order_status, fetch_invoice, calculate_refund.
  • Idempotent reads whenever possible:
    • Separate “read” and “write” tools.
    • Avoid tools that both query and mutate in one call.
  • Validation tools:
    • validate_address, check_policy_compliance, check_budget_limit.

The agent doesn’t “understand your business”; it just sequences tools you define.
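One minimal way to make that concrete is a tool registry that separates reads from writes, so the agent can only sequence what you explicitly expose. The tool bodies below are stubs; real implementations would call your internal systems.

```python
# A tiny tool registry: reads and writes live in separate tables,
# and anything not registered simply cannot be called.
READ_TOOLS: dict = {}
WRITE_TOOLS: dict = {}

def read_tool(fn):
    READ_TOOLS[fn.__name__] = fn
    return fn

def write_tool(fn):
    WRITE_TOOLS[fn.__name__] = fn
    return fn

@read_tool
def fetch_invoice(invoice_id: str) -> dict:
    return {"id": invoice_id, "amount": 120.0}   # stubbed lookup

@write_tool
def update_order_status(order_id: str, status: str) -> dict:
    return {"order_id": order_id, "status": status}  # stubbed mutation

def call(tool_name: str, **kwargs):
    tool = READ_TOOLS.get(tool_name) or WRITE_TOOLS.get(tool_name)
    if tool is None:
        raise ValueError(f"tool not exposed: {tool_name}")
    return tool(**kwargs)
```

Keeping reads and writes in separate tables also makes it easy to apply stricter policies (approvals, rate limits) to mutating tools only.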

3. Guardrails = contracts, not vibes

Guardrails that actually work look like:

  • JSON schemas for tool inputs/outputs.
  • Hard business constraints in the tool implementations:
    • “Refund cannot exceed last 90 days of charges.”
    • “Cannot change account owner without verified identity token.”
  • Policy checks outside the LLM:
    • Approvals for high-risk steps.
    • Rate limits and anomaly detection on certain tools.

Prompting is not your primary safety mechanism. Code and contracts are.
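A guardrail-as-contract looks like the refund limit above living in the tool implementation itself, where no prompt can override it. The field names and the 90-day window are taken from the example constraint; everything else is an illustrative assumption.

```python
# The "refund cannot exceed last 90 days of charges" rule enforced in code.
# No matter what the model proposes, the tool rejects out-of-policy amounts.
from datetime import date, timedelta

def create_refund(amount: float, charges: list[dict]) -> dict:
    cutoff = date.today() - timedelta(days=90)
    recent_total = sum(c["amount"] for c in charges if c["date"] >= cutoff)
    if amount > recent_total:
        raise ValueError("refund exceeds last 90 days of charges")
    return {"refunded": amount}
```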

4. Humans remain part of the workflow

Turn “autonomy” into a tunable spectrum:

  • Low-risk, low-value actions → fully automated.
  • Medium-risk → “AI drafts, human confirms”.
  • High-risk → “AI preps context, human decides”.

You change the automation level per state based on observed performance and business risk, not on a single “on/off” switch.
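The spectrum above is easy to express as per-state data rather than a global switch. State names and level labels here are illustrative.

```python
# Autonomy as per-state configuration, tuned over time based on
# observed performance and business risk.
AUTONOMY = {
    "Intake": "auto",              # low-risk: fully automated
    "Proposal": "draft",           # medium-risk: AI drafts, human confirms
    "Execution": "human_decides",  # high-risk: AI preps, human decides
}

def needs_human(state: str) -> bool:
    # Unknown states default to the most conservative level.
    return AUTONOMY.get(state, "human_decides") != "auto"
```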


Where teams get burned (failure modes + anti-patterns)

1. Treating LLM steps as if they were deterministic

Symptoms:

  • No explicit retry logic because “it worked in staging.”
  • Downstream code assumes fields will always be present and correctly formatted.
  • Silent failures: bad outputs that look plausible.

Mitigations:

  • Require schemas + validation on all LLM outputs.
  • Log both prompts and structured outputs for sampling.
  • Add statistical checks: volume, distribution shifts, anomaly alerts.
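The first two mitigations can be combined into a small validate-and-retry wrapper around every LLM call. The required fields are hypothetical; the pattern is what matters: never let an unvalidated output flow downstream.

```python
# Validate every LLM output against a schema and retry on failure,
# rather than trusting a plausible-looking string.
import json

REQUIRED = {"intent": str, "confidence": float}

def validate(raw: str) -> dict:
    data = json.loads(raw)
    for field, typ in REQUIRED.items():
        if not isinstance(data.get(field), typ):
            raise ValueError(f"bad field: {field}")
    return data

def call_with_retry(llm, retries: int = 2) -> dict:
    last_err = None
    for _ in range(retries + 1):
        try:
            return validate(llm())   # llm() is any callable returning a string
        except (ValueError, json.JSONDecodeError) as e:
            last_err = e
    raise RuntimeError("LLM output never validated") from last_err
```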

2. “One mega-agent” trying to do everything

Pattern:

  • Single agent with 20+ tools.
  • Ambiguous responsibilities (“it handles support, billing, and account changes”).
  • Debugging is impossible because behavior depends on sprawling context.

Mitigations:

  • Decompose into narrow agents or states:
    • “Refund flow agent”
    • “Address change agent”
    • “Invoice classification agent”
  • Have a simple, mostly deterministic router up front:
    • Could be rules + an intent classifier LLM call.
    • Router outputs: which flow to enter, not what exact steps to take.
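A "rules first, classifier fallback" router can be this simple. `classify_intent` is a stand-in for the small intent-classifier LLM call; the keywords and flow names are illustrative.

```python
# A mostly deterministic router: cheap rules handle the obvious cases,
# a classifier handles the rest. It picks a flow, never individual steps.
def classify_intent(text: str) -> str:
    """Stand-in for a small-model intent classifier call."""
    return "general_support_flow"

RULES = [
    ("refund", "refund_flow"),
    ("address", "address_change_flow"),
    ("invoice", "invoice_classification_flow"),
]

def route(text: str) -> str:
    lowered = text.lower()
    for keyword, flow in RULES:
        if keyword in lowered:
            return flow
    return classify_intent(text)
```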

3. Rebuilding RPA with LLMs instead of using APIs

Teams often:

  • Keep screen-scraping legacy UIs.
  • Use LLMs to “understand” pages and click buttons via a browser automation bot.

This is marginally less brittle than classic RPA but still fragile and opaque.

Better options (if you can):

  • Wrap internal systems with stable, minimal APIs.
  • Let the LLM call those APIs, not click screens.
  • Where APIs don’t exist and can’t be built, limit these flows to low-risk cases and keep humans in the loop.

4. Ignoring organizational constraints

Automation fails not only technically but socially:

  • Finance refuses to sign off because there’s no clear audit trail.
  • Security blocks rollout due to uncontrolled data flows.
  • Ops teams don’t trust black-box changes to critical systems.

Mitigations:

  • Design for auditability: log which tools were called, with what params, and why.
  • Maintain a policy document: what the agent is and is not allowed to do.
  • Involve security/compliance early; treat the agent as a privileged service, not a toy.

5. Over-indexing on “top model, big context window”

Swapping models and maxing context is not a strategy.

Risks:

  • Unpredictable cost growth.
  • Latency spikes.
  • Having to re-validate everything after model provider changes behavior.

Mitigations:

  • Use tiered model selection:
    • Small/cheap models for classification, routing, extraction.
    • Larger models for complex reasoning or high-value actions.
  • Minimize context via:
    • Retrieval (RAG) with tight filters.
    • Pre-computed summaries / embeddings.
  • Design your automation so you can change models without rewriting flows.
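Tiered model selection is easiest to keep swappable as a lookup table. The model names below are placeholders, not recommendations.

```python
# Model tiers as data: changing providers or models is a config edit,
# not a rewrite of the flows that use them.
MODEL_TIERS = {
    "classify": "small-cheap-model",
    "extract": "small-cheap-model",
    "route": "small-cheap-model",
    "reason": "large-model",
}

def model_for(task: str) -> str:
    # Unknown task types default to the more capable (and costly) tier.
    return MODEL_TIERS.get(task, "large-model")
```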

Practical playbook (what to do in the next 7 days)

Assuming you’re a tech lead or CTO with at least one candidate workflow:

Day 1–2: Pick a narrow, painful workflow

Look for:

  • High manual repetition.
  • Clear inputs/outputs.
  • Tolerable risk if errors happen (e.g., drafts, suggestions, internal tasks).

Examples:

  • Classifying and routing inbound emails/tickets.
  • Drafting responses for common support categories.
  • Extracting and validating fields from invoices or contracts.
  • Preparing refund recommendations within strict bounds.

Define:

  • Current process steps.
  • Failure modes and their cost.
  • “Safe maximum autonomy” for the first version.

Day 2–3: Define the state machine and tools

  1. Write a simple state diagram (can be a text doc):

    Example: Support refund flow

    • State: Intake → extract ticket data + classify intent.
    • State: EligibilityCheck → confirm user, purchase, timeframe.
    • State: Proposal → decide refund amount within business rules.
    • State: Approval → auto-approve low amounts, send higher ones for review.
    • State: Execution → call refund API, notify user, log action.
  2. For each state, define tools:

    • Read-only: get_customer_history, get_purchase_details.
    • Mutating: create_refund, send_email_template.
    • Validation: check_refund_policy, check_max_refund.
  3. Define what LLM decides in each state:

    • Maps free-text → structured fields.
    • Chooses which tool(s) to call, with what parameters.
    • Generates human-readable explanations.
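The state diagram and tool lists above translate directly into data. This sketch mirrors the refund flow's states and example tool names; the linear transitions are a simplification (a real flow would also have failure/escalation edges).

```python
# The refund flow as data: each state lists its tools and its successor.
REFUND_FLOW = {
    "Intake":           {"tools": ["get_customer_history"],              "next": "EligibilityCheck"},
    "EligibilityCheck": {"tools": ["get_purchase_details"],              "next": "Proposal"},
    "Proposal":         {"tools": ["check_refund_policy", "check_max_refund"], "next": "Approval"},
    "Approval":         {"tools": [],                                    "next": "Execution"},
    "Execution":        {"tools": ["create_refund", "send_email_template"], "next": None},
}

def path(start: str = "Intake") -> list[str]:
    states, cur = [], start
    while cur is not None:
        states.append(cur)
        cur = REFUND_FLOW[cur]["next"]
    return states
```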

Day 3–4: Build a thin vertical slice

  • Hard-code one or two main paths (e.g., common refund scenarios).
  • Implement:
    • Tool layer with strong validation.
    • LLM prompts for each state, with strict expected JSON outputs.
    • Logging of every step (prompts, tool calls, outputs, decisions).
  • Run synthetic test cases:
    • “Happy path”
    • Known edge cases
    • Adversarial-ish inputs (empty fields, conflicting info, partial IDs).
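The "log every step" requirement is worth building as structured records from day one, so failed runs can be replayed and sampled. The record fields here are a minimal suggestion.

```python
# Structured step logging: every state transition records the prompt,
# tool calls, and output so runs are auditable and replayable.
import time

LOG: list[dict] = []

def log_step(state: str, prompt: str, tool_calls: list, output: dict) -> None:
    LOG.append({
        "ts": time.time(),
        "state": state,
        "prompt": prompt,
        "tool_calls": tool_calls,
        "output": output,
    })
```

In production you would write these records to durable storage rather than an in-memory list.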

Day 4–5: Put a human in the loop

Deploy in a shadow mode or “AI suggestion only” mode:

  • The agent proposes:
    • Classification.
    • Refund amount or action.
    • Outgoing messages.
  • Humans:
    • Approve, edit, or reject.
  • Capture:
    • Acceptance rate.
    • Types of corrections.
    • Systematic failures.

Aim for 20–50 real cases before changing the autonomy level.
