Stop Calling It an “Agent” If It’s Just a Cron Job with an LLM

Why this matters this week
The AI automation conversation has shifted from “can we?” to “should we ship this to production?”
In the past two weeks, I’ve seen three patterns repeat:
- A fintech team ripped out 60% of their RPA flows and replaced them with an LLM-driven workflow engine, cutting their “script babysitting” time by half.
- A SaaS vendor quietly rolled out an internal “copilot for ops” that now handles ~25% of their Tier-1 support tickets end-to-end (including refunds), with measured guardrails and human review.
- A manufacturing company put a multi-step “agent” in front of a legacy ERP to handle order changes. It worked great in staging, then flooded the warehouse with conflicting instructions when a schema changed in production.
Same technology family, wildly different outcomes.
This week’s reality:
AI automation (agents, copilots, orchestrated workflows) is mature enough to replace brittle RPA and rule engines for some classes of work — but only if you treat it like a distributed, partially-stochastic system, not a clever macro.
If you’re responsible for reliability, cost, or compliance, you need a clear mental model of:
- What actually changed technically in the last 6–12 months
- Where these systems fail in production
- What you can ship this week without betting the company
What’s actually changed (not the press release)
Three concrete shifts make AI automation materially different from “RPA + chatbots”:
1. LLMs are now competent state transformers, not just text generators
- Modern models can:
  - Read semi-structured inputs (emails, PDFs, logs)
  - Normalize them into a structured schema (JSON payloads)
  - Decide which tools to call in what order
- This turns a pile of fragile regex/if-else/RPA logic into a single question: “Given this state + tools, propose the next action and its arguments.”
- Evidence: you can now reliably ask a model to:
  - Parse a gnarly invoice into a typed schema
  - Choose the correct internal API to call
  - Handle edge cases reasonably, if you constrain the space
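That “typed schema” step can be sketched in a few lines. The `parse_invoice` helper and the stubbed model reply below are illustrative (no specific vendor API is assumed); the point is that the model’s text gets parsed into a typed object and rejected loudly if it doesn’t fit:

```python
import json
from dataclasses import dataclass

@dataclass
class InvoiceData:
    invoice_id: str
    vendor: str
    total_cents: int

def parse_invoice(raw_model_output: str) -> InvoiceData:
    """Parse the model's JSON reply into a typed object, failing loudly on bad shapes."""
    payload = json.loads(raw_model_output)
    # Explicit field access raises KeyError/ValueError instead of trusting the model.
    return InvoiceData(
        invoice_id=str(payload["invoice_id"]),
        vendor=str(payload["vendor"]),
        total_cents=int(payload["total_cents"]),
    )

# In production this string comes from the LLM; here it is a stub.
model_reply = '{"invoice_id": "INV-42", "vendor": "Acme GmbH", "total_cents": 12999}'
invoice = parse_invoice(model_reply)
```

Everything downstream then works with `InvoiceData`, never with raw model text.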
2. Tooling + orchestration frameworks stopped being toys
- We now have:
  - Function calling / tool calling as a first-class API pattern
  - Workflow engines that treat LLM calls as nodes with retries, timeouts, and circuit breakers
  - Vector search + retrieval that’s good enough for many knowledge tasks
- The net effect: you can build deterministic scaffolding with probabilistic decisions inside, instead of all-or-nothing stochastic flows.
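A minimal sketch of the “LLM call as a workflow node” idea: bounded retries around a flaky call, then an explicit failure. The `flaky_llm_call` stub stands in for a real client call; a real implementation would catch the client library’s specific exceptions rather than `Exception`:

```python
import time

def call_with_retries(fn, max_attempts=3, backoff_s=0.0):
    """Treat a probabilistic call as a workflow node: bounded retries, then explicit failure."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:  # in real code, catch the client's specific error types
            last_error = exc
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff between attempts
    raise RuntimeError(f"node failed after {max_attempts} attempts") from last_error

# Demo with a flaky stand-in for an LLM call: fails twice, then succeeds.
attempts = {"n": 0}
def flaky_llm_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated timeout")
    return {"action": "lookup_account"}

result = call_with_retries(flaky_llm_call)
```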
3. Costs and latency dropped to “operationally tolerable” for many tasks
- Running “AI as glue” between systems is now:
  - Cents, not dollars, per multi-step flow in many cases
  - 1–5 seconds end-to-end for moderately complex workflows
- That’s still too slow/expensive for high-frequency, low-value events, but fine for:
  - Support tickets
  - Back-office ops
  - Partner integrations
  - Exception handling previously done manually
What hasn’t changed:
- No guarantee of correctness or consistency across calls
- No first-class transactional semantics (no atomic “all or nothing” across tools)
- No free interpretability — your “business logic” is partly inside a model you can’t inspect
Plan accordingly.
How it works (simple mental model)
Drop the “agent” buzzword. Treat AI automation as a workflow engine with:
1. Deterministic skeleton (state machine / DAG)
2. Probabilistic decision points (LLM calls)
3. Side-effecting tools (APIs, scripts, RPA leftovers)
A simple mental model:
1. State
- You maintain an explicit workflow state object, e.g.:

```json
{
  "ticket_id": "123",
  "customer_message": "...",
  "parsed_intent": "...",
  "account_status": "active",
  "proposed_actions": [],
  "audit_log": []
}
```
2. Policy / Guardrails
- You define hard constraints outside the model, e.g.:
  - Max refund without human review = $100
  - Never delete data without dual control
  - Only call tools from an allowlist
3. LLM as decision function
- At specific points, you call the model with:
  - Current state
  - Available tools + schemas
  - Business constraints
- You ask for:
  - Next action: which tool to call
  - Arguments: structured JSON
  - Rationale (optional, mostly for debugging)
4. Tool execution + observation
- You execute the selected tool deterministically.
- You capture:
  - Success/failure
  - Response payload
- You append this to the state + audit log.
5. Loop or exit
- You terminate when:
  - A goal condition is met (ticket closed, order updated, incident escalated)
  - A safety constraint triggers (too many steps, cost cap, policy violation)
  - You hit an explicit “hand off to human” path
This is effectively an orchestrated agent:
- The orchestrator controls:
- When the LLM is called
- Maximum number of steps
- What tools are allowed
- The LLM controls:
- Which allowed tool to use next
- How to map unstructured input to structured actions
Key implication: reliability doesn’t come from the model; it comes from the orchestration, constraints, and observability around it.
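That split fits in a page of framework-free code. Everything in this sketch is illustrative (the tool names, the scripted `decide` callable standing in for an LLM, and the state fields), but it shows where the hard constraints live: in the loop, not in the model.

```python
def run_workflow(state, decide, tools, max_steps=5):
    """Deterministic skeleton: the orchestrator owns the loop, step cap, and tool allowlist.
    The model (any callable passed as `decide`) only picks the next allowed tool + args."""
    for _ in range(max_steps):
        action = decide(state)                      # probabilistic decision point
        if action["tool"] == "done":
            state["status"] = "completed"
            return state
        if action["tool"] not in tools:             # hard constraint outside the model
            state["status"] = "escalated"
            return state
        result = tools[action["tool"]](state, action.get("args", {}))
        state["audit_log"].append({"tool": action["tool"], "result": result})
    state["status"] = "escalated"                   # step cap hit: hand off to a human
    return state

# Illustrative wiring: one stub tool, and a scripted stand-in for the LLM.
tools = {"fetch_account": lambda state, args: {"account_status": "active"}}
script = iter([{"tool": "fetch_account"}, {"tool": "done"}])
final = run_workflow({"ticket_id": "123", "audit_log": []}, lambda s: next(script), tools)
```

Note that a hallucinated tool name or an exhausted step budget both land in the same safe place: escalation, with a full audit log attached.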
Where teams get burned (failure modes + anti-patterns)
A few recurring failure modes in production AI automation:
1. “Invisible business logic inside the prompt”
Symptoms:
- A single 300-line prompt encodes your refund, escalation, and fraud rules.
- Product asks for a small policy change. You tweak the prompt. Something unrelated breaks.
Why it happens:
- It’s fast to jam logic into natural language.
- It feels flexible — until you need versioning, testing, or auditability.
Mitigation:
- Keep business invariants outside prompts:
  - Limits, thresholds, roles, approval rules
- Use prompts for:
  - Interpretation, classification, summarization
  - Mapping state → recommended tools/arguments
- Treat prompts as code:
  - Version them
  - Add tests with fixed seeds and snapshots
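One cheap way to start treating prompts as code is to fingerprint them and pin the fingerprint in a snapshot test, so a “small tweak” can never slip through CI unnoticed. This sketch assumes nothing beyond the standard library; the prompt text is a placeholder:

```python
import hashlib

def prompt_fingerprint(prompt: str) -> str:
    """Stable hash so CI can detect that a prompt actually changed."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]

# Prompts live in version control as data, not inline string literals.
v1 = 'Classify the ticket. Return JSON with an "intent" field.'
v2 = v1 + " Allowed intents: refund, cancel, other."

# A snapshot test pins the fingerprint; any edit forces a deliberate snapshot
# update (and, ideally, a re-run of your fixed-seed evaluation suite).
assert prompt_fingerprint(v1) != prompt_fingerprint(v2)
```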
2. Over-trusting model output for side effects
Symptoms:
- The model fabricates IDs, endpoints, or fields.
- It calls the wrong API with plausible arguments.
- Integrations “sort of work” in staging, then corrupt data in production.
Why it happens:
- Function calling encourages belief that “the model will follow the schema.”
- In reality, it follows the spirit of the schema, not the letter.
Mitigation:
- Always validate:
  - JSON schema validation before side effects
  - Reference checks (e.g., does this ID exist?)
- Enforce:
  - Role-based access per tool
  - Dry-run mode in non-prod with synthetic data
- Add “compensating actions”:
  - If a step fails, record and halt; don’t guess a fallback.
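A sketch of that validate-before-side-effects gate. Field names are illustrative; for the structural half, a real system would use a proper JSON Schema validator (e.g., the `jsonschema` library) rather than hand-rolled checks:

```python
def validate_refund_proposal(proposal: dict, known_ticket_ids: set) -> list:
    """Gate model output before any side effect: schema check, then reference check."""
    errors = []
    # Schema check: required fields with the right types.
    if not isinstance(proposal.get("ticket_id"), str):
        errors.append("ticket_id must be a string")
    if not isinstance(proposal.get("amount_cents"), int) or proposal.get("amount_cents", 0) <= 0:
        errors.append("amount_cents must be a positive integer")
    # Reference check: does the ID the model produced actually exist?
    if proposal.get("ticket_id") not in known_ticket_ids:
        errors.append("unknown ticket_id (model may have fabricated it)")
    return errors

ok = validate_refund_proposal({"ticket_id": "T-1", "amount_cents": 500}, {"T-1"})
bad = validate_refund_proposal({"ticket_id": "T-999", "amount_cents": -5}, {"T-1"})
```

The refund tool only ever runs when the error list is empty; anything else is recorded and escalated.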
3. Turning RPA spaghetti into “agent spaghetti”
Symptoms:
- You replace 200 RPA scripts with a single “smart agent” that:
  - Logs into 10 systems
  - Handles 20 edge-case flows
- Debugging becomes impossible. Failures look like “the agent did something weird.”
Why it happens:
- The pendulum swings from hyper-explicit flows to “let the agent figure it out.”
Mitigation:
- Decompose by business capability, not by technology:
  - “Invoice matching agent”
  - “Subscription change agent”
  - “KYC document reviewer”
- Keep each agent’s:
  - Tool set small
  - Responsibilities clear
  - Flows observable (traces, logs, per-step metrics)
4. No SLOs, no guardrails, no budget
Symptoms:
- Your AI automation system silently:
  - Generates large cloud bills
  - Introduces latency spikes into critical flows
- Or worse, it:
  - Executes long, looping tool calls with no cap
Mitigation:
- Define SLOs up front:
  - Max cost per workflow
  - Max latency per class of request
  - Max steps per run
- Enforce:
  - Hard per-run ceilings
  - Circuit breakers (disable automation if error rate spikes)
  - Shadow-mode rollouts before full autonomy
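Hard per-run ceilings are a few lines of deterministic code the orchestrator consults before every step. This `RunBudget` class is an illustrative sketch, not a library API; the specific limits are placeholders:

```python
class RunBudget:
    """Hard per-run ceilings checked by the orchestrator before each step."""

    def __init__(self, max_steps: int = 10, max_cost_usd: float = 0.50):
        self.max_steps = max_steps
        self.max_cost_usd = max_cost_usd
        self.steps = 0
        self.cost_usd = 0.0

    def charge(self, cost_usd: float) -> bool:
        """Record one step; return False as soon as any ceiling is exceeded."""
        self.steps += 1
        self.cost_usd += cost_usd
        return self.steps <= self.max_steps and self.cost_usd <= self.max_cost_usd

budget = RunBudget(max_steps=3, max_cost_usd=0.10)
allowed = [budget.charge(0.04) for _ in range(4)]  # later calls trip the ceilings
```

The error-rate circuit breaker sits one level up: track failures across runs, and stop dispatching new runs when the rate spikes.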
Practical playbook (what to do in the next 7 days)
If you’re a tech lead or CTO, you can move from “we should look at agents” to a concrete, low-risk pilot in one week.
Day 1–2: Identify one candidate workflow
Look for:
- High manual load, moderate volume
- Text-heavy input, structured output
- Clear success criteria
Good candidates:
- Classify + respond to specific types of support tickets
- Normalize inbound partner emails into structured requests
- Validate and route inbound forms or applications
- Resolve low-value exceptions in back-office ops
Avoid (for your first iteration):
- Direct payment movement
- Irreversible destructive operations (deletes, hard cancels)
- Anything legally sensitive without clear compliance guidance
Define:
- Target automation rate (e.g., 30–50% of cases)
- Acceptable error rate and failure modes
- “Always human” edge cases (e.g., VIP customers, large amounts)
Day 3: Sketch the orchestration, not the prompt
On a whiteboard or in code:
- Define your state object (fields, lifecycle)
- List tools:
  - Fetch customer/account
  - Fetch knowledge (retrieval)
  - Apply action (e.g., issue refund, update subscription)
- Decide control flow:
  - Where do you ask the LLM for help?
  - Where are decisions purely deterministic?
You should end up with something like:
- Ingest request → enrich with customer data
- LLM: classify intent + compute proposed action
- Policy engine: validate proposal against rules
- If safe → apply action
- Else → escalate with full context to human
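The “policy engine” box in that flow can start life as a plain function of deterministic rules. The action names, the $100 limit, and the VIP rule below are placeholders for your actual policy:

```python
def policy_check(proposal: dict, account: dict) -> tuple:
    """Deterministic rules the model cannot override; returns (safe, reason)."""
    MAX_AUTO_REFUND_CENTS = 10_000  # $100 auto-approval ceiling; adjust to your policy
    if proposal["action"] not in {"refund", "update_subscription"}:
        return False, "action not on allowlist"
    if proposal["action"] == "refund" and proposal["amount_cents"] > MAX_AUTO_REFUND_CENTS:
        return False, "refund above auto-approval limit"
    if account.get("vip"):
        return False, "VIP accounts always go to a human"
    return True, "ok"

safe, reason = policy_check({"action": "refund", "amount_cents": 2_500}, {"vip": False})
```

When `safe` is False, the proposal plus `reason` plus the full state object is exactly the context a human reviewer needs.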
Day 4–5: Build a vertical slice in “shadow” mode
Implement:
- LLM calls with:
  - A system prompt that describes tools and constraints
  - JSON-only output enforced via schema
- Tool wrappers with:
  - Input validation
  - Logging (inputs/outputs)
- Observability:
  - Trace each workflow from input → all intermediate steps → output
  - Tag each trace with cost, latency, and outcome
Run it on real traffic in shadow mode:
- The AI proposes and logs actions, but humans still perform the real work; you then compare its proposals against what the humans actually did.
