Stop Calling It an “Agent” If It’s Just a Cron Job with an LLM

Why this matters this week
Two things are colliding in production environments right now:
- LLMs are finally good enough (and cheap enough) to sit in the middle of real workflows.
- The RPA/“screen scraping” band-aids you put in 2017 are starting to fail in ways that are expensive, opaque, and hard to fix.
If you run a real business system (contact center, back-office ops, logistics, finance, internal tools), you’re probably hearing:
- “Let’s build an AI agent to handle this workflow end-to-end.”
- “We can replace this brittle RPA bot with a copilot that just uses the APIs.”
- “We’ll just orchestrate multiple tools; the AI will figure it out.”
The risk isn’t that this is impossible. The risk is that you build:
- A demo that looks magical but fails in production variance.
- An automation that quietly corrupts data.
- A system that’s impossible to debug or get past security review.
This post is about the actual mechanics and trade-offs of AI automation in production: agents, copilots, orchestration, and replacing RPA in real businesses, not slideware.
What’s actually changed (not the press release)
Three non-hype shifts matter for people who ship systems:
1. LLMs are now “good enough glue” between tools
- You no longer need 100% deterministic rule trees for every branch.
- LLMs can handle:
- Parsing semi-structured inputs (emails, PDFs, chats).
- Mapping them onto internal schemas.
- Selecting which internal APIs to call in what order.
- This doesn’t remove the need for structure; it moves it:
- From “if-else trees in code” to “tool schemas + guardrails + state machines.”
2. Cost and latency are acceptable for inner-loop automation
- With modern models + streaming:
- Sub-500ms “decision steps” are achievable.
- Per-step cost can be in low cents or fractions of a cent, depending on model.
- That makes AI automation viable inside:
- Customer support flows.
- Finance ops (invoice processing, approvals).
- Supply chain / logistics exception handling.
3. Tooling for orchestration is emerging (but immature)
- There are now libraries/platforms for:
- Tool calling (functions, actions, tools).
- Long-running workflows (stateful agents).
- Retrieval-augmented generation (RAG) for context.
- The real change: you can compose LLM calls, tools, and human approvals without building a workflow engine from scratch.
- The catch: you can also create distributed, probabilistic spaghetti very quickly.
What did not magically change:
- Reliability is still not “set and forget.” You need monitoring, drift detection, and rollback.
- Compliance is still your problem: data egress, PII, audit trails.
- Integration cost is still high. AI doesn’t eliminate your systems’ weirdness; it just papers over some of it.
How it works (simple mental model)
A workable mental model for AI automation in production:
1. Think “state machine + stochastic steps”, not “autonomous agent”
Forget the marketing diagrams of agents “wandering around” solving problems.
Instead:
- Define a finite set of states for your workflow.
- At each state, define:
- Inputs (structured + unstructured).
- Permitted actions/tools.
- Exit conditions to move to the next state (or fail/escalate).
- Use the LLM to:
- Interpret unstructured inputs into state-relevant fields.
- Choose among allowed tools/actions.
- Generate outputs visible to humans (emails, tickets, explanations).
The LLM is a bounded policy inside a state, not your entire orchestration engine.
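This “state machine + stochastic steps” model can be sketched in a few lines. All names here (`State`, `StepResult`, `run_workflow`) are hypothetical, and each handler is where an LLM call would live; the loop itself stays deterministic:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable

class State(Enum):
    INTAKE = auto()
    ELIGIBILITY_CHECK = auto()
    PROPOSAL = auto()
    DONE = auto()
    ESCALATED = auto()

@dataclass
class StepResult:
    next_state: State
    fields: dict  # structured fields the step extracted or decided

# Each state maps to a handler; only that handler's allowed tools are
# exposed to the LLM, so it acts as a bounded policy inside one state.
def run_workflow(initial_fields: dict,
                 handlers: dict[State, Callable[[dict], StepResult]],
                 max_steps: int = 10) -> tuple[State, dict]:
    state, fields = State.INTAKE, dict(initial_fields)
    for _ in range(max_steps):
        if state in (State.DONE, State.ESCALATED):
            return state, fields
        result = handlers[state](fields)  # may call an LLM internally
        state, fields = result.next_state, {**fields, **result.fields}
    return State.ESCALATED, fields  # step budget exhausted -> escalate
```

Note the step budget: a stochastic policy can loop, so the engine, not the model, decides when to give up and escalate.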
2. Tools are the real “API surface” of your business logic
Instead of letting the LLM “do anything,” you expose:
- Deterministic tools:
  `create_ticket`, `update_order_status`, `fetch_invoice`, `calculate_refund`.
- Idempotent reads whenever possible:
- Separate “read” and “write” tools.
- Avoid tools that both query and mutate in one call.
- Validation tools:
  `validate_address`, `check_policy_compliance`, `check_budget_limit`.
The agent doesn’t “understand your business”; it just sequences tools you define.
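A minimal sketch of such a tool surface, assuming a hypothetical registry where every tool declares whether it mutates state and what arguments it accepts (the tool bodies are stubs, not real integrations):

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Tool:
    name: str
    fn: Callable[..., Any]
    mutating: bool                               # separate reads from writes
    schema: dict = field(default_factory=dict)   # expected argument types

# Registry of deterministic tools; the LLM only ever selects from this list.
TOOLS: dict[str, Tool] = {}

def register(name: str, mutating: bool, schema: dict):
    def deco(fn):
        TOOLS[name] = Tool(name, fn, mutating, schema)
        return fn
    return deco

@register("fetch_invoice", mutating=False, schema={"invoice_id": str})
def fetch_invoice(invoice_id: str) -> dict:
    return {"invoice_id": invoice_id, "amount": 120.0}  # stubbed read

@register("create_ticket", mutating=True, schema={"subject": str})
def create_ticket(subject: str) -> dict:
    return {"ticket_id": "T-1", "subject": subject}     # stubbed write

def call_tool(name: str, **kwargs) -> Any:
    tool = TOOLS[name]
    for arg, typ in tool.schema.items():  # validate before executing
        if not isinstance(kwargs.get(arg), typ):
            raise ValueError(f"{name}: bad or missing argument {arg!r}")
    return tool.fn(**kwargs)
```

The `mutating` flag is what lets you apply different policies (approvals, rate limits) to writes than to reads.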
3. Guardrails = contracts, not vibes
Guardrails that actually work look like:
- JSON schemas for tool inputs/outputs.
- Hard business constraints in the tool implementations:
- “Refund cannot exceed last 90 days of charges.”
- “Cannot change account owner without verified identity token.”
- Policy checks outside the LLM:
- Approvals for high-risk steps.
- Rate limits and anomaly detection on certain tools.
Prompting is not your primary safety mechanism. Code and contracts are.
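To make that concrete, here is the “refund cannot exceed last 90 days of charges” constraint enforced inside the tool implementation itself; the function name and data shape are illustrative:

```python
from datetime import datetime, timedelta

def calculate_refund(requested: float,
                     charges: list[tuple[datetime, float]]) -> float:
    """Hard business constraint enforced in code, not in the prompt:
    a refund can never exceed the last 90 days of charges."""
    cutoff = datetime.utcnow() - timedelta(days=90)
    recent_total = sum(amount for ts, amount in charges if ts >= cutoff)
    if requested > recent_total:
        raise ValueError(
            f"refund {requested} exceeds 90-day charge total {recent_total}")
    return requested
```

No matter what the model proposes, the tool refuses out-of-policy amounts, and the failure is an explicit exception you can log and escalate on.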
4. Humans remain part of the workflow
Turn “autonomy” into a tunable spectrum:
- Low-risk, low-value actions → fully automated.
- Medium-risk → “AI drafts, human confirms”.
- High-risk → “AI preps context, human decides”.
You change the automation level per state based on observed performance and business risk, not on a single “on/off” switch.
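One way to make that spectrum tunable is a small policy table keyed by action and risk band; the action names and bands below are invented for illustration:

```python
from enum import Enum

class Autonomy(Enum):
    AUTO_EXECUTE = "auto_execute"   # low risk: act without review
    DRAFT_FOR_APPROVAL = "draft"    # medium risk: AI drafts, human confirms
    CONTEXT_ONLY = "context_only"   # high risk: AI preps, human decides

# Tunable per action/risk band; adjust as observed accuracy improves.
AUTONOMY_POLICY = {
    ("refund", "low_value"): Autonomy.AUTO_EXECUTE,
    ("refund", "high_value"): Autonomy.DRAFT_FOR_APPROVAL,
    ("account_owner_change", "any"): Autonomy.CONTEXT_ONLY,
}

def autonomy_for(action: str, risk_band: str) -> Autonomy:
    # Default to the most conservative level when the pair is unknown.
    return AUTONOMY_POLICY.get(
        (action, risk_band),
        AUTONOMY_POLICY.get((action, "any"), Autonomy.CONTEXT_ONLY))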
Where teams get burned (failure modes + anti-patterns)
1. Treating LLM steps as if they were deterministic
Symptoms:
- No explicit retry logic because “it worked in staging.”
- Downstream code assumes fields will always be present and correctly formatted.
- Silent failures: bad outputs that look plausible.
Mitigations:
- Require schemas + validation on all LLM outputs.
- Log both prompts and structured outputs for sampling.
- Add statistical checks: volume, distribution shifts, anomaly alerts.
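A minimal sketch of the first mitigation, schema validation plus bounded retries on every LLM output (the field names are placeholders; a real system would use a schema library and re-prompt with the error):

```python
import json

REQUIRED_FIELDS = {"intent": str, "customer_id": str, "confidence": float}

def parse_llm_output(raw: str) -> dict:
    """Validate an LLM's JSON output against a strict schema; raise
    instead of letting a plausible-looking bad payload flow downstream."""
    data = json.loads(raw)
    for name, typ in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), typ):
            raise ValueError(f"missing or mistyped field: {name}")
    return data

def call_with_retries(llm_call, max_attempts: int = 3) -> dict:
    last_err = None
    for _ in range(max_attempts):
        try:
            return parse_llm_output(llm_call())
        except (json.JSONDecodeError, ValueError) as err:
            last_err = err  # in production: log and re-prompt with the error
    raise RuntimeError(f"LLM output failed validation: {last_err}")
```

The point is that a malformed output becomes a loud, retryable error instead of a silently corrupted record.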
2. “One mega-agent” trying to do everything
Pattern:
- Single agent with 20+ tools.
- Ambiguous responsibilities (“it handles support, billing, and account changes”).
- Debugging is impossible because behavior depends on sprawling context.
Mitigations:
- Decompose into narrow agents or states:
- “Refund flow agent”
- “Address change agent”
- “Invoice classification agent”
- Have a simple, mostly deterministic router up front:
- Could be rules + an intent classifier LLM call.
- Router outputs: which flow to enter, not what exact steps to take.
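A sketch of such a router, with hypothetical keyword rules handling the obvious cases and everything else deferred to a model-based classifier:

```python
REFUND_KEYWORDS = {"refund", "money back", "chargeback"}
ADDRESS_KEYWORDS = {"address", "moved", "relocation"}

def route(ticket_text: str) -> str:
    """Mostly deterministic router: cheap rules first; only the ambiguous
    remainder would go to a small intent-classifier LLM call.
    Returns which flow to enter, not what exact steps to take."""
    text = ticket_text.lower()
    if any(k in text for k in REFUND_KEYWORDS):
        return "refund_flow"
    if any(k in text for k in ADDRESS_KEYWORDS):
        return "address_change_flow"
    return "llm_intent_classifier"  # fall through to a model-based classifier
```

Rules catch the cheap majority; the classifier only pays model cost and latency on the long tail.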
3. Rebuilding RPA with LLMs instead of using APIs
Teams often:
- Keep screen-scraping legacy UIs.
- Use LLMs to “understand” pages and click buttons via a browser automation bot.
This is marginally less brittle than classic RPA but still fragile and opaque.
Better options (if you can):
- Wrap internal systems with stable, minimal APIs.
- Let the LLM call those APIs, not click screens.
- Where APIs don’t exist and can’t be built, limit these flows to low-risk cases and keep humans in the loop.
4. Ignoring organizational constraints
Automation fails not only technically but socially:
- Finance refuses to sign off because there’s no clear audit trail.
- Security blocks rollout due to uncontrolled data flows.
- Ops teams don’t trust black-box changes to critical systems.
Mitigations:
- Design for auditability: log which tools were called, with what params, and why.
- Maintain a policy document: what the agent is and is not allowed to do.
- Involve security/compliance early; treat the agent as a privileged service, not a toy.
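A minimal shape for the audit record described above: every tool call captures who acted, with what parameters, and why. The sink here is just `print`; a real deployment would write to durable, tamper-evident storage:

```python
import json
import time
import uuid

def audit_log(tool: str, params: dict, rationale: str,
              actor: str = "agent") -> dict:
    """Append-only audit record: which tool, with what params, and why."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "actor": actor,
        "tool": tool,
        "params": params,
        "rationale": rationale,
    }
    print(json.dumps(record))  # stand-in for a real audit sink
    return record
```

The `rationale` field is what turns “the bot did something” into an answer finance and security can actually review.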
5. Over-indexing on “top model, big context window”
Swapping models and maxing context is not a strategy.
Risks:
- Unpredictable cost growth.
- Latency spikes.
- Having to re-validate everything after the model provider changes behavior.
Mitigations:
- Use tiered model selection:
- Small/cheap models for classification, routing, extraction.
- Larger models for complex reasoning or high-value actions.
- Minimize context via:
- Retrieval (RAG) with tight filters.
- Pre-computed summaries / embeddings.
- Design your automation so you can change models without rewriting flows.
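Tiered selection and model-agnostic flows can share one mechanism: states name a *task*, and a single mapping picks the model. The model names and costs below are invented placeholders, not real pricing:

```python
# Hypothetical model tiers; names and costs are illustrative only.
MODEL_TIERS = {
    "small": {"model": "small-fast-model", "cost_per_call": 0.001},
    "large": {"model": "large-capable-model", "cost_per_call": 0.02},
}

TASK_TIER = {
    "classification": "small",
    "routing": "small",
    "extraction": "small",
    "complex_reasoning": "large",
    "high_value_action": "large",
}

def pick_model(task: str) -> str:
    """States name a task, not a model, so swapping providers or tiers
    touches this one mapping instead of every flow."""
    tier = TASK_TIER.get(task, "large")  # unknown tasks get the capable tier
    return MODEL_TIERS[tier]["model"]
```

Defaulting unknown tasks to the capable tier trades cost for safety; you can flip that default once routing is well observed.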
Practical playbook (what to do in the next 7 days)
Assuming you’re a tech lead or CTO with at least one candidate workflow:
Day 1–2: Pick a narrow, painful workflow
Look for:
- High manual repetition.
- Clear inputs/outputs.
- Tolerable risk if errors happen (e.g., drafts, suggestions, internal tasks).
Examples:
- Classifying and routing inbound emails/tickets.
- Drafting responses for common support categories.
- Extracting and validating fields from invoices or contracts.
- Preparing refund recommendations within strict bounds.
Define:
- Current process steps.
- Failure modes and their cost.
- “Safe maximum autonomy” for the first version.
Day 2–3: Define the state machine and tools
- Write a simple state diagram (can be a text doc):
Example: Support refund flow
  - State: Intake → extract ticket data + classify intent.
  - State: EligibilityCheck → confirm user, purchase, timeframe.
  - State: Proposal → decide refund amount within business rules.
  - State: Approval → auto-approve low amounts, send higher ones for review.
  - State: Execution → call refund API, notify user, log action.
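The refund flow above can be written down as an explicit transition table; the state and event names here mirror the text but the events are assumptions:

```python
# (state, event) -> next state; anything unlisted escalates to a human.
TRANSITIONS = {
    ("Intake", "classified"): "EligibilityCheck",
    ("EligibilityCheck", "eligible"): "Proposal",
    ("EligibilityCheck", "ineligible"): "Rejected",
    ("Proposal", "low_amount"): "Execution",   # auto-approve path
    ("Proposal", "high_amount"): "Approval",   # human review path
    ("Approval", "approved"): "Execution",
    ("Approval", "denied"): "Rejected",
}

def next_state(state: str, event: str) -> str:
    # Unknown (state, event) pairs escalate rather than guessing.
    return TRANSITIONS.get((state, event), "Escalated")
```

Writing transitions as data makes the permitted paths reviewable by non-engineers before any model is involved.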
- For each state, define tools:
  - Read-only: `get_customer_history`, `get_purchase_details`.
  - Mutating: `create_refund`, `send_email_template`.
  - Validation: `check_refund_policy`, `check_max_refund`.
- Define what the LLM decides in each state:
- Maps free-text → structured fields.
- Chooses which tool(s) to call, with what parameters.
- Generates human-readable explanations.
Day 3–4: Build a thin vertical slice
- Hard-code one or two main paths (e.g., common refund scenarios).
- Implement:
- Tool layer with strong validation.
- LLM prompts for each state, with strict expected JSON outputs.
- Logging of every step (prompts, tool calls, outputs, decisions).
- Run synthetic test cases:
- “Happy path”
- Known edge cases
- Adversarial-ish inputs (empty fields, conflicting info, partial IDs).
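Those synthetic cases can run through a tiny harness that records pass/fail per case; the case payloads and `slice_fn` signature are assumptions standing in for your vertical slice:

```python
# Happy-path, edge, and adversarial-ish inputs for the vertical slice.
CASES = [
    {"name": "happy_path", "input": {"customer_id": "C1", "amount": 20.0}},
    {"name": "empty_fields", "input": {}},
    {"name": "partial_id", "input": {"customer_id": "C", "amount": 20.0}},
]

def run_suite(slice_fn) -> dict[str, bool]:
    """Run the slice over every case; an exception marks the case failed."""
    results = {}
    for case in CASES:
        try:
            slice_fn(case["input"])
            results[case["name"]] = True
        except Exception:
            results[case["name"]] = False  # log the full trace in a real harness
    return results
```

Keeping the cases as data means the same suite reruns unchanged after every prompt or model tweak, which is how you notice regressions.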
Day 4–5: Put a human in the loop
Deploy in shadow mode or an “AI suggestion only” mode:
- The agent proposes:
- Classification.
- Refund amount or action.
- Outgoing messages.
- Humans:
- Approve, edit, or reject.
- Capture:
- Acceptance rate.
- Types of corrections.
- Systematic failures.
Aim for 20–50 real cases before increasing the automation level.
