Stop Calling It “Agents” If It’s Just a Cron Job with a GPT Call

Why this matters this week
AI automation quietly crossed a threshold in the last ~90 days:
- It’s now cheaper than a human for a growing set of narrow, well-scoped workflows.
- It’s more robust than brittle RPA for anything involving unstructured text, screenshots, or variable layouts.
- It’s integrating into production via APIs, event buses, and workflow engines—not just living in browser extensions and prototypes.
You’re seeing:
- “Agents” promising to run whole back-office processes.
- Copilots that don’t just suggest text, but actually submit tickets, send emails, and update CRM.
- Workflow orchestrators that claim to replace RPA bots with LLM-powered flows.
For a technical leader, the question is no longer “Can we do this?” but:
- Where is AI automation actually ready to own a production workflow end-to-end?
- What architecture keeps this from becoming the new RPA hairball?
- How do you keep costs, latency, and blast radius within reason?
This post is about that: not “AI agents” generically, but concrete patterns that replace brittle RPA and glue code with something maintainable.
What’s actually changed (not the press release)
Three practical shifts matter for real businesses:
1. Models are “good enough” at structured extraction and constrained generation
The meaningful shift isn’t “IQ 155 vs 140.” It’s:
- Reliably parse semi-structured mess → structured JSON
  Invoices, emails, order forms, support threads, logs.
  You can enforce JSON schemas, use tools/functions, and validate outputs.
- Follow templates and constraints
  “Fill these fields. Use only these values. Don’t touch anything else.”
  This is what you need to safely drive downstream systems.
This unlocks a lot of RPA use cases that were previously fragile due to layout or text variability.
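In practice, “enforce and validate” is a thin parsing layer between the model and your systems. A minimal stdlib-only sketch, where the invoice fields and the `parse_extraction` helper are illustrative, not any real API:

```python
import json

# Illustrative schema for an invoice-extraction step: field name -> expected type.
INVOICE_SCHEMA = {"invoice_number": str, "vendor": str, "total_cents": int}

def parse_extraction(raw: str) -> dict:
    """Parse model output and enforce the schema; fail loudly on any drift."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    for field, expected in INVOICE_SCHEMA.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected):
            raise TypeError(f"{field}: expected {expected.__name__}")
    extra = set(data) - set(INVOICE_SCHEMA)
    if extra:  # reject keys the schema doesn't know about
        raise ValueError(f"unexpected fields: {sorted(extra)}")
    return data
```

Anything that fails here gets escalated instead of silently corrupting downstream systems.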
2. Tool calling + orchestration frameworks are viable, not science projects
The ecosystem matured around:
- Tool calling / function calling in mainstream LLM APIs.
- Workflow engines (Temporal, Airflow, Step Functions, etc.) being used as the outer orchestrator around LLM steps.
- Vector search and retrieval as standard plumbing, not research.
This lets you treat “call the model + tools” as just another node in a workflow, with retries, timeouts, metrics, and idempotency—things RPA tools have been weak at.
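Concretely, an LLM step can be wrapped like any other flaky I/O node: bounded retries, validation, a hard escalation path. A sketch, where `call_model` and `validate` are placeholders for your API client and schema check:

```python
def run_llm_node(call_model, validate, max_attempts=3):
    """Run a model call as a workflow node: validate the output, retry a
    bounded number of times, then raise so the orchestrator can escalate."""
    last_error = None
    for _ in range(max_attempts):
        output = call_model()
        try:
            return validate(output)   # validated, structured result
        except ValueError as e:
            last_error = e            # invalid output: retry with a fresh call
    raise RuntimeError(f"LLM node failed after {max_attempts} attempts: {last_error}")
```

The orchestrator, not the model, decides what "give up" means: retry policy, timeout, and escalation all live in deterministic code.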
3. Cost/latency are now often competitive with low-skill human processing
For many back-office tasks:
- Inference cost: a few cents per document / interaction
- Latency: sub-second to a few seconds, depending on complexity
- Comparable to or cheaper than:
- Offshored BPO
- Manual triage queues
- Basic internal operations teams
You don’t replace all humans. You compress the human surface area and move them up the stack (exception handling, auditing, process improvement).
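A back-of-envelope comparison makes the point. The numbers below are illustrative assumptions, not benchmarks; plug in your own volumes and rates:

```python
# Illustrative assumptions: adjust to your own workload.
docs_per_month = 20_000
inference_cost_per_doc = 0.04        # USD, model + OCR
human_minutes_per_doc = 3
human_cost_per_hour = 25.0           # fully loaded rate

ai_monthly = docs_per_month * inference_cost_per_doc
human_monthly = docs_per_month * (human_minutes_per_doc / 60) * human_cost_per_hour

print(f"AI: ${ai_monthly:,.0f}/mo vs human: ${human_monthly:,.0f}/mo")
# → AI: $800/mo vs human: $25,000/mo
```

Even with generous error-handling overhead, the gap is usually wide enough that the decision hinges on risk and quality, not unit cost.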
How it works (simple mental model)
Strip the hype. A production AI automation stack usually looks like:
1. Trigger
- Event in your system: new support email, uploaded document, form submission, webhook.
- Scheduled job: daily reconciliation, nightly compliance checks.
2. Ingestion + Normalization
- Convert incoming artifacts to a standard form:
  - PDFs → text + images
  - Emails → plain text + metadata
  - Screenshots → OCR text
- Store the raw artifact (you’ll want it later for audits).
3. LLM Step(s): Perception + Decision + Draft Plan
At least one of:
- Classification: route, priority, type, risk.
- Extraction: capture structured fields into a schema.
- Planning: decide which actions to take (“update CRM, respond to user, create ticket”).
4. Tool Calls / Action Execution
The model doesn’t “do everything.” It:
- Emits intent: {"action": "create_invoice", "fields": {...}}
- Calls tools via:
  - Internal APIs (CRM, ERP, ticketing)
  - External services (payments, messaging)
Tools are deterministic; the model is advisory.
5. Guardrails + Policy Layer
Before actions are committed:
- Schema validation
- Business rule checks (limits, approvals, role-based restrictions)
- Possibly a lightweight secondary model for sanity checks
6. Human-in-the-Loop (HITL) Where Needed
- Edge cases, high-risk actions, or first roll-out phase.
- UI for “approve / edit / reject” with full context.
7. Logging, Metrics, Feedback
- Log: prompts, responses, tool calls, diffs, approvals/rejections.
- Metrics: success rate, auto-resolution rate, cost, latency.
- Feedback loop: rejected actions feed fine-tuning or prompt improvements.
Think of it as:
Events → Orchestrator → LLM “brain” + deterministic tools → Guardrails → Commit or escalate.
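That whole chain fits in one readable function. A sketch of the skeleton, with every handler name a placeholder for your own services, injected so the control flow stays deterministic and testable:

```python
def handle_event(event, extract, validate, execute, escalate, log):
    """One pass through the pipeline: perceive -> validate -> commit or escalate."""
    artifact = event["payload"]           # already ingested/normalized upstream
    intent = extract(artifact)            # LLM step: emits a structured intent
    try:
        validate(intent)                  # guardrails: schema + business rules
    except ValueError as reason:
        log(event, intent, status="escalated")
        return escalate(event, reason)    # human-in-the-loop path
    result = execute(intent)              # deterministic tool call(s)
    log(event, intent, status="committed")
    return result
```

Note that the model only appears in `extract`; every branch after that is ordinary code you can unit-test.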
Where teams get burned (failure modes + anti-patterns)
1. Treating the LLM as the orchestrator, not a component
Anti-pattern:
- “We’ll just let the agent decide everything: plan, call tools, loop until done.”
Failure modes:
- Non-deterministic control flow
- Infinite or wasteful tool-calling loops
- Hard-to-debug “why did it do that?” incidents
Better:
- Use your existing workflow engine for control flow.
- Use LLMs for:
  - Local decisions (“which queue does this belong in?”)
  - Conversions (“parse this contract into this schema”)
  - Synthesis (“draft reply based on these facts”)
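A “local decision” means the model picks from a closed set and deterministic code owns everything else. A sketch, with made-up queue names:

```python
# Closed set of valid destinations; the model cannot invent new ones.
ALLOWED_QUEUES = {"billing", "bugs", "account", "other"}

def route_ticket(classify, ticket_text: str) -> str:
    """Ask the model for a label, but accept only whitelisted values.
    Anything unexpected falls back to a human-reviewed queue."""
    label = classify(ticket_text).strip().lower()
    return label if label in ALLOWED_QUEUES else "other"
```

The model's worst-case failure is now “ticket lands in the review queue,” not “ticket vanishes.”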
2. Skipping a real schema and validation layer
Anti-pattern:
- Prompt: “Please output JSON with these fields.”
- Code: JSON.parse(response) and pray.
Failure modes:
- Occasionally malformed JSON
- Subtle type drift (numbers as strings, missing fields)
- Downstream exceptions that look like “random bugs”
Better:
- Enforce a concrete JSON schema or function-calling spec.
- Validate strictly:
  - If invalid → either auto-correct using a repair prompt or hard fail + escalate.
- Keep deterministic software as the source of truth for business rules.
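The “repair or escalate” branch is only a few lines. A sketch with one bounded repair attempt, where `repair` (a second model call that fixes the JSON) and `escalate` (the human queue) are illustrative callables:

```python
import json

def parse_or_escalate(raw: str, repair, escalate):
    """Try to parse; on failure, give the model exactly one repair attempt,
    then hand the raw output to a human instead of retrying forever."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    try:
        return json.loads(repair(raw))   # one repair prompt, no loops
    except json.JSONDecodeError:
        return escalate(raw)             # hard fail -> human queue
```

The single retry is deliberate: unbounded repair loops are the agent anti-pattern from section 1 in miniature.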
3. Over-automation of ambiguous processes
Anti-pattern:
- Automating workflows where:
  - Policies are fuzzy (“use judgment”).
  - Exceptions are frequent.
  - Ground truth is spread across tribal knowledge and Slack history.
Failure modes:
- “Policy” encoded in prompts that drift and are hard to audit.
- Conflicting outcomes for similar cases.
- Users lose trust because behavior seems arbitrary.
Better:
- Start with:
  - Document-centric workflows (invoices, contracts, KYC docs)
  - Clearly rule-bound processes (fee calculation, eligibility checks)
  - Repetitive templated communication (status updates, scheduling, reminders)
4. No observability or review UX
Anti-pattern:
- Bots silently taking actions in production, with log entries buried in some service.
Failure modes:
- Incidents discovered by customers, not by you.
- Impossible to answer: “How often are we wrong? Where?”
Better:
- Give operations a single screen showing:
  - The last N automated decisions
  - Confidence / rationale
  - Quick actions: revert, escalate, mark good/bad
- Ship metrics out of the box: volume, error rate, HITL rate, per-workflow cost.
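A uniform decision record is what makes that screen cheap to build: one append-only log line per automated action, read by both the dashboard and the metrics job. A sketch with illustrative field names:

```python
import json
import time

def decision_record(workflow, input_ref, model_output, action, status,
                    confidence=None, rationale=None):
    """Serialize one automated decision as a JSON log line."""
    return json.dumps({
        "ts": time.time(),
        "workflow": workflow,        # e.g. "support-triage"
        "input_ref": input_ref,      # pointer to the raw artifact, not the PII itself
        "model_output": model_output,
        "action": action,
        "status": status,            # "committed" | "escalated" | "reverted"
        "confidence": confidence,
        "rationale": rationale,
    })
```

Storing a reference to the artifact rather than its contents keeps the log useful for audits without turning it into a second PII store.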
Example pattern:
- A mid-sized SaaS company automated support triage. First attempt: fully automatic routing with no dashboard. High-value customers got delayed responses because mis-routed tickets weren’t spotted for hours. The fix was simple: an ops dashboard with an “unconfirmed routing” queue and a daily review.
5. Ignoring data residency, PII, and vendor lock-in
Anti-pattern:
- Uploading arbitrary internal documents or PII to whichever LLM API is convenient.
- Hard-coding prompts and tooling into a single vendor integration.
Failure modes:
- Compliance surprises when security/legal eventually reviews.
- Expensive or impossible migration later.
Better:
- Classify flows: PII / financial / health / regulated vs. non-sensitive.
- For sensitive flows:
  - Prefer self-hosted or VPC-deployed models and vector stores.
  - Use field-level encryption or redaction.
- Abstract the vendor behind a thin internal interface: LLMClient, EmbeddingClient, ToolExecutor.
- Keep prompts and tools in a configuration layer, not scattered across call sites.
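That abstraction can be as small as one Protocol per capability, so call sites never import a vendor SDK directly. A sketch following the LLMClient/EmbeddingClient naming above; the method signatures and the StaticLLM stub are illustrative:

```python
from typing import Protocol

class LLMClient(Protocol):
    """The only surface call sites see; vendor SDKs live behind implementations."""
    def complete(self, prompt: str, schema: dict) -> dict: ...

class EmbeddingClient(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...

# A canned-response stub makes tests and vendor swaps trivial.
class StaticLLM:
    def __init__(self, canned: dict):
        self.canned = canned
    def complete(self, prompt: str, schema: dict) -> dict:
        return self.canned
```

Swapping vendors then means writing one new implementation, not hunting prompts through the codebase.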
Practical playbook (what to do in the next 7 days)
You don’t need a “multi-agent platform” to start. You need one solid workflow in production.
Day 1–2: Inventory candidates
Walk through 2–3 business teams (support, finance, ops). Ask:
- Which workflows:
- Have >100 executions per week?
- Are currently handled by people reading text or documents?
- Have relatively clear success criteria?
Common high-yield patterns:
- Customer support triage + draft replies
- Invoice / purchase order ingestion
- Lead qualification from inbound emails or forms
- KYC / document verification pre-screens
- Internal ticket classification and routing
Pick one with:
- Low regulatory / reputational risk.
- Clear “right vs wrong” outcomes.
- Available historical examples (for evaluation).
Day 3: Model the workflow explicitly
Draw a sequence, no LLM yet:
- Trigger (what starts this?)
- Inputs (emails, PDFs, screenshots, fields)
- Decisions (what branches? what labels?)
- Actions (what systems are updated? what communications go out?)
- Exceptions (when do humans get involved?)
Identify where an LLM would help:
- Extraction step: “read invoice → structured line items.”
- Classification step: “support email → category, priority, product.”
- Drafting step: “generate first-draft response or summary.”
Everything else stays deterministic.
Day 4: Build a thin vertical slice
Using your existing stack:
- Implement the orchestrator in whatever you use (Temporal, Step Functions, plain services + queue, etc.).
- Add a single LLM step:
- Define a JSON schema or function spec.
- Add validation.
- Hardcode fallback: if validation fails → send to human unchanged.
Run it in shadow mode for a subset of real data:
- Don’t let it take action yet.
- Log:
- Model outputs
- Human decisions
- Differences between them
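Shadow mode boils down to logging a model/human pair and diffing it field by field. A sketch, with an illustrative record shape:

```python
def shadow_diff(model_decision: dict, human_decision: dict) -> dict:
    """Compare model output against what the human actually did,
    returning only the fields where they disagree."""
    fields = set(model_decision) | set(human_decision)
    return {
        f: {"model": model_decision.get(f), "human": human_decision.get(f)}
        for f in fields
        if model_decision.get(f) != human_decision.get(f)
    }
```

An empty diff means the model would have done exactly what the human did, which is the number you're trying to drive up before flipping the switch.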
Day 5–6: Evaluate with real metrics
Define 3–5 concrete metrics:
- Extraction accuracy (per-field)
- Routing accuracy (% of items going to correct queue)
- Draft quality
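Per-field extraction accuracy falls straight out of the shadow-mode logs. A sketch that aggregates (model output, human ground truth) pairs:

```python
def per_field_accuracy(pairs):
    """pairs: iterable of (model_output, ground_truth) dicts from shadow mode.
    Returns field -> fraction of items where the model matched the human."""
    hits, totals = {}, {}
    for model, truth in pairs:
        for field, expected in truth.items():
            totals[field] = totals.get(field, 0) + 1
            if model.get(field) == expected:
                hits[field] = hits.get(field, 0) + 1
    return {f: hits.get(f, 0) / totals[f] for f in totals}
```

Per-field numbers matter more than an overall score: a workflow can be safe to automate with 99% vendor-name accuracy and 80% line-item accuracy if only the former drives actions.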
