Stop Calling It an “Agent” If It’s Just a Cron Job with GPT

Why this matters this week
AI “agents” and copilots are now leaving demo-land and touching real workflows: invoices, support queues, fraud queues, ETL, internal tools. The conversation is shifting from “can GPT do X?” to “can we run this safely, cheaply, and repeatably in production?”
Three concrete things I’m seeing this week across engineering orgs:
- Teams replacing brittle RPA scripts with LLM-based steps plugged into existing workflow engines (Temporal, Airflow, Step Functions, etc.).
- “Agents” being quietly downgraded from “autonomous co-worker” to “narrow task worker with guardrails and human review.”
- Cost surprises: what looked like a $2k/mo experiment becomes an unbounded token-burner once connected to production event streams.
If you’re responsible for reliability, security, or budget, this is the moment to set the architecture patterns before your org hardcodes a bunch of fragile AI automation into core business processes.
What’s actually changed (not the press release)
Under the noise, three real shifts have made AI automation in businesses more viable in the last ~6–9 months:
1. Models are good enough at “structured text in / structured text out”
For many business workflows, “reasoning” really means:
- Normalize dirty inputs (emails, PDFs, scraped HTML).
- Map them onto existing schemas (ticket fields, case codes, invoice data).
- Apply explicit rules or policies (SLA rules, escalation logic, compliance red lines).
Modern LLMs are now reliably good at:
- Schema-constrained JSON output.
- Simple tool usage (calls to your APIs for lookup/updates).
- Staying on task when prompts are stable and inputs are narrow.
This makes it feasible to replace brittle pattern-matching RPA with AI steps that tolerate layout changes and phrasing variance.
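As a concrete sketch of what "schema-constrained JSON output" means in practice, here is a minimal validator for a model response. The field names and allowed priority values are illustrative, not from any real product schema:

```python
import json
from typing import Optional

# Expected fields and types for the model's output (illustrative names).
TICKET_SCHEMA = {"category": str, "priority": str, "summary": str}
ALLOWED_PRIORITIES = {"low", "medium", "high"}

def parse_ticket_fields(raw: str) -> Optional[dict]:
    """Parse a model response and validate it; return None on any violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field, expected_type in TICKET_SCHEMA.items():
        if not isinstance(data.get(field), expected_type):
            return None
    if data["priority"] not in ALLOWED_PRIORITIES:
        return None
    return data

# A conforming response is accepted; anything else routes to a fallback path.
ok = parse_ticket_fields('{"category": "billing", "priority": "high", "summary": "refund request"}')
bad = parse_ticket_fields('not even json')
```

The point is that the model never gets to define what "valid" means: your code does, and anything that fails validation has a deterministic fallback.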
2. Tooling and orchestration have matured enough
We’ve gone from “single chat completion” to:
- Deterministic wrappers (JSON mode, function calling, retries).
- Workflow engines that treat LLM calls as just another side-effecting step.
- Policy layers for safety (allowed tools, rate limits, approval gates).
Instead of monolithic “autonomous agents,” teams are stitching small AI tasks into existing workflow orchestrators. This looks much more like microservices + queues than like a sci-fi agent framework.
3. Data plumbing has improved inside companies
Vector databases and retrieval systems are less of an experiment and more of a commodity component. The meaningful change:
- Org data is increasingly exposed behind internal APIs or retrieval layers.
- Permissions and tenancy are at least considered (if not perfect).
- Observability is getting bolted on (traces, logs, token accounting).
This makes it realistic to have AI workflows that operate on real customer data without dumping everything in a flat S3 bucket and hoping nothing goes wrong.
How it works (simple mental model)
Forget the “agent” branding. For real businesses, what’s actually shipping looks more like this:
Event → Workflow → AI Step(s) → Decision/Action → Human + Logs
Use this mental model:
- Event
  - Something happens: an email arrives, a ticket is opened, a form is submitted, a cron triggers a batch.
  - You already have this: webhooks, queues, DB triggers, scheduled jobs.
- Workflow
  - A workflow engine or orchestrator defines the steps:
    - Fetch related data.
    - Call AI to transform/interpret.
    - Apply rules/constraints.
    - Decide: auto-act or send to human.
  - This can be Temporal/Airflow/Argo/Step Functions, or even a disciplined background job system.
- AI Step
  - A narrowly scoped LLM call (or 2–3 chained calls) does one of:
    - Classification (route, prioritize, tag).
    - Extraction (structured fields from messy input).
    - Drafting (responses, summaries, explanations).
    - Suggesting actions (which your code then enforces).
  - This step is stateless from the perspective of the orchestrator: it gets input, returns output, no hidden agent memory.
- Decision/Action
  - Your non-LLM code decides:
    - Whether output is valid (e.g., JSON schema validation).
    - Whether it can be auto-applied (policy checks, thresholds).
    - Whether a human must approve.
- Human + Logs
  - Humans review edge cases; corrections feed back as training/eval data.
  - System logs:
    - Inputs/outputs (with PII handling).
    - Cost and latency.
    - Error types and fallback paths.
You can think of this as “RPA 2.0 with probabilistic steps”:
- RPA tried to script everything.
- AI workflows accept that some steps are probabilistic and mitigate via structure, validation, and humans-in-the-loop.
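The whole pipeline above can be sketched in a few lines. The AI step is stubbed out, and the categories, threshold, and function names are illustrative assumptions:

```python
from typing import Callable

AUTO_APPLY_CONFIDENCE = 0.9  # threshold is illustrative
VALID_CATEGORIES = {"billing", "bug", "account"}

def handle_ticket(event: dict, ai_step: Callable[[dict], dict]) -> dict:
    """One workflow run: call the AI step, validate its output, then decide."""
    result = ai_step(event)  # stateless: input in, structured output out
    # Deterministic code, not the model, decides what is acceptable.
    if result.get("category") not in VALID_CATEGORIES:
        return {"action": "human_review", "reason": "invalid_category"}
    if result.get("confidence", 0.0) < AUTO_APPLY_CONFIDENCE:
        return {"action": "human_review", "reason": "low_confidence"}
    return {"action": "auto_route", "queue": result["category"]}

# Stub standing in for a real LLM call.
def fake_ai_step(event: dict) -> dict:
    return {"category": "billing", "confidence": 0.95}

decision = handle_ticket({"body": "please refund my last charge"}, fake_ai_step)
```

Note how the probabilistic part is confined to one function call; everything around it is ordinary, testable code.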
Where teams get burned (failure modes + anti-patterns)
Some recurring anti-patterns that are turning “AI automation” into reliability or cost incidents:
1. Treating the LLM as the orchestrator
Failure mode:
- You prompt “you are an autonomous agent, decide what to do,” give it tools, and let it loop.
- It does 3–10 tool calls when 1–2 would suffice.
- It occasionally deadlocks, loops, or chooses unsafe actions.
Why it hurts:
- Unbounded cost and latency.
- Hard to reason about; impossible to assert behavior for compliance.
- Debugging turns into log archaeology across dozens of tool calls.
Anti-pattern smell:
- No explicit max-steps or budget per task.
- No separate policy layer: the model both “decides what’s allowed” and “does the work.”
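A minimal sketch of the missing guardrails, with the step and token caps enforced by the orchestrator rather than the prompt (limits and function names are made up for illustration):

```python
MAX_STEPS = 3            # hard cap on tool-call iterations per task
MAX_TOKEN_BUDGET = 4000  # hard cap on tokens spent per task

def run_task(plan_next_step, execute_tool):
    """Bounded loop: the orchestrator, not the model, enforces the limits."""
    tokens_used = 0
    for step in range(MAX_STEPS):
        action, tokens = plan_next_step()  # one model call proposing an action
        tokens_used += tokens
        if tokens_used > MAX_TOKEN_BUDGET:
            return {"status": "budget_exceeded", "steps": step + 1}
        if action == "done":
            return {"status": "done", "steps": step + 1}
        execute_tool(action)
    return {"status": "max_steps_reached", "steps": MAX_STEPS}

# Stub planner that finishes on its second call.
calls = iter([("lookup_ticket", 500), ("done", 300)])
result = run_task(lambda: next(calls), lambda action: None)
```

A runaway planner that never says “done” terminates anyway, with a status you can alert on instead of a surprise bill.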
2. Replacing deterministic logic with “LLM reasoning”
Failure mode:
- Instead of “if SLA > 48h and customer is tier A, then escalate,” the prompt says “consider our SLAs and decide whether to escalate.”
Why it hurts:
- Business rules become non-auditable prompt text.
- Small model or prompt changes alter behavior without versioned change control.
- Troubleshooting incident tickets becomes “what did the model feel like doing?”
Better:
- LLM outputs intermediate labels (“customer_tier”, “severity”) and your code applies deterministic logic.
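A sketch of that split, assuming the model emits labels and your code owns the rule (the label names mirror the example above; the rule itself is illustrative):

```python
def should_escalate(labels: dict) -> bool:
    """Deterministic, auditable escalation rule over model-produced labels."""
    # customer_tier might come from the LLM or the CRM; hours_open from your system.
    return labels["customer_tier"] == "A" and labels["hours_open"] > 48

# The model only supplies labels; the policy lives in version-controlled code.
labels = {"customer_tier": "A", "severity": "high", "hours_open": 72}
```

When the rule changes, it changes in a reviewed diff, not in prompt text nobody audits.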
3. Ignoring long-tail error modes
In real business operations, a 1–2% failure rate is often unacceptable:
- Fraud queue: 1% misclassification may be regulatory or financial risk.
- HR or legal comms: 1% hallucinated policy explanations create exposure.
- Finance workflows: 1% wrong field in payouts or invoicing is a nightmare.
Teams get burned when:
- They only test on happy-path examples.
- They don’t distinguish precision vs recall needs per workflow.
- They have no backstop (human review, safe fallbacks) for low-confidence cases.
4. No cost controls
Patterns that blow up:
- Unbounded retries:
  - “If it doesn’t parse as JSON, just retry with the same prompt” → 10x token usage.
- Blindly using the largest model for everything:
  - Triage + summarization could use a small cheap model; instead everything hits the maxed-out flagship.
- Fan-out patterns:
  - For each ticket, call the LLM N times (summary, tags, sentiment, suggestion) instead of a single multi-task call.
Result:
- Bills that scale with worst-case event volume, not average.
- Resistance from finance/security before you reach broader rollout.
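One way to bound both retries and spend, sketched with stubbed model calls. The prices, retry cap, and function contract are all assumptions for illustration, not real rates:

```python
MAX_RETRIES = 2
PRICE_PER_1K_TOKENS = {"small": 0.0002, "large": 0.01}  # illustrative, not real prices

def call_with_bounds(call_model, parse, model: str = "small"):
    """Retry a bounded number of times and account for estimated spend."""
    spent = 0.0
    for attempt in range(1 + MAX_RETRIES):
        raw, tokens = call_model()  # (response text, token count)
        spent += tokens / 1000 * PRICE_PER_1K_TOKENS[model]
        parsed = parse(raw)
        if parsed is not None:
            return {"result": parsed, "cost": spent, "attempts": attempt + 1}
    # Give up explicitly instead of looping forever on the same bad prompt.
    return {"result": None, "cost": spent, "attempts": 1 + MAX_RETRIES}

# Stub: first response is unparseable, second succeeds.
responses = iter([("garbage", 800), ('{"tag": "billing"}', 800)])
outcome = call_with_bounds(
    lambda: next(responses),
    lambda raw: {"tag": "billing"} if raw.startswith("{") else None,
)
```

Because every call path goes through one function, the per-event cost figure falls out of the same code that enforces the retry cap.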
5. Compliance and data governance as an afterthought
Failure modes:
- Logging full prompts/responses with PII to arbitrary observability tools.
- Sending regulated data to external APIs without contract or DPA review.
- No way to answer “what decisions did this AI system make last quarter and why?”
This blocks or kills expansion into higher-value workflows (finance, legal, healthcare) and triggers late-stage audits that force re-architecture.
Practical playbook (what to do in the next 7 days)
Assuming you already have basic LLM access and some internal appetite for automation, here’s a one-week, pragmatic plan.
Day 1–2: Choose one concrete workflow
Pick a process with these characteristics:
- High volume, low inherent risk.
- Human work is structured but tedious.
- Inputs are text-heavy; outputs map to existing fields or actions.
Examples I’ve seen work:
- Support intake triage
  - Input: inbound tickets/emails.
  - Output: category, priority, suggested responder group, short summary.
  - Automation: auto-tag and route; humans remain in the loop.
- Invoice data extraction
  - Input: invoice PDFs or HTML.
  - Output: vendor, dates, line items, tax fields.
  - Automation: pre-fill fields for AP team; no auto-pay at first.
- Sales notes enrichment
  - Input: call transcripts, meeting notes.
  - Output: summary, next steps, risk flags.
  - Automation: auto-update CRM fields; sales rep can edit.
Day 3–4: Wrap the AI step inside your existing workflow system
Do not start with a shiny agent framework. Start with what you already run in production:
- Use your current:
  - Job queue / worker pool.
  - Background processing system.
  - Workflow/orchestration engine.
Implement:
- A single AI step as a well-defined function:
  - Input: typed struct (e.g., ticket body + metadata).
  - Output: typed struct validated against a JSON schema.
- Strict constraints:
  - JSON-only output.
  - Max tokens, temperature tuned low for reliability.
  - Retries with backoff and telemetry, not infinite loops.
Define:
- Fallbacks:
  - If parsing fails after N retries, mark as “needs manual classification.”
  - If model response violates a policy check, drop to human.
Add basic metrics:
- Number of AI calls.
- Success vs fallback count.
- Latency percentiles.
- Estimated cost per event.
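Pulling the pieces of this checklist together, a hedged sketch of such an AI step in Python. The types, retry cap, valid values, and the `call_model` contract are assumptions, not a prescribed implementation:

```python
from dataclasses import dataclass
from collections import Counter
from typing import Callable, Optional

metrics = Counter()  # success vs fallback counts; export to your metrics system

@dataclass
class TicketInput:
    body: str
    customer_id: str

@dataclass
class TicketLabels:
    category: str
    priority: str

def classify_ticket(ticket: TicketInput,
                    call_model: Callable[[str], Optional[dict]],
                    max_retries: int = 2) -> Optional[TicketLabels]:
    """One AI step: typed input, validated typed output, explicit fallback."""
    for _ in range(1 + max_retries):
        raw = call_model(ticket.body)  # returns a parsed dict, or None on failure
        if (raw and isinstance(raw.get("category"), str)
                and raw.get("priority") in {"low", "medium", "high"}):
            metrics["success"] += 1
            return TicketLabels(category=raw["category"], priority=raw["priority"])
    metrics["fallback"] += 1
    return None  # orchestrator marks the ticket "needs manual classification"

labels = classify_ticket(TicketInput("refund please", "c-123"),
                         lambda body: {"category": "billing", "priority": "high"})
```

The function boundary is the unit you test, trace, and cost-account; swapping the model behind `call_model` changes nothing about the workflow around it.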
Day 5: Wire in safety and policy
Before scaling traffic:
- Define a clear “no-go” zone. E.g., the model is not allowed to:
  - Issue refunds above $X.
  - Change customer contact info.
  - Make irreversible financial changes.
- Implement:
  - Explicit allowlisted tools/actions.
  - Human approval steps for any risky action.
  - Data filters/masking before logs and prompts.
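A policy check of this shape can be very small. The action names and threshold below are placeholders; the pattern is that the model proposes and this code disposes:

```python
ALLOWED_ACTIONS = {"tag_ticket", "route_ticket", "draft_reply", "issue_refund"}
REFUND_APPROVAL_THRESHOLD = 50.0  # dollars; the real value is a business decision

def check_action(action: dict) -> str:
    """Policy layer sitting between model output and anything that executes."""
    if action.get("name") not in ALLOWED_ACTIONS:
        return "reject"  # anything off the allowlist never executes
    if (action.get("name") == "issue_refund"
            and action.get("amount", 0.0) > REFUND_APPROVAL_THRESHOLD):
        return "needs_approval"  # risky actions go to a human queue
    return "allow"
```

Because this layer is separate from the prompt, no amount of prompt injection can expand what the system is physically able to do.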
Run a small red-team exercise:
- Ask: “How could this be abused or fail dangerously?”
- Try to prompt it into:
  - Ignoring policies.
  - Acting outside its domain.
  - Leaking sensitive data from context.
Patch prompts, tooling, or architecture based on what you find.
Day 6: Run a side-by-side “shadow” evaluation
For 24–72 hours:
- Keep humans doing the existing workflow as usual.
- In parallel, run the AI workflow in shadow mode, logging its outputs without acting on them.
