Stop Gluing Scripts Together: A Practical Look at AI Automation That Actually Ships

Why this matters this week
Executives keep walking into engineering reviews asking some version of:
“Why can’t we just have an AI agent that does this end-to-end?”
Meanwhile, you’re staring at:
- A decade of cron jobs and brittle RPA macros
- A support team that still retypes data between systems
- A cost line for “AI experiments” that’s creeping up with no clear ROI
What changed in the last 3–6 months is not “AGI soon.” It’s that:
- Foundation models have become good enough at unstructured glue work (reading emails, PDFs, logs, UI states).
- Tooling has emerged to treat LLMs as components in workflows, not chatbots.
- Vendors are trying to sell “agents” as a drop‑in replacement for RPA, often skipping over reliability, observability, and controls.
If you own production systems, the question is not “Should we use agents?”
It’s:
“Which 5–10% of our workflows are now cheaper and safer to automate with LLM‑driven workflows than with more RPA and headcount — and how do we do that without creating a new ball of mud?”
This post focuses on that: concrete mechanisms, not aspirational decks.
What’s actually changed (not the press release)
Three real shifts behind the “AI agents” narrative:
- LLMs can now act as a semi-reliable parser/executor for messy inputs
- Reading emails, PDFs, and logs and turning them into structured intents + parameters is no longer sci‑fi.
- This replaces piles of regex, brittle XPath, and visual locators that broke on every UI change.
- Example pattern:
- Before: Selenium bot scrapes invoice PDF → regex for amounts → posts to ERP.
- After: LLM + vision model extracts line items and metadata into a normalized schema, with confidence scores.
- Tool calling / function calling is standard
- Models from major providers now have first‑class tool calling: you give them a schema of actions they can take (e.g., `create_ticket`, `update_invoice`, `send_email`).
- This lets you build AI copilots that operate inside your stack rather than copy-pasting from a chat window.
- Key change: the model doesn’t just output text; it chooses which tools to invoke and with what parameters. (A minimal sketch of this pattern follows this list.)
- Agent frameworks and workflow orchestrators exist, but they’re immature in ops terms
- Frameworks (LangChain, Temporal-based systems, custom DAG tools) make it easier to:
- Maintain state across steps
- Retry and branch
- Log and replay
- But: few have first-class concepts for:
- Guardrails on blast radius (“this agent can only touch sandbox accounts”)
- Change management for prompts/workflows
- Deterministic testing across model upgrades
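As a concrete illustration of the tool-calling pattern, here is a minimal sketch using the OpenAI Python SDK's chat-completions interface (other providers expose the same idea under slightly different names). The tool schema, model name, and `triage_email` function are illustrative assumptions, not a recommendation of a specific stack.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "create_ticket",
            "description": "Open a support ticket in the ticketing system.",
            "parameters": {
                "type": "object",
                "properties": {
                    "customer_id": {"type": "string"},
                    "summary": {"type": "string"},
                    "priority": {"type": "string", "enum": ["low", "normal", "high"]},
                },
                "required": ["customer_id", "summary"],
            },
        },
    }
]

def triage_email(email_body: str) -> None:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whatever model your provider offers
        messages=[
            {"role": "system", "content": "Triage support emails. Use tools; never invent customer IDs."},
            {"role": "user", "content": email_body},
        ],
        tools=TOOLS,
    )
    msg = response.choices[0].message
    # Either the model proposes one or more tool calls...
    for call in msg.tool_calls or []:
        if call.function.name == "create_ticket":
            args = json.loads(call.function.arguments)
            print("Proposed ticket:", args)  # hand off to your executor / approval queue
    # ...or it answers in plain text for a human to review.
    if msg.content:
        print("Draft reply for human review:", msg.content)
```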
So you now have reasonable building blocks to replace some RPA and manual glue with AI automation.
What you don’t have is an out-of-the-box “AI employee” you can trust on day one.
How it works (simple mental model)
Forget “agents with goals and memory.” For real businesses, a more useful mental model is:
LLM = probabilistic state machine + planner that:
- Interprets messy inputs
- Chooses next actions from a fixed toolbox
- Proposes state updates
- Gets supervised by hard business rules and human review
Concretely, typical architecture for AI automation in production:
- Trigger layer
- Something happens:
- New email arrives in support inbox
- New document hits an S3 bucket
- CRM ticket moves to specific status
- You publish an event: `support_incoming_email_created`, `invoice_uploaded`, etc.
- Orchestration / workflow engine
- A workflow engine (Temporal, Airflow, Step Functions, or a custom worker) subscribes to that event.
- It runs a deterministic skeleton:
- Step 1: Call LLM to classify intent / extract entities
- Step 2: Validate output against schemas & rules
- Step 3: Decide path: auto‑resolve vs. human‑in‑loop
- Step 4: Execute tool calls (APIs, DB writes, notifications)
The LLM is a subroutine, not the orchestrator of everything (a minimal sketch of this skeleton appears at the end of this section).
- LLM “brain” with tools
You pass the model:
- The current state (structured)
- Limited context (relevant history, docs, policies)
- A list of tools it’s allowed to call, with:
- Names
- Schemas (types, constraints)
- Descriptions
- A system prompt that defines:
- Objectives
- Constraints (never refund > $X, never modify production without tag Y)
- Style (e.g., “always propose, never execute destructive actions directly”)
The model returns either:
- A structured tool call (`create_credit_memo` with parameters)
- Or a natural language response to send to a human
- Guardrails + execution
Before executing a proposed action:
- Validate JSON against schemas
- Apply business rules (limits, allowed transitions)
- Check for anomalies (e.g., large deviation from historical patterns)
- Possibly route to a human approver
- Feedback & learning loop
- Log:
- Inputs
- Model decisions
- Human overrides
- Use that for:
- Fine-tuning / prompt adjustment
- New explicit rules (e.g., specific vendor exceptions)
- Simulation before rollout of prompt or model changes
This keeps the LLM in a bounded, observable box. That’s how you get reliability good enough for real operations.
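Here is a stripped-down sketch of that skeleton in Python. The schema, confidence threshold, and helper functions (`classify_with_llm`, `send_to_review_queue`, `execute_tool`) are placeholders for your own model call and integrations, and the validation assumes pydantic v2.

```python
from pydantic import BaseModel, ValidationError

AUTO_RESOLVABLE_INTENTS = {"address_change", "invoice_copy_request"}  # illustrative whitelist

class TicketIntent(BaseModel):
    intent: str        # e.g. "refund_request", "address_change"
    customer_id: str
    confidence: float  # model's self-reported confidence, 0..1

def classify_with_llm(body: str) -> str:
    """Placeholder for the real model call; returns the model's JSON output as a string."""
    return '{"intent": "address_change", "customer_id": "C-123", "confidence": 0.93}'

def send_to_review_queue(event: dict, reason: str) -> None:
    print(f"Escalating to a human: {reason}")

def execute_tool(name: str, **params) -> None:
    print(f"Executing {name} with {params}")  # in production: a narrow, audited API wrapper

def handle_incoming_email(event: dict) -> None:
    # Step 1: LLM classifies intent / extracts entities.
    raw = classify_with_llm(event["body"])

    # Step 2: Validate the output against a schema; reject anything malformed.
    try:
        intent = TicketIntent.model_validate_json(raw)
    except ValidationError:
        send_to_review_queue(event, reason="unparseable model output")
        return

    # Step 3: Decide path: auto-resolve vs. human-in-the-loop.
    if intent.confidence < 0.8 or intent.intent not in AUTO_RESOLVABLE_INTENTS:
        send_to_review_queue(event, reason=f"low confidence or unsupported intent: {intent.intent}")
        return

    # Step 4: Execute tool calls through the guarded executor.
    execute_tool("update_customer_record", customer_id=intent.customer_id, change=intent.intent)

handle_incoming_email({"body": "Hi, I moved, please update my address."})
```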
Where teams get burned (failure modes + anti-patterns)
Patterns seen repeatedly in real deployments:
- “Let the agent figure it out” orchestration
- Anti-pattern: The LLM picks its own tools, decides when it’s “done,” and writes back to multiple systems without external constraints.
- Failure modes:
- Infinite loops or excessive tool calls
- Partial updates (one system changed, another not)
- Hard‑to‑replay sequences when things go wrong
- Fix:
- Orchestrator stays deterministic:
- It decides “what phase we are in”
- LLM only decides “what to do within this phase”
- No blast radius control
- Anti-pattern: Connecting agents directly to production systems with full privileges.
- Failure modes:
- Bulk updates gone wrong (e.g., wrong price applied to thousands of products)
- Compliance violations (e.g., sending PII in free‑form prompts to third‑party APIs)
- Fix:
- Shadow mode first: write to a “proposed_changes” table, not the real one (see the sketch after this list).
- Scope per use case:
- Read‑only on prod
- Write to staging / sandbox or via explicit approval queues.
- Treating prompts as comments, not as code
- Anti-pattern: Prompt edits directly in a UI with no versioning, review, or tests.
- Failure modes:
- A “small tweak” changes behavior in another edge case you never retest.
- You can’t correlate behavior regressions to a particular prompt change.
- Fix:
- Treat prompts and tool specs as config in source control:
- Code review for prompt changes
- Tag prompts with semantic versioning
- Run regression tests on representative scenarios before deployment
- Misaligned economics
- Anti-pattern: Running a “smart agent” over trivial, high‑volume tasks where a deterministic script would be cheaper and more reliable.
- Failure modes:
- LLM spend scales linearly with volume
- Latency increases vs. simple CRUD APIs
- Fix:
- Use LLMs for:
- Unstructured inputs
- Messy exception handling
- Complex classification / extraction
- Use RPA / scripts / standard APIs for:
- Stable, structured flows
- High‑volume, low‑variability operations
- No story for observability
- Anti-pattern: Treating the system like a black box — just “it seems to work.”
- Failure modes:
- Silent drift as upstream email formats or docs change
- Ops only notice when customers complain
- Fix:
- Metrics, per workflow:
- Automation rate
- Human escalation rate
- Error / rollback rate
- Cost per task
- Logs:
- Input → model output → tools called → final state
- Surface this like any other service: dashboards, alerts, traces
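To make the blast-radius and shadow-mode fixes concrete, here is a hedged sketch in which proposed actions pass through hard limits and land in a staging store instead of production. The limits, entity names, and in-memory `PROPOSED_CHANGES` list are illustrative stand-ins for your own rules and a real table.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

MAX_AUTO_REFUND = 100.00                                  # hard ceiling enforced in code (hypothetical)
ALLOWED_ENTITIES = {"sandbox_account", "support_ticket"}  # blast-radius scope

PROPOSED_CHANGES: list[dict] = []  # stand-in for a "proposed_changes" table

@dataclass
class ProposedAction:
    entity_type: str
    entity_id: str
    action: str
    amount: float = 0.0
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

def stage_action(action: ProposedAction) -> str:
    """Apply guardrails, then stage the action instead of executing it directly."""
    if action.entity_type not in ALLOWED_ENTITIES:
        return "rejected: outside allowed blast radius"
    if action.action == "refund" and action.amount > MAX_AUTO_REFUND:
        PROPOSED_CHANGES.append({**action.__dict__, "status": "needs_human_approval"})
        return "queued for human approval"
    PROPOSED_CHANGES.append({**action.__dict__, "status": "approved_for_shadow_write"})
    return "staged (shadow mode)"

print(stage_action(ProposedAction("support_ticket", "T-42", "refund", amount=250.0)))
print(stage_action(ProposedAction("production_db", "orders", "bulk_update")))
```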
Practical playbook (what to do in the next 7 days)
Don’t “adopt AI agents.” Pilot one narrow workflow with clear boundaries.
1. Identify a candidate workflow (2–4 hours)
Look for tasks with these traits:
- High volume, moderate complexity, lots of manual toil
- Unstructured inputs but structured outputs
- Low to moderate risk if something goes wrong (you can revert)
- Examples:
- Classifying and triaging inbound support tickets
- Extracting structured data from vendor invoices
- First‑pass KYC/AML document checks with human review
- Drafting responses for common customer questions, with agent assist
Validate with numbers:
- Volume per week
- Current average handling time
- Error/rework rate
- Estimated cost per task
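A quick back-of-envelope baseline makes those numbers actionable; every figure below is made up and should be replaced with your own.

```python
# All numbers are hypothetical; plug in your own.
weekly_volume = 1200         # tasks per week
handling_minutes = 6         # current average handling time per task
loaded_cost_per_hour = 45.0  # fully loaded cost of the person doing the work
rework_rate = 0.08           # fraction of tasks that need rework

weekly_cost = weekly_volume * (handling_minutes / 60) * loaded_cost_per_hour * (1 + rework_rate)
cost_per_task = weekly_cost / weekly_volume
print(f"Current weekly cost: ${weekly_cost:,.0f}  (~${cost_per_task:.2f} per task)")
```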
2. Map the workflow as a state machine (2–3 hours)
Ignore LLMs initially:
- Write states on a whiteboard: `received`, `parsed`, `validated`, `action_proposed`, `action_approved`, `executed`, `failed`
- For each state:
- Inputs
- Deterministic rules
- Where human review is required
- Mark where you must keep deterministic logic (e.g., compliance checks, monetary limits).
This is your orchestration skeleton.
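A minimal sketch of that state machine in Python, using the states listed above; the transition table and the placement of the human-review gate are illustrative, not a prescription.

```python
from enum import Enum

class State(Enum):
    RECEIVED = "received"
    PARSED = "parsed"
    VALIDATED = "validated"
    ACTION_PROPOSED = "action_proposed"
    ACTION_APPROVED = "action_approved"
    EXECUTED = "executed"
    FAILED = "failed"

# Allowed transitions: anything not listed here is rejected deterministically.
TRANSITIONS = {
    State.RECEIVED: {State.PARSED, State.FAILED},
    State.PARSED: {State.VALIDATED, State.FAILED},
    State.VALIDATED: {State.ACTION_PROPOSED, State.FAILED},
    State.ACTION_PROPOSED: {State.ACTION_APPROVED, State.FAILED},  # human review gate lives here
    State.ACTION_APPROVED: {State.EXECUTED, State.FAILED},
}

def advance(current: State, proposed: State) -> State:
    """Deterministic guard: the workflow engine, not the LLM, owns legal transitions."""
    if proposed not in TRANSITIONS.get(current, set()):
        raise ValueError(f"Illegal transition {current.value} -> {proposed.value}")
    return proposed

print(advance(State.RECEIVED, State.PARSED))  # State.PARSED
```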
3. Insert LLM only where it adds unique value (4–6 hours)
Typically 1–2 places:
- Interpretation:
- Classify intent / type
- Extract fields from messy content into a schema
- Planning/proposal:
- Given state + rules, propose an action (e.g., “issue 10% refund” + justification)
For each LLM call:
- Define input schema
- Define output schema
- Define a simple validation function (rejects low‑confidence or malformed outputs)
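For example, an output schema plus validation gate for a refund proposal might look like this (assuming pydantic v2; the field names, confidence threshold, and refund cap are placeholders):

```python
from pydantic import BaseModel, Field, ValidationError

class RefundProposal(BaseModel):
    """Output schema the LLM must fill when proposing a refund."""
    customer_id: str
    amount: float = Field(gt=0)
    justification: str
    confidence: float = Field(ge=0.0, le=1.0)

MAX_AUTO_REFUND = 100.00  # hypothetical hard limit; above this, a human decides

def validate_proposal(raw_json: str) -> RefundProposal | None:
    """Reject malformed, low-confidence, or over-limit outputs; None means escalate."""
    try:
        proposal = RefundProposal.model_validate_json(raw_json)
    except ValidationError:
        return None
    if proposal.confidence < 0.8 or proposal.amount > MAX_AUTO_REFUND:
        return None
    return proposal

print(validate_proposal('{"customer_id": "C-9", "amount": 12.5, "justification": "late delivery", "confidence": 0.91}'))
```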
4. Wrap tools with hard constraints (4–6 hours)
For every system the agent will touch (ticketing, billing, CRM):
- Create narrow tools:

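A hedged example of what a narrow tool can look like: instead of exposing the whole billing API, the agent gets a single function with hard-coded limits, and anything outside those limits is routed to approval rather than executed. The name `issue_credit_memo` and the cap are hypothetical.

```python
MAX_AUTO_CREDIT = 50.00  # hypothetical hard ceiling, enforced in code, not in the prompt

def issue_credit_memo(customer_id: str, amount: float, reason: str) -> dict:
    """The only billing action the agent can propose through this tool."""
    if amount <= 0 or amount > MAX_AUTO_CREDIT:
        # Out-of-bounds requests are never executed; they go to a human approval queue.
        return {"status": "needs_approval", "customer_id": customer_id,
                "amount": amount, "reason": reason}
    # In production this would call the billing API; here it just records the intent.
    return {"status": "queued", "customer_id": customer_id,
            "amount": amount, "reason": reason}

print(issue_credit_memo("C-17", 25.0, "shipping delay"))
print(issue_credit_memo("C-17", 500.0, "full order refund request"))
```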