Agent Workflows

An agent system extends a model with tools, memory, and multi-step planning. Each step produces an output that becomes the input for the next. Failures cascade — a single bad tool call can consume all remaining context, drive up latency, and leave the user's request permanently unfinished.
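The step-chaining described above can be sketched as a minimal loop. All names here (the planner, the tool registry, the step budget) are hypothetical, illustrating only the structure: each step's output is appended to context and feeds the next planning decision, which is why one bad tool result propagates everywhere.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    done: bool
    tool: str = ""
    args: str = ""
    answer: str = ""

def plan_next_step(context):
    # Toy planner: finish as soon as any tool result is in context.
    if len(context) > 1:
        return Plan(done=True, answer=context[-1])
    return Plan(done=False, tool="search", args=context[0])

def run_agent(task, tools, max_steps=10):
    context = [task]                          # accumulated memory
    for _ in range(max_steps):
        plan = plan_next_step(context)        # decide the next step
        if plan.done:
            return plan.answer
        result = tools[plan.tool](plan.args)  # tool call
        context.append(result)                # output becomes next input
    raise RuntimeError("step budget exhausted")

tools = {"search": lambda q: f"results for {q}"}
print(run_agent("agent failure modes", tools))  # results for agent failure modes
```

Note the `max_steps` bound: without it, a planner that never returns `done` runs forever, which is exactly the failure mode examined below.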


System architecture

The pipeline chains a planner, tool calls, and memory: each step's output becomes the next step's input, so state flows through every stage of the system.

What can go wrong

Tool APIs change schemas without notice, planners retry on conditions that can no longer be met, loops burn context and tokens, and latency cascades across every request in flight.

Detect

Reliai identifies loop incidents from trace-level execution patterns rather than from individual log lines.

Sampling is active because agent traces are high volume; roughly 400 spans/min are dropped during loop incidents. Evidence may therefore be partial, and the root-cause report notes when it is.
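Why sampling leaves partial evidence can be seen in a minimal sketch of head-based probabilistic sampling. The keep rate and span shape here are assumptions for illustration, not Reliai's actual configuration: dropped spans may hold the very repetitions needed to prove a loop.

```python
import random

def sample_spans(spans, keep_rate=0.2, seed=0):
    # Head-based sampling: keep a fixed fraction of spans, chosen
    # up front with no knowledge of which spans matter later.
    rng = random.Random(seed)
    return [s for s in spans if rng.random() < keep_rate]

spans = [{"tool": "web_search", "step": i} for i in range(1000)]
kept = sample_spans(spans)
print(f"kept {len(kept)} of {len(spans)} spans")
```

Tail-based sampling, which decides after a trace completes, would retain loop traces intact, at the cost of buffering; head-based sampling is cheaper but blind to outcome.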


Understand

Incident example

Production research agent enters a loop, repeatedly calling a web search tool with the same query. Each iteration consumes 2,000–4,000 tokens. Requests never complete.

Root cause

Search API v3 changed its response schema. The planner was configured to retry if results did not contain a specific field that no longer existed in the new schema. It retried indefinitely.
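The failure reduces to a retry condition keyed on a field the new schema no longer returns. A minimal sketch, where the field names (`results` vs. `items`) and the function names are hypothetical stand-ins for the actual API:

```python
def search_v3(query):
    # v3 response: "items" replaced the old "results" field.
    return {"items": [f"hit for {query}"], "query": query}

def fetch_results_buggy(query, max_iterations=5):
    # Buggy planner logic: retry until "results" appears.
    # v3 never returns "results", so this exhausts the cap here;
    # in production there was no cap, so it retried indefinitely.
    for _ in range(max_iterations):
        response = search_v3(query)
        if "results" in response:     # field removed in v3
            return response["results"]
    return None                       # never succeeds against v3
```

Each futile iteration still pays for a full tool call, which is where the 2,000–4,000 tokens per loop went.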

Reliai identified the failure through deterministic signals, not AI inference.

AI vs. system signals:

Deterministic — root cause, metrics, traces, patterns
AI-assisted — summaries, explanations, ticket drafts

AI never decides root cause. It only explains what the system already determined.


Fix
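The source does not give the actual patch, but one plausible remediation is sketched below: validate against the current schema instead of a removed field, tolerate the legacy field during migration, and bound retries so schema drift can never loop forever. The field names and retry limit are assumptions.

```python
def search_v3(query):
    # v3 response: "items" replaced the old "results" field.
    return {"items": [f"hit for {query}"]}

def fetch_results_fixed(query, max_retries=3):
    # Check the current schema first, fall back to the legacy one,
    # and cap retries so an unrecognized schema fails fast and loudly.
    for _ in range(max_retries):
        response = search_v3(query)
        if "items" in response:       # current v3 field
            return response["items"]
        if "results" in response:     # tolerate the legacy schema
            return response["results"]
    raise ValueError(f"unrecognized search schema after {max_retries} tries")
```

Failing fast with an explicit error converts an invisible infinite loop into a single alertable exception.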


Prove

INC-2204 — Agent execution loop
Avg latency: 38s → 5s
Resolved in 22 minutes · Completion rate restored to 90% · Measured across 800 requests

Key takeaway

Agent failures are systemic, not single-response errors.

A single bad tool output can cascade across every step of every request in flight. The only way to detect loop patterns is trace-level analysis — log lines and error rates alone won't show you the execution structure.
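At its simplest, trace-level loop detection means counting repeated identical tool calls within a single trace, a pattern invisible in log lines or aggregate error rates. A sketch, where the span shape and the repetition threshold are assumptions:

```python
from collections import Counter

def detect_loops(spans, threshold=3):
    # Flag any (tool, arguments) pair repeated `threshold`+ times
    # within one trace's spans.
    calls = Counter((s["tool"], s["args"]) for s in spans)
    return [call for call, n in calls.items() if n >= threshold]

trace = [{"tool": "web_search", "args": "q=llm agents"} for _ in range(6)]
print(detect_loops(trace))  # [('web_search', 'q=llm agents')]
```

A production detector would also weigh ordering (consecutive repeats are stronger evidence than scattered ones), but the essential point stands: the signal lives in the trace structure, not in any single span.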