AI Support Copilot
A support copilot generates responses to customer queries using a combination of retrieved knowledge, CRM context, and tool calls. It operates at high volume with low tolerance for error — a hallucination isn't a lab curiosity, it's a support ticket that escalates.
System architecture
- LLM — generates customer-facing responses
- Knowledge base — product documentation, policy articles
- CRM integration — account context, case history via tool call
- Guardrail layer — checks response before delivery
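The flow through these components can be sketched as a minimal pipeline. Everything here is illustrative — the function names, the stub knowledge base, and the guardrail rule are assumptions, not Reliai's or any vendor's actual API:

```python
from dataclasses import dataclass

@dataclass
class Draft:
    text: str
    sources: list  # retrieved passages the answer is grounded in

def retrieve(query):
    # stub knowledge-base lookup (real system: vector/keyword search)
    return ["Refunds are available within 30 days of purchase."]

def crm_lookup(account_id):
    # stub CRM tool call returning account context
    return {"account_id": account_id, "tier": "standard"}

def generate(query, context, crm):
    # stub LLM call; a real model would condition on all three inputs
    return Draft(text=f"Per policy: {context[0]}", sources=context)

def guardrail(draft):
    # block any draft that cites no retrieved sources before delivery
    return draft if draft.sources else Draft("Let me check on that for you.", [])

def handle(query, account_id):
    context = retrieve(query)
    crm = crm_lookup(account_id)
    return guardrail(generate(query, context, crm))

reply = handle("Can I get a refund?", "acct-42")
```

The guardrail sits last for a reason: it sees the final customer-facing text, not intermediate state.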
What can go wrong
- Hallucinated product facts or pricing
- Incorrect tool call parameters (wrong account, wrong API)
- Refusal spikes due to overly conservative guardrails
- Inconsistent responses to identical queries (non-determinism at scale)
- Prompt drift after A/B test or version rollout
Detect
Reliai identifies:
- failure rate spike correlated with a prompt or model version change
- increase in negative customer feedback rate (when feedback is routed back into Reliai)
- divergence in trace structure across model versions
- refusal rate change independent of query volume change
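The first signal above — a failure-rate spike that is independent of volume — reduces to comparing rates, not counts, between a baseline window and the current window. A minimal sketch, with thresholds chosen arbitrarily for illustration:

```python
def failure_rate(events):
    # events: list of dicts with a boolean "failed" flag
    return sum(e["failed"] for e in events) / max(len(events), 1)

def spike(baseline, current, min_ratio=3.0, min_rate=0.05):
    # flag only when the current rate is both materially high and a large
    # multiple of baseline; comparing rates (not counts) makes the check
    # robust to query-volume changes
    base = failure_rate(baseline)
    cur = failure_rate(current)
    return cur >= min_rate and cur >= min_ratio * max(base, 1e-9)

baseline = [{"failed": i < 4} for i in range(100)]    # 4% failure rate
current = [{"failed": i < 19} for i in range(100)]    # 19% failure rate
alert = spike(baseline, current)
```

The 4% → 19% figures mirror the incident below; at those rates the check fires, while identical baseline windows stay quiet.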
Understand
Incident example (INC-1423)
Production copilot begins generating hallucinated answers about refund eligibility.
- Failure rate: 4% → 19% over 20 minutes
- Trigger: prompt v42 deployed to 100% of traffic
- Impact: 340 incorrect responses delivered before detection
Root cause
Prompt v42 increased response verbosity and reduced the explicit instruction to ground answers in retrieved policy text. The model began filling gaps with parametric knowledge — which was out of date.
Reliai identified via:
- prompt diff between v41 and v42
- trace comparison across 80 failing vs 80 baseline requests
- clustering of failures by query category (refund, cancellation)
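The prompt-diff step can be reproduced with a plain textual diff. The prompt contents below are invented for illustration — only the v41/v42 version labels come from the incident:

```python
import difflib

prompt_v41 = """You are a support copilot.
Answer only from the provided context.
Keep responses concise."""

prompt_v42 = """You are a support copilot.
Provide a thorough, detailed answer."""

# unified diff makes the dropped grounding instruction visible as a "-" line
diff = list(difflib.unified_diff(
    prompt_v41.splitlines(), prompt_v42.splitlines(),
    fromfile="prompt_v41", tofile="prompt_v42", lineterm=""))
print("\n".join(diff))
```

The removed `Answer only from the provided context.` line shows up immediately — the exact change the root cause points at.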
AI never decides root cause. It only explains what the system already determined.
Fix
- Revert prompt to v41
- Add explicit grounding instruction: "answer only from the provided context"
- Reduce max response length to limit verbosity
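The three remediation steps amount to a small configuration change. This is a hypothetical sketch — the config keys and helper function are illustrative, not a real deployment API:

```python
# hypothetical rollout config for the remediation; key names are illustrative
remediation = {
    "prompt_version": "v41",                # revert from v42
    "grounding_instruction": (
        "Answer only from the provided context. "
        "If the context does not answer the question, say so."
    ),
    "max_response_tokens": 300,             # cap verbosity
}

def build_system_prompt(base_prompt, cfg):
    # append the explicit grounding instruction to the reverted prompt
    return base_prompt.rstrip() + "\n" + cfg["grounding_instruction"]

prompt = build_system_prompt("You are a support copilot.", remediation)
```

Keeping the grounding instruction as a separate config field means the next prompt iteration cannot silently drop it.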
Prove
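One way to close the loop is to check that the post-fix failure rate has returned to the pre-incident baseline. The function and tolerance below are assumptions; only the 4% baseline and 19% peak come from the incident:

```python
def fix_verified(post_fix_rate, baseline_rate, tolerance=0.02):
    # treat the incident as resolved only when the post-fix failure rate
    # is back within tolerance of the pre-incident baseline (4% in INC-1423)
    return post_fix_rate <= baseline_rate + tolerance

still_broken = fix_verified(0.19, 0.04)   # rate unchanged: not proven
recovered = fix_verified(0.05, 0.04)      # back near baseline: proven
```

Re-running the same check that detected the spike, against the same baseline, is what turns "we reverted" into "we proved it".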
Key takeaway
Copilot failures are often interactions between the prompt and retrieval, not faults in either one alone.
A prompt that worked at lower verbosity fails at higher verbosity because the model fills context gaps differently. The only way to catch this is trace comparison across prompt versions at scale.