Techliphant Technologies
AI

Shipping AI That Actually Resolves Cases: A Field Guide

Most AI support pilots fail not on the model, but on retrieval, evals and human handoff. Here is what we have learned shipping AI support to production.

Techliphant Engineering
May 12, 2026 9 min

The "wow" moment of a generative AI demo is cheap. The hard part is the long tail: the 5% of conversations where the model invents an answer, the 2% where it sends an action to the wrong account, and the 0.1% that turns into a public incident.

We have shipped AI support copilots to multiple production environments. Here is what consistently makes the difference between a pilot that gets killed and one that scales.

1. Retrieval is 80% of the work

The model is rarely the bottleneck. The bottleneck is whether retrieval surfaces the right two paragraphs at the right moment.

Most teams start by feeding a help centre into a vector store and calling it done. That gets you to maybe 40% first-contact resolution. To get to 80%+, you need a richer corpus and more disciplined indexing:

  • Help centre and policy documents stripped of navigation chrome, de-duplicated, and split on semantic boundaries, not arbitrary token counts.
  • Resolved ticket history with PII redacted. The answer to "how do I cancel my plan if I bought it through a reseller?" almost certainly exists in a closed ticket from eight months ago. RAG finds it. A freshly prompted model invents.
  • Live transactional data: order status, subscription state, account flags, entitlements. This is not retrieved via vector search; it is fetched via tool call with the customer's ID. The agent needs to know the difference.
  • Policy snapshots with effective dates. Return policies change. Shipping SLAs change. Version your policy corpus and track which version was in effect when a conversation happened.
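Versioning the policy corpus comes down to one lookup: given a conversation date, which version was in effect? A minimal sketch, assuming each version records the date it took effect. The `PolicyVersion` type, the `version_in_effect` helper, the `returns-v1`/`returns-v2` file names and all dates are hypothetical, made up for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class PolicyVersion:
    doc_id: str           # path of the snapshot in the corpus
    effective_from: date  # first day this version applies


def version_in_effect(versions: List[PolicyVersion], on: date) -> Optional[PolicyVersion]:
    """Return the latest version whose effective date is on or before `on`."""
    ordered = sorted(versions, key=lambda v: v.effective_from)
    starts = [v.effective_from for v in ordered]
    # bisect_right counts how many versions had already started by `on`.
    idx = bisect_right(starts, on)
    return ordered[idx - 1] if idx else None


# Hypothetical history for a returns policy (dates are illustrative only).
RETURNS = [
    PolicyVersion("policy/returns-v1.md", date(2024, 1, 1)),
    PolicyVersion("policy/returns-v2.md", date(2025, 3, 1)),
    PolicyVersion("policy/returns-v3.md", date(2026, 2, 1)),
]

in_effect = version_in_effect(RETURNS, date(2025, 6, 15))
```

The same lookup can serve a grader that checks citations against the version in effect on the conversation date, rather than whatever is newest in the index.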

2. Build your eval set before you write a single prompt

The instinct is to write a prompt, test it manually in the playground, and iterate. The result is a system tuned to the five scenarios you thought of. Production sends you the five thousand you didn't.

Our sequencing on every support engagement:

  1. Pull 12 months of closed tickets and cluster them. You will typically find that 6–9 intent types drive 65–70% of volume.
  2. For each intent, collect 30–50 examples: half from good resolutions, half from escalations, a handful of hostile inputs.
  3. Write graders: not just "did the answer sound right" but "did it cite the correct policy version", "did it avoid making a commitment about delivery dates it cannot verify", "did it detect frustration and not match frustration back".
  4. Gate deployment on eval pass rate: we require 90%+ before any feature goes live.
JSON
{
  "evalSet": "support.v2.nightly",
  "sampleRate": 0.04,
  "redactors": ["pii", "order-ids", "account-numbers"],
  "graders": [
    "intent-correct",
    "citation-present",
    "no-fabricated-commitment",
    "policy-version-current",
    "tone-appropriate",
    "escalation-detected-when-needed"
  ],
  "passThreshold": 0.90,
  "alertOn": ["intent-correct < 0.85", "no-fabricated-commitment < 0.97"]
}

The no-fabricated-commitment threshold is deliberately tighter. A wrong tone is recoverable. A fabricated refund promise is a liability.
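To show how the gate and the alerts interact, here is a rough sketch that applies `passThreshold` and `alertOn` to a batch of grader results. It is not the actual eval runner: the function names are hypothetical, and it assumes a conversation counts as a pass only when every grader passes, which the config above does not specify.

```python
from collections import defaultdict
from typing import Dict, List

CONFIG = {
    "passThreshold": 0.90,
    "alertOn": ["intent-correct < 0.85", "no-fabricated-commitment < 0.97"],
}


def grader_pass_rates(results: List[Dict[str, bool]]) -> Dict[str, float]:
    """Fraction of sampled conversations that passed each grader."""
    passed, total = defaultdict(int), defaultdict(int)
    for row in results:
        for grader, ok in row.items():
            total[grader] += 1
            passed[grader] += int(ok)
    return {g: passed[g] / total[g] for g in total}


def evaluate_gate(results: List[Dict[str, bool]], config: dict = CONFIG) -> dict:
    """Decide whether to deploy and which alert rules fire."""
    if not results:
        raise ValueError("no eval results to score")
    rates = grader_pass_rates(results)
    # Assumption: a conversation passes only if all of its graders pass.
    overall = sum(all(row.values()) for row in results) / len(results)
    alerts = []
    for rule in config["alertOn"]:
        grader, op, limit = rule.split()
        if op != "<":
            raise ValueError(f"unsupported operator in rule: {rule}")
        if grader in rates and rates[grader] < float(limit):
            alerts.append(rule)
    return {
        "overall": overall,
        "deploy": overall >= config["passThreshold"],
        "alerts": alerts,
    }
```

With 96 clean conversations and 4 that fabricate a commitment, the overall rate of 0.96 clears the 0.90 gate, yet the `no-fabricated-commitment` alert still fires. That is the tighter threshold doing its job.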

3. Human handoff is a product feature, not a fallback

The most-trusted AI support deployments are the ones that escalate cleanly and quickly. Customers forgive AI for not knowing something; they do not forgive AI for keeping them in a loop that clearly isn't going anywhere.

We bake three classes of escalation signal into every system:

  • Hard escalation triggers immediately: refund requests above a threshold, mentions of legal action, accessibility needs, accounts flagged for special handling.
  • Soft escalation triggers after a confidence check: frustration sentiment sustained for two or more turns, unresolvable policy edge cases, topics outside the indexed corpus.
  • Silent escalation alerts a supervisor without the customer seeing a transfer: it applies when the agent is technically resolving the case but its confidence is below threshold.
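The three classes can be expressed as an ordered check, hardest signal first. This is a simplified sketch: the `TurnState` fields, the refund limit and the confidence floor are hypothetical placeholders, and the confidence check attached to soft triggers is left out for brevity.

```python
from dataclasses import dataclass
from enum import Enum


class Escalation(Enum):
    NONE = "none"
    HARD = "hard"
    SOFT = "soft"
    SILENT = "silent"


@dataclass
class TurnState:
    refund_amount: float = 0.0
    mentions_legal_action: bool = False
    accessibility_need: bool = False
    special_handling_flag: bool = False
    frustrated_turns_in_a_row: int = 0
    policy_edge_case: bool = False
    topic_in_corpus: bool = True
    confidence: float = 1.0


# Hypothetical thresholds; real values are a business decision.
REFUND_LIMIT = 200.0
CONFIDENCE_FLOOR = 0.7


def classify(state: TurnState) -> Escalation:
    """Return the strongest escalation class that applies to this turn."""
    if (
        state.refund_amount > REFUND_LIMIT
        or state.mentions_legal_action
        or state.accessibility_need
        or state.special_handling_flag
    ):
        return Escalation.HARD
    if (
        state.frustrated_turns_in_a_row >= 2
        or state.policy_edge_case
        or not state.topic_in_corpus
    ):
        return Escalation.SOFT
    if state.confidence < CONFIDENCE_FLOOR:
        # The conversation continues; only a supervisor is alerted.
        return Escalation.SILENT
    return Escalation.NONE
```

Checking hard triggers first means a legal threat is never downgraded to a soft escalation just because the customer also sounds frustrated.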

When a transfer happens, the agent packages everything into a handoff envelope:

Bash
$ soliq agent tail --event handoff --feature support-copilot
2026-05-12T11:22:44  conv=c_Xr7m  signal=frustration+high-value
2026-05-12T11:22:44  turns=6  escalation_turn=4
2026-05-12T11:22:44  top_chunks: [policy/returns-v3.md, faq/reseller-orders.md]
2026-05-12T11:22:44  tools_attempted: [get_order_status(OK), initiate_return(BLOCKED: reseller)]
2026-05-12T11:22:44  summary: "Customer purchased via reseller; return policy differs. Agent could not initiate return. Customer expressed frustration on turn 4."
2026-05-12T11:22:44  routed_to: tier-2, queue=reseller-escalations

The human opens to a one-paragraph summary, the retrieved citations, and a clear note on what the AI tried and why it stopped. Cold transfers do not exist in this model.
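One way to model the envelope is a plain data structure that serialises for routing and renders a short briefing for the human. The sketch below mirrors the fields in the log output above; the class names, field names and the `briefing` layout are our own illustration, not a fixed schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ToolAttempt:
    name: str
    outcome: str  # e.g. "OK" or "BLOCKED: reseller"


@dataclass
class HandoffEnvelope:
    conversation_id: str
    signal: str
    turns: int
    escalation_turn: int
    top_chunks: List[str]
    tools_attempted: List[ToolAttempt]
    summary: str
    routed_to: str
    queue: str

    def to_json(self) -> str:
        """Serialise the envelope, including nested tool attempts."""
        return json.dumps(asdict(self), indent=2)

    def briefing(self) -> str:
        """Render what the human sees first: summary, citations, attempts."""
        tools = ", ".join(f"{t.name} ({t.outcome})" for t in self.tools_attempted)
        return "\n".join([
            f"Summary: {self.summary}",
            f"Citations: {', '.join(self.top_chunks)}",
            f"AI attempted: {tools}",
            f"Escalated on turn {self.escalation_turn} of {self.turns} ({self.signal})",
            f"Routed to: {self.routed_to} / {self.queue}",
        ])


# Values copied from the log lines above.
EXAMPLE = HandoffEnvelope(
    conversation_id="c_Xr7m",
    signal="frustration+high-value",
    turns=6,
    escalation_turn=4,
    top_chunks=["policy/returns-v3.md", "faq/reseller-orders.md"],
    tools_attempted=[
        ToolAttempt("get_order_status", "OK"),
        ToolAttempt("initiate_return", "BLOCKED: reseller"),
    ],
    summary=(
        "Customer purchased via reseller; return policy differs. Agent could "
        "not initiate return. Customer expressed frustration on turn 4."
    ),
    routed_to="tier-2",
    queue="reseller-escalations",
)
```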

4. Actions need exactly two safety nets

Letting an agent take actions (initiate a return, send a voucher, update an address) is where the real resolution rate gains live. It is also where the most expensive mistakes live.

Two non-negotiables before any tool goes into production:

Idempotency keys on every write. Language models retry. They retry creatively: sometimes mid-response, sometimes across sessions. Without idempotency, a retry on a refund becomes two refunds.

Human-in-the-loop for irreversible or high-value actions. We set explicit thresholds: any refund over a defined amount pauses for human approval. The agent explains what it wants to do and why; the human approves in one click.
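The two nets compose naturally in a single write path. The sketch below uses an in-memory stand-in for a payments backend: `RefundGateway`, `RefundResult` and the approval threshold are all hypothetical, and a real system would keep the key store in durable storage.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Optional

APPROVAL_THRESHOLD_CENTS = 10_000  # hypothetical: refunds above this need a human


def idempotency_key(action: str, order_id: str, amount_cents: int) -> str:
    """Stable key: identical requests always hash to the same value."""
    raw = f"{action}:{order_id}:{amount_cents}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]


@dataclass(frozen=True)
class RefundResult:
    status: str  # "issued" or "pending_approval"
    refund_id: Optional[str] = None


@dataclass
class RefundGateway:
    """In-memory stand-in for a payments backend."""
    issued: Dict[str, RefundResult] = field(default_factory=dict)

    def refund(self, order_id: str, amount_cents: int, approved: bool = False) -> RefundResult:
        key = idempotency_key("refund", order_id, amount_cents)
        # Net 1: a retry with the same key returns the original result.
        if key in self.issued:
            return self.issued[key]
        # Net 2: high-value refunds pause until a human approves.
        if amount_cents > APPROVAL_THRESHOLD_CENTS and not approved:
            return RefundResult(status="pending_approval")
        result = RefundResult(status="issued", refund_id=f"rf_{len(self.issued) + 1}")
        self.issued[key] = result
        return result
```

Calling `refund("o_1", 2500)` twice issues one refund and returns the same result both times; a 50,000-cent refund stays in `pending_approval` until it is called again with `approved=True`.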

Python
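import hashlib

# `tool`, `ToolError`, `ReturnResult`, `is_reseller_order` and `oms` come from the
# surrounding agent framework and order-management client.


@tool(idempotent=True, dry_run_capable=True)
def initiate_return(order_id: str, items: list[str], reason: str) -> ReturnResult:
    """Initiate a return. Orders via reseller channels are blocked."""
    if is_reseller_order(order_id):
        raise ToolError("reseller_channel", "Returns for reseller orders require manual processing.")
    # Python's built-in hash() is salted per process for strings, so the key is
    # derived from a stable digest instead. Sorting makes it independent of item order.
    items_digest = hashlib.sha256("|".join(sorted(items)).encode()).hexdigest()[:16]
    return oms.returns.create(
        order_id=order_id,
        items=items,
        reason=reason,
        idempotency_key=f"return:{order_id}:{items_digest}",
    )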

The dry_run_capable flag on the decorator lets nightly evals exercise the full agent loop, including tool calls, without triggering real side effects. It is the single change that makes eval-on-real-traffic safe.
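How a dry-run switch might work under the hood: a context flag that tool wrappers consult before touching anything real. This is an illustrative sketch, not the implementation of the `@tool` decorator above; `dry_run`, `side_effecting` and `send_voucher` are hypothetical names.

```python
import functools
from contextlib import contextmanager
from contextvars import ContextVar

_DRY_RUN = ContextVar("dry_run", default=False)


@contextmanager
def dry_run():
    """Within this block, wrapped tools describe the call instead of running it."""
    token = _DRY_RUN.set(True)
    try:
        yield
    finally:
        _DRY_RUN.reset(token)


def side_effecting(fn):
    """Wrap a tool so it is skipped (and recorded) in dry-run mode."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if _DRY_RUN.get():
            return {"dry_run": True, "tool": fn.__name__, "args": args, "kwargs": kwargs}
        return fn(*args, **kwargs)
    return wrapper


SENT = []  # stands in for the real side effect


@side_effecting
def send_voucher(customer_id: str, amount: int) -> dict:
    SENT.append(customer_id)
    return {"dry_run": False, "voucher_for": customer_id, "amount": amount}


# An eval run wraps the agent loop; nothing is actually sent.
with dry_run():
    preview = send_voucher("cust_1", amount=10)
```

Because the flag lives in a context variable rather than a global, a concurrent live conversation in another task is unaffected by an eval running in dry-run mode.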

5. Multi-channel is a deployment detail, not an architecture decision

Support conversations come from web chat, WhatsApp, email, social DMs, voice IVR. The instinct is to build channel-specific bots. The right answer is one agent with channel-aware rendering.

Every channel has constraints: WhatsApp messages cannot exceed 4096 characters and do not render standard Markdown; voice needs short turns with no lists. The agent prompt receives the channel as a context variable, and the output schema enforces the constraints at generation time, not post-hoc with a truncation script.
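A sketch of what channel-aware rendering can look like: one table of constraints, turned into prompt instructions before generation and into a validator afterwards, so a violating draft is regenerated rather than truncated. The `ChannelSpec` type, the function names, the 300-character voice limit and the regexes are assumptions for illustration; true schema-level enforcement depends on what your model provider supports.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ChannelSpec:
    max_chars: Optional[int]
    allow_markdown: bool
    allow_lists: bool


CHANNELS: Dict[str, ChannelSpec] = {
    "whatsapp": ChannelSpec(max_chars=4096, allow_markdown=False, allow_lists=True),
    # Hypothetical limit standing in for "short turns".
    "voice": ChannelSpec(max_chars=300, allow_markdown=False, allow_lists=False),
    "web_chat": ChannelSpec(max_chars=None, allow_markdown=True, allow_lists=True),
}

# Rough heuristics, not a full Markdown parser.
_MARKDOWN = re.compile(r"(\*\*|__|`|^#{1,6}\s|\[[^\]]+\]\([^)]+\))", re.MULTILINE)
_LIST_ITEM = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+", re.MULTILINE)


def rendering_instructions(channel: str) -> str:
    """Prompt fragment describing the channel's constraints."""
    spec = CHANNELS[channel]
    rules = []
    if spec.max_chars is not None:
        rules.append(f"Keep the reply to at most {spec.max_chars} characters.")
    if not spec.allow_markdown:
        rules.append("Use plain text with no Markdown.")
    if not spec.allow_lists:
        rules.append("Do not use bulleted or numbered lists.")
    return " ".join(rules)


def violations(channel: str, text: str) -> List[str]:
    """Constraints the draft breaks; a non-empty result means regenerate."""
    spec = CHANNELS[channel]
    found = []
    if spec.max_chars is not None and len(text) > spec.max_chars:
        found.append("too_long")
    if not spec.allow_markdown and _MARKDOWN.search(text):
        found.append("markdown")
    if not spec.allow_lists and _LIST_ITEM.search(text):
        found.append("lists")
    return found
```

The agent logic stays identical across channels; only the `CHANNELS` table grows when a new channel is added.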


The bottom line

First-contact resolution above 80% is achievable and sustainable. The work is not in the model. It is in a retrieval corpus that is honest about what it knows, evals that measure what actually matters on real traffic, handoffs that make humans faster rather than slower, and tool rails that make actions safe to automate.

If you are sitting at 40% FCR and the model keeps "hallucinating", check retrieval first. Then check whether you have an eval set at all. The model is doing its best with what you gave it.

