Shipping LLM Features That Survive Monday Morning

The demo always works. The Monday-morning rollout is where reality lives: the five percent of conversations where the model invents an answer, the two percent where the action lands on the wrong account, the rare conversation that turns into an incident. After shipping a few dozen LLM features into real production environments, we see a consistent pattern: the model is never the bottleneck.

This is the playbook we run on every LLM build at Techliphant. None of it is glamorous. All of it is load-bearing.

1. Retrieval is the product

Before you tune a prompt, tune retrieval. We have watched teams spend three weeks rewording system prompts when the actual fix was: stop embedding the page footer.

Our default ingest pipeline strips boilerplate, splits on semantic boundaries, and stores citations alongside the embedding so the model can prove its work:

soliq/retrieval/pipeline.ts

TypeScript

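import { createHash } from 'node:crypto';
import { embed, stripBoilerplate, splitBySemanticBoundaries } from '@soliq/retrieval';
// SourceDoc and Chunk are assumed to be declared elsewhere in this package.

// The original snippet calls hash() without defining it; a SHA-256 helper is assumed here.
const hash = (s: string): string => createHash('sha256').update(s).digest('hex');

export async function ingest(doc: SourceDoc): Promise<Chunk[]> {
  // Strip footers, nav chrome, and signatures before anything is embedded.
  const cleaned = stripBoilerplate(doc.body);
  const sections = splitBySemanticBoundaries(cleaned);

  return Promise.all(
    sections.map(async (s) => ({
      id: hash(`${doc.id}:${s.heading}`),
      docId: doc.id, // docId and heading together act as the citation
      heading: s.heading,
      text: s.text,
      embedding: await embed(s.text),
      tenantId: doc.tenantId, // row-level ACL
    })),
  );
}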

Figure: Retrieval pipeline, from source HTML to semantic chunks to indexed embeddings. The three stages we never skip: boilerplate stripping, semantic splitting, and an indexed hybrid store.

A few rules we never bend:

Strip boilerplate first. Footers, nav chrome, signatures: they poison both vector and keyword ranking.
Split on semantics, never on fixed token counts. Section breaks beat a 512-token guillotine every time.
Hybrid ranking: keyword, vector, and recency, scored together. Pure vector loses on identifiers; pure keyword loses on paraphrase.
Carry citations. Every chunk knows its source; every answer can be audited.
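
To make the hybrid-ranking rule concrete, here is a minimal sketch of scoring keyword, vector, and recency signals together. It is an illustration, not the Soliq ranker: the `Candidate` fields, the 0.4/0.4/0.2 weights, and the 90-day half-life are all assumptions, and both input scores are assumed to be normalized to the range 0 to 1 already.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    chunk_id: str
    keyword_score: float  # e.g. a BM25 score, assumed normalized to [0, 1]
    vector_score: float   # e.g. cosine similarity, assumed normalized to [0, 1]
    age_days: float       # days since the source document was last updated


def recency_score(age_days: float, half_life_days: float = 90.0) -> float:
    """Exponential decay: 1.0 for a fresh document, 0.5 after one half-life."""
    return 0.5 ** (age_days / half_life_days)


def hybrid_score(c: Candidate, w_kw: float = 0.4, w_vec: float = 0.4,
                 w_rec: float = 0.2) -> float:
    """Weighted sum of the three signals (weights are illustrative)."""
    return (w_kw * c.keyword_score
            + w_vec * c.vector_score
            + w_rec * recency_score(c.age_days))


def rank(candidates: list[Candidate], k: int = 5) -> list[Candidate]:
    """Return the k best candidates by combined score, best first."""
    return sorted(candidates, key=hybrid_score, reverse=True)[:k]
```

An exact identifier match (high keyword score, low vector score) can still outrank a loose paraphrase here, which is the failure mode of pure vector search that the rule above is guarding against.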

When in doubt, look at what was retrieved

Before blaming the prompt, log the top-k chunks for the failing question. Nine times out of ten the model is doing its best with garbage input.
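
A small sketch of what that logging could look like. The chunk dictionary keys (`doc_id`, `heading`, `score`, `text`) and the logger name are assumptions for illustration; adapt them to whatever your retriever returns.

```python
import json
import logging

logger = logging.getLogger("retrieval.debug")


def log_retrieval(question: str, chunks: list[dict], k: int = 5) -> list[dict]:
    """Log a compact, human-readable view of the top-k retrieved chunks."""
    top = [
        {
            "rank": i + 1,
            "doc_id": c["doc_id"],
            "heading": c["heading"],
            "score": round(c["score"], 3),
            "preview": c["text"][:80],  # enough to spot footers and nav chrome
        }
        for i, c in enumerate(chunks[:k])
    ]
    logger.info("retrieval question=%r top_k=%s", question, json.dumps(top))
    return top
```

The short preview is usually all it takes to see that the top-ranked chunk is a cookie banner rather than the refund policy.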

2. Wrap every tool call in two safety nets

Letting an LLM hit your APIs is where the value lives. It is also where the lawsuits live. The non-negotiables:

Idempotency keys on every action. Models retry. They retry creatively. Without idempotency you get duplicate refunds, duplicate emails, duplicate tickets.
Dry-run mode during evals. Otherwise your nightly evals will refund your own customers. This is not hypothetical: we came close enough that we built the rail before anyone learned the hard way.

soliq/tools/refund.py

Python

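from soliq.tools import tool, require_human_approval
# RefundResult and `stripe` are assumed to be imported elsewhere; `stripe` here
# looks like an internal payments wrapper. With the public stripe-python
# library, refunds are created through stripe.Refund.create(...).

@tool(idempotent=True, dry_run_capable=True)
def refund_order(order_id: str, amount: float, reason: str) -> RefundResult:
    """Refund an order. Amounts over $500 require human approval."""
    if amount > 500:
        return require_human_approval(
            action="refund_order",
            payload={"order_id": order_id, "amount": amount, "reason": reason},
        )
    return stripe.refunds.create(
        order=order_id,
        # Dollars to cents. round() avoids float truncation: int(19.99 * 100) is 1998.
        amount=round(amount * 100),
        metadata={"reason": reason, "agent": "tiassist"},
    )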

The cheapest action is the one that does not happen by accident.
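
For readers without a framework that provides these rails, the two safety nets can be sketched in plain Python. This is an assumption-laden illustration, not how `soliq.tools` works internally: `guarded`, `idempotency_key`, and the in-memory `_results` store are hypothetical names, and a real deployment would use a durable store.

```python
import hashlib
import json
from typing import Any, Callable

# In-memory stand-in for a durable idempotency store (assumption).
_results: dict[str, Any] = {}


def idempotency_key(action: str, payload: dict) -> str:
    """Same action and same payload give the same key, whatever the dict order."""
    canonical = json.dumps({"action": action, "payload": payload}, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


def guarded(action: str, fn: Callable[..., Any], *,
            dry_run: bool = False) -> Callable[..., Any]:
    """Wrap a side-effecting tool with idempotency and an optional dry-run mode."""
    def wrapper(**payload: Any) -> Any:
        key = idempotency_key(action, payload)
        if dry_run:
            # Describe what would happen without touching the real API.
            return {"dry_run": True, "action": action, "payload": payload, "key": key}
        if key in _results:
            # A retry of the same call returns the stored result instead of acting twice.
            return _results[key]
        result = fn(**payload)
        _results[key] = result
        return result
    return wrapper
```

Deriving the key from the payload alone would also collapse two legitimate identical refunds into one, so in practice the payload would likely include a conversation or turn identifier.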

3. Evals on real traffic, not invented scenarios

Synthetic eval sets are training wheels. The day you ship to production, sample real conversations, redact PII, and grade them. The point is not to grade the model; it is to grade the system: retrieval, prompt, tools, and guardrails together.
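
As a rough sketch of the sample-and-redact step: the two regular expressions below are illustrative only and nowhere near complete PII coverage, and hashing the conversation id to decide sample membership is one possible approach, assumed here so that the same conversation is always either in or out of the sample.

```python
import hashlib
import re

# Illustrative patterns only; real redaction needs far broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13 to 16 digits, optional separators


def redact(text: str) -> str:
    """Replace email addresses and card-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return CARD.sub("[CARD]", text)


def in_sample(conversation_id: str, rate: float) -> bool:
    """Deterministic sampling: map the id to [0, 1) and compare with the rate."""
    digest = hashlib.sha256(conversation_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate
```

Redacting before the conversation is written to the eval store, rather than at grading time, keeps raw PII out of the eval pipeline entirely.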

Our cadence:

Per-PR: a fast smoke set of about 50 cases. Catches regressions in minutes.
Nightly: a sampled, redacted slice of real traffic, around 500 conversations. Catches drift.
Weekly: a hostile set of jailbreaks, prompt injections, and edge-case inputs. Catches creativity.

A sample run config looks like this:

JSON

{
  "evalSet": "tiassist.nightly",
  "sampleRate": 0.05,
  "redactors": ["pii", "secrets", "internal-ids"],
  "graders": ["correctness", "tone", "citation-presence", "policy-compliance"],
  "passThreshold": 0.92,
  "diffAgainst": "last-7-day-rolling"
}

If the nightly drops below threshold, the next deploy is blocked automatically. No exceptions, no overrides without a written reason.
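
The blocking rule can be expressed as a small gate function. This is a sketch under assumptions: `GateDecision` and `deploy_gate` are hypothetical names, and the 0.92 default simply mirrors the `passThreshold` in the sample config.

```python
from dataclasses import dataclass


@dataclass
class GateDecision:
    allowed: bool
    reason: str


def deploy_gate(pass_rate: float, threshold: float = 0.92,
                override_reason: str = "") -> GateDecision:
    """Block the deploy when the nightly pass rate is below the threshold.

    An override is accepted only when it comes with a non-empty written reason.
    """
    if pass_rate >= threshold:
        return GateDecision(True, f"pass rate {pass_rate:.2%} meets {threshold:.2%}")
    if override_reason.strip():
        return GateDecision(True, f"override: {override_reason.strip()}")
    return GateDecision(False, f"pass rate {pass_rate:.2%} is below {threshold:.2%}")
```

Recording the reason string on every decision, including the passing ones, gives you an audit trail for each deploy.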

4. Handoffs are first-class, not an exception path

The most trusted bots are the ones that escalate cleanly. We bake the handoff into the agent loop. When confidence drops, the agent emits a structured event:

Bash

# from the agent runtime logs
$ soliq agent tail --feature tiassist --event handoff
2026-05-22T09:14:08  conv=c_9F2k  reason=low-citation-confidence  conf=0.41
2026-05-22T09:14:08  enriched with: last-3-turns, top-5-chunks, attempted-tools
2026-05-22T09:14:08  routed to: tier-2, queue=billing-escalations

The handoff carries the conversation, the retrieved citations, the tools the agent considered, and a one-paragraph summary of what it tried and why it stopped. A human picks up with full context and resolves faster than any cold transfer would allow.
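
One way to model that payload is a plain dataclass built whenever confidence falls below a threshold. The field names, the `maybe_handoff` helper, and the 0.5 threshold are assumptions for illustration; the limits of three turns and five chunks follow the log excerpt.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Handoff:
    conversation_id: str
    reason: str
    confidence: float
    last_turns: list[str]       # most recent turns, oldest first
    citations: list[str]        # top retrieved chunks, best first
    attempted_tools: list[str]  # tools the agent considered or tried
    summary: str                # what the agent tried and why it stopped


def maybe_handoff(conversation_id: str, confidence: float, turns: list[str],
                  citations: list[str], attempted_tools: list[str], summary: str,
                  threshold: float = 0.5) -> Optional[Handoff]:
    """Return a context-rich handoff event when confidence is below the threshold."""
    if confidence >= threshold:
        return None
    return Handoff(
        conversation_id=conversation_id,
        reason="low-citation-confidence",
        confidence=confidence,
        last_turns=turns[-3:],
        citations=citations[:5],
        attempted_tools=attempted_tools,
        summary=summary,
    )
```

Because the event is structured, the routing layer can pick a queue from `reason` and `attempted_tools` without parsing free text.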

5. Cost ceilings beat cost optimization

Optimizing model cost before you have shipped is premature. Capping it is not.

Every Soliq-routed call carries a hard ceiling: per-conversation, per-tenant, per-day. When the ceiling is hit, the agent degrades gracefully (shorter context window, cheaper model, eventual handoff) instead of running up an overnight bill because someone pasted a 200-page PDF into chat.
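
A minimal sketch of a degradation ladder driven by the tightest budget. The `Budget` class, the mode names, and the 25 percent and 10 percent cut-offs are assumptions chosen for illustration, not Soliq's actual policy.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    limit_usd: float
    spent_usd: float = 0.0

    def remaining_fraction(self) -> float:
        """Share of the budget still available, clamped at zero."""
        return max(0.0, 1.0 - self.spent_usd / self.limit_usd)


def choose_mode(budgets: list[Budget]) -> str:
    """Pick agent behavior from the tightest of the applicable budgets.

    Pass the per-conversation, per-tenant, and per-day budgets together;
    whichever is closest to its ceiling decides.
    """
    tightest = min(b.remaining_fraction() for b in budgets)
    if tightest <= 0.0:
        return "handoff"        # ceiling hit: stop spending, escalate
    if tightest < 0.10:
        return "cheaper-model"
    if tightest < 0.25:
        return "short-context"
    return "full"
```

Checking the budget before each model call, rather than after, is what turns the ceiling into a hard limit instead of a report.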

The bottom line

The model is becoming a commodity. The advantage lives one layer up:

Retrieval: boring, unglamorous, decisive.
Tool rails: idempotency, dry-run, ceilings.
Evals: on real traffic, with redaction, blocking deploys.
Handoffs: first-class, context-rich, fast.

Get those right and your LLM feature will still be running the Monday after the launch press release. If any of this is useful and you would like to compare notes, we are easy to reach: connect@techliphant.com.
