The demo always works. The Monday-morning rollout is where reality lives: the five percent of conversations where the model invents an answer, the two percent where the action lands on the wrong account, the rare conversation that turns into an incident. After shipping a few dozen LLM features into real production environments, the pattern is consistent: the model is never the bottleneck.
This is the playbook we run on every LLM build at Techliphant. None of it is glamorous. All of it is load-bearing.
1. Retrieval is the product
Before you tune a prompt, tune retrieval. We have watched teams spend three weeks rewording system prompts when the actual fix was: stop embedding the page footer.
Our default ingest pipeline strips boilerplate, splits on semantic boundaries, and stores citations alongside the embedding so the model can prove its work:

```typescript
import { embed, stripBoilerplate, splitBySemanticBoundaries } from '@soliq/retrieval';

export async function ingest(doc: SourceDoc): Promise<Chunk[]> {
  const cleaned = stripBoilerplate(doc.body);
  const sections = splitBySemanticBoundaries(cleaned);
  return Promise.all(
    sections.map(async (s) => ({
      id: hash(`${doc.id}:${s.heading}`),
      docId: doc.id,
      heading: s.heading,
      text: s.text,
      embedding: await embed(s.text),
      tenantId: doc.tenantId, // row-level ACL
    })),
  );
}
```

A few rules we never bend:
- Strip boilerplate first. Footers, nav chrome, signatures: they poison both vector and keyword ranking.
- Split on semantics, never on fixed token counts. Section breaks beat a 512-token guillotine every time.
- Hybrid ranking: keyword, vector, and recency, scored together. Pure vector loses on identifiers; pure keyword loses on paraphrase.
- Carry citations. Every chunk knows its source; every answer can be audited.
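To make the hybrid-ranking rule concrete, here is a minimal sketch of blending the three signals into one score. The weights, the 90-day half-life, and the `Candidate` fields are illustrative assumptions, not values from a production system, and both input scores are assumed to be normalized to the 0..1 range already.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    chunk_id: str
    keyword_score: float  # e.g. a BM25 score, assumed normalized to 0..1
    vector_score: float   # e.g. cosine similarity, assumed normalized to 0..1
    age_days: float       # days since the source document was last updated


def hybrid_score(
    c: Candidate,
    w_keyword: float = 0.4,   # hypothetical weights; tune against your own evals
    w_vector: float = 0.4,
    w_recency: float = 0.2,
    half_life_days: float = 90.0,
) -> float:
    """Blend keyword, vector, and recency signals into a single score."""
    # Exponential decay: a chunk loses half its recency credit every half-life.
    recency = math.exp(-math.log(2) * c.age_days / half_life_days)
    return w_keyword * c.keyword_score + w_vector * c.vector_score + w_recency * recency


def rank(candidates: list[Candidate], top_k: int = 5) -> list[Candidate]:
    """Return the top_k candidates, best first."""
    return sorted(candidates, key=hybrid_score, reverse=True)[:top_k]
```

Scoring the signals together, rather than filtering by one and re-ranking by another, is what lets an exact identifier match and a good paraphrase match both survive into the top-k.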
When in doubt, look at what was retrieved
Before blaming the prompt, log the top-k chunks for the failing question. Nine times out of ten the model is doing its best with garbage input.
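A sketch of what that logging might look like, assuming each retrieved chunk is a dict with `doc_id`, `heading`, `score`, and `text` keys (hypothetical field names; adapt them to whatever your retriever returns). It emits one JSON line per question so failing cases are easy to grep.

```python
import json
import logging

logger = logging.getLogger("retrieval.debug")


def format_retrieval_log(question: str, chunks: list[dict], preview_chars: int = 80) -> str:
    """Build one JSON line describing what retrieval returned for a question."""
    record = {
        "question": question,
        "chunks": [
            {
                "rank": i,
                "doc_id": c["doc_id"],
                "heading": c["heading"],
                "score": round(c["score"], 3),
                # A short preview is usually enough to spot footers and nav chrome.
                "preview": c["text"][:preview_chars],
            }
            for i, c in enumerate(chunks, start=1)
        ],
    }
    return json.dumps(record)


def log_retrieval(question: str, chunks: list[dict]) -> None:
    """Log the top-k chunks for a question at INFO level."""
    logger.info(format_retrieval_log(question, chunks))
```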
2. Wrap every tool call in two safety nets
Letting an LLM hit your APIs is where the value lives. It is also where the lawsuits live. The non-negotiables:
- Idempotency keys on every action. Models retry. They retry creatively. Without idempotency you get duplicate refunds, duplicate emails, duplicate tickets.
- Dry-run mode during evals. Otherwise your nightly evals will refund your own customers. Not hypothetical: it came close enough that we built the rail before someone learned the hard way.

```python
from soliq.tools import tool, require_human_approval


@tool(idempotent=True, dry_run_capable=True)
def refund_order(order_id: str, amount: float, reason: str) -> RefundResult:
    """Refund an order. Amounts over $500 require human approval."""
    if amount > 500:
        return require_human_approval(
            action="refund_order",
            payload={"order_id": order_id, "amount": amount, "reason": reason},
        )
    return stripe.refunds.create(
        order=order_id,
        # Round before converting to cents: int() alone truncates float error
        # (19.99 * 100 evaluates to 1998.99..., which int() turns into 1998).
        amount=round(amount * 100),
        metadata={"reason": reason, "agent": "tiassist"},
    )
```

The cheapest action is the one that does not happen by accident.
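For readers without a tool framework that provides these rails, here is a rough sketch of how idempotency and dry-run can be layered around any side-effecting call. `ToolRunner` and `idempotency_key` are hypothetical names invented for this illustration, and the in-memory dict stands in for the durable store a real system would need.

```python
import hashlib
import json
from typing import Any, Callable


def idempotency_key(conversation_id: str, action: str, payload: dict[str, Any]) -> str:
    """Derive a stable key so a retried call maps to the same action."""
    # Sorted keys make the key independent of argument order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    raw = f"{conversation_id}:{action}:{canonical}".encode()
    return hashlib.sha256(raw).hexdigest()[:32]


class ToolRunner:
    """Runs side-effecting actions at most once per idempotency key."""

    def __init__(self, dry_run: bool = False) -> None:
        self.dry_run = dry_run
        self._results: dict[str, Any] = {}  # in-memory only; assumption for the sketch

    def run(
        self,
        conversation_id: str,
        action: str,
        payload: dict[str, Any],
        execute: Callable[[dict[str, Any]], Any],
    ) -> Any:
        key = idempotency_key(conversation_id, action, payload)
        if key in self._results:
            # A retry returns the first result instead of repeating the side effect.
            return self._results[key]
        if self.dry_run:
            # Record what would have happened without calling the real API.
            result = {"dry_run": True, "action": action, "payload": payload}
        else:
            result = execute(payload)
        self._results[key] = result
        return result
```

One design trade-off to be aware of: keying on conversation, action, and payload means a deliberate second identical action in the same conversation is also deduplicated. Whether that is correct depends on the action.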
3. Evals on real traffic, not invented scenarios
Synthetic eval sets are training wheels. The day you ship to production, sample real conversations, redact PII, and grade them. The point is not to grade the model; it is to grade the system: retrieval, prompt, tools, and guardrails together.
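As a deliberately small illustration of the redaction step, the sketch below masks email addresses and card-like digit runs with regular expressions. The patterns and placeholder tokens are assumptions for the example; regexes alone miss names, addresses, and free-form identifiers, so treat this as a first pass rather than a complete redactor.

```python
import re

# Simple patterns for illustration only; real PII coverage needs more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def redact(text: str) -> str:
    """Replace emails and card-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = CARD_RE.sub("[CARD]", text)
    return text
```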
Our cadence:
- Per-PR: a fast smoke set of about 50 cases. Catches regressions in minutes.
- Nightly: a sampled, redacted slice of real traffic, around 500 conversations. Catches drift.
- Weekly: a hostile set of jailbreaks, prompt injection, and edge-case inputs. Catches creativity.
A sample run config looks like this:

```json
{
  "evalSet": "tiassist.nightly",
  "sampleRate": 0.05,
  "redactors": ["pii", "secrets", "internal-ids"],
  "graders": ["correctness", "tone", "citation-presence", "policy-compliance"],
  "passThreshold": 0.92,
  "diffAgainst": "last-7-day-rolling"
}
```

If the nightly drops below threshold, the next deploy is blocked automatically. No exceptions, no overrides without a written reason.
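The gate itself can be very small. The sketch below assumes a conversation counts as passing only when every grader passes, and that the threshold applies to the share of passing conversations. The config does not spell out either rule, so both are assumptions, as are the `GradedConversation` and `deploy_allowed` names.

```python
from dataclasses import dataclass


@dataclass
class GradedConversation:
    conversation_id: str
    grades: dict[str, bool]  # grader name -> passed


def pass_rate(results: list[GradedConversation]) -> float:
    """Share of conversations that passed every grader."""
    if not results:
        return 0.0  # an empty run is treated as a failure, not a free pass
    passed = sum(1 for r in results if all(r.grades.values()))
    return passed / len(results)


def deploy_allowed(results: list[GradedConversation], pass_threshold: float = 0.92) -> bool:
    """Return True only when the nightly pass rate meets the threshold."""
    return pass_rate(results) >= pass_threshold
```

Treating an empty run as a failure matters more than it looks: a broken sampler that returns zero conversations should block the deploy, not wave it through.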
4. Handoffs are first-class, not an exception path
The most trusted bots are the ones that escalate cleanly. We bake the handoff into the agent loop: when confidence drops, the agent emits a structured event:

```shell
# from the agent runtime logs
$ soliq agent tail --feature tiassist --event handoff
2026-05-22T09:14:08 conv=c_9F2k reason=low-citation-confidence conf=0.41
2026-05-22T09:14:08 enriched with: last-3-turns, top-5-chunks, attempted-tools
2026-05-22T09:14:08 routed to: tier-2, queue=billing-escalations
```

The handoff carries the conversation, the retrieved citations, the tools the agent considered, and a one-paragraph summary of what it tried and why it stopped. A human picks up with full context and resolves faster than any cold transfer would allow.
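One way to model such an event is sketched below. The field names mirror the log lines above, but the `HandoffEvent` class, the `maybe_handoff` helper, and the 0.5 confidence threshold are assumptions for illustration; the log only shows that a confidence of 0.41 triggered a handoff.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class HandoffEvent:
    conversation_id: str
    reason: str
    confidence: float
    last_turns: list[str]       # most recent turns, for human context
    citations: list[str]        # IDs of the top retrieved chunks
    attempted_tools: list[str]  # tools the agent considered or tried
    summary: str                # what the agent tried and why it stopped

    def to_json(self) -> str:
        return json.dumps(asdict(self))


def maybe_handoff(
    conversation_id: str,
    confidence: float,
    turns: list[str],
    citations: list[str],
    attempted_tools: list[str],
    summary: str,
    reason: str = "low-confidence",
    threshold: float = 0.5,  # hypothetical cutoff
) -> Optional[HandoffEvent]:
    """Return a handoff event when confidence falls below the threshold."""
    if confidence >= threshold:
        return None
    return HandoffEvent(
        conversation_id=conversation_id,
        reason=reason,
        confidence=confidence,
        last_turns=turns[-3:],    # last three turns
        citations=citations[:5],  # top five chunks
        attempted_tools=attempted_tools,
        summary=summary,
    )
```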
5. Cost ceilings beat cost optimization
Optimizing model cost before you have shipped is premature. Capping it is not.
Every Soliq-routed call carries a hard ceiling: per-conversation, per-tenant, per-day. When the ceiling is hit, the agent degrades gracefully (shorter context window, cheaper model, eventual handoff) instead of running up an overnight bill because someone pasted a 200-page PDF into chat.
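A simplified sketch of that degradation logic follows. It checks every budget and lets the tightest one decide the mode. The `Mode` names, the `Budget` shape, and the 20 percent soft floor that triggers the cheaper path before the hard ceiling are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    FULL = "full"
    REDUCED = "reduced"  # shorter context window, cheaper model
    HANDOFF = "handoff"  # ceiling hit: stop spending and escalate


@dataclass
class Budget:
    limit_usd: float
    spent_usd: float = 0.0

    def remaining_fraction(self) -> float:
        """Fraction of the budget still available, clamped at zero."""
        if self.limit_usd <= 0:
            return 0.0
        return max(0.0, 1.0 - self.spent_usd / self.limit_usd)


def choose_mode(budgets: list[Budget], soft_floor: float = 0.2) -> Mode:
    """Pick the most restrictive mode implied by any budget."""
    # With no budgets configured, nothing constrains the call.
    tightest = min((b.remaining_fraction() for b in budgets), default=1.0)
    if tightest <= 0.0:
        return Mode.HANDOFF
    if tightest <= soft_floor:
        return Mode.REDUCED
    return Mode.FULL
```

Called before each model request with the conversation, tenant, and daily budgets, this keeps the decision in one place, so a single exhausted budget is enough to stop the spend.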
The bottom line
The model is becoming a commodity. The advantage lives one layer up:
- Retrieval: boring, unglamorous, decisive.
- Tool rails: idempotency, dry-run, ceilings.
- Evals: on real traffic, with redaction, blocking deploys.
- Handoffs: first-class, context-rich, fast.
Get those right and your LLM feature will still be running the Monday after the launch press release. If any of this is useful and you would like to compare notes, we are easy to reach: connect@techliphant.com.
