
The AI Product Stability Stack: What to Wire Up Before You Move Fast

Speed is cheap now. With modern LLM APIs and agent frameworks, a working prototype is days of work, not months. The thing that kills AI products isn't slow engineering — it's shipping fast into a void where you can't see what's breaking.

I've built and shipped LLM-powered systems across banking, travel, and fleet management. The pattern I keep returning to isn't about slowing down. It's about wiring up a thin stability stack before you let real users touch the system, so you can keep moving fast without flying blind.

Here's what that stack actually looks like.

Trace Every LLM Call From Day One

The single biggest mistake I see teams make is treating LLM calls like black boxes. You send a prompt, you get a response, you move on. That works fine until something goes wrong — and then you have no idea what prompt produced what output, under what context, with what latency.

Before anything else, I instrument every LLM call with structured traces. At minimum:

  • Input tokens, output tokens, model version — so you can catch cost anomalies and model drift
  • Latency per call — p50 and p95, not just averages
  • Prompt template name + version — because prompts are code and they change
  • Session or request ID — so you can reconstruct a full agentic chain

Tools like LangSmith, Langfuse, or even a simple structured logger feeding your existing observability stack (Datadog, OpenTelemetry) get this done with very little overhead. Pick one and make it non-negotiable.

```python
import json
import logging
import time

logger = logging.getLogger(__name__)

def traced_llm_call(client, messages, model, template_name, session_id):
    """Call the chat completions API and log one structured trace record."""
    start = time.perf_counter()
    response = client.chat.completions.create(model=model, messages=messages)
    latency_ms = (time.perf_counter() - start) * 1000

    # Serialize to JSON so log pipelines can parse fields instead of a dict repr.
    logger.info(json.dumps({
        "event": "llm_call",
        "session_id": session_id,
        "template": template_name,
        "model": model,
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens,
        "latency_ms": round(latency_ms, 2),
    }))
    return response
```

This is not over-engineering. This is the minimum viable floor.
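Per-call latency only becomes p50 and p95 once you aggregate it. Most observability tools do this for you, but as a rough sketch of the arithmetic, here is one way to do it with the standard library. The function name `latency_percentiles` is hypothetical, and the input is assumed to be a list of `latency_ms` values pulled from your trace logs.

```python
import statistics

def latency_percentiles(latencies_ms):
    """Return (p50, p95) for a list of per-call latencies in milliseconds."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns the 99 cut points p1..p99, so index 49 is p50
    # and index 94 is p95.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]

# Example: nineteen fast calls and one slow one.
samples = [200.0] * 19 + [4000.0]
p50, p95 = latency_percentiles(samples)
mean = statistics.mean(samples)
```

In that example the mean is pulled up by a single slow call while p50 stays at the typical latency, which is why percentiles are the more useful signal.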

Evals Are Not Optional — They're Your Test Suite

In traditional software, you write unit tests. In LLM products, you build evals. The difference is that LLM outputs are probabilistic, so you're not asserting exact equality — you're asserting quality properties.

For most production features, I start with three eval types:

  1. Format checks — does the output match the expected schema? JSON that won't parse breaks downstream systems hard.
  2. Semantic correctness — does the output answer the actual question? Use an LLM-as-judge pattern here, but keep the judge prompt tight and versioned.
  3. Regression tests — a golden set of input/output pairs that worked well. If a prompt change breaks them, you know before it goes live.

You don't need 500 test cases on day one. A curated set of 20–30 representative inputs covering your edge cases is enough to catch regressions. Grow it every time something breaks in production.
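A minimal sketch of what the format check and the golden-set regression run could look like. Everything here is illustrative: `GOLDEN_SET`, `check_format`, `run_evals`, and the `intent`/`reply` fields are hypothetical names, and `generate` stands in for whatever function wraps your prompt and model call. The LLM-as-judge step is left out because it depends on your provider.

```python
import json

# Hypothetical golden set: inputs that worked well, with the properties
# the output must keep satisfying.
GOLDEN_SET = [
    {"input": "Where is my refund?",
     "required_keys": ["intent", "reply"],
     "expected_intent": "refund_status"},
    {"input": "Cancel my booking",
     "required_keys": ["intent", "reply"],
     "expected_intent": "cancel_booking"},
]

def check_format(raw_output, required_keys):
    """Format check: return the parsed object, or None if it is malformed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(key not in data for key in required_keys):
        return None
    return data

def run_evals(generate, golden_set):
    """Run each golden case through `generate` and collect (input, reason) failures."""
    failures = []
    for case in golden_set:
        data = check_format(generate(case["input"]), case["required_keys"])
        if data is None:
            failures.append((case["input"], "format"))
        elif data.get("intent") != case["expected_intent"]:
            failures.append((case["input"], "regression"))
    return failures
```

Run something like this in CI on every prompt change and fail the build when the list of failures is not empty.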

The investment here is small. The cost of skipping it shows up when a prompt tweak you pushed on a Tuesday afternoon silently degrades the core user flow and you don't find out until users complain.

Guardrails at the Boundary, Not in the Prompt

A lot of teams try to handle every edge case inside the system prompt. That's the wrong layer. Prompts are soft — a determined input or an unexpected context will leak through. Guardrails belong at the I/O boundary of your application.

For output guardrails, at minimum I validate:

  • Structured outputs against a schema (Pydantic, JSON Schema) before they touch any downstream logic
  • Sensitive data patterns — PII, credentials, anything that shouldn't be echoed back in a response
  • Confidence or refusal signals — if the model hedges heavily or refuses, route that to a fallback or human escalation path rather than letting it degrade silently
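The three output checks above can be sketched as a single boundary function. This is a standard-library stand-in, not a recommendation over Pydantic or JSON Schema: the `reply` field, the regexes, and the refusal phrases are all assumptions chosen for illustration, and real PII detection needs far more than two patterns.

```python
import json
import re

# Illustrative patterns only; production PII detection needs a broader set.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")
# Hypothetical refusal phrases; tune these to what your model actually says.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to")

def guard_output(raw_output):
    """Return (status, payload); status is 'ok', 'invalid', 'sensitive' or 'refusal'."""
    # 1. Schema check before anything touches downstream logic.
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return "invalid", None
    if not isinstance(data, dict) or not isinstance(data.get("reply"), str):
        return "invalid", None
    reply = data["reply"]
    # 2. Block responses that echo sensitive-looking data.
    if EMAIL_RE.search(reply) or CARD_RE.search(reply):
        return "sensitive", None
    # 3. Surface refusals so the caller can route to a fallback or a human.
    if any(marker in reply.lower() for marker in REFUSAL_MARKERS):
        return "refusal", data
    return "ok", data
```

The caller branches on the status, so a refusal or a blocked response becomes an explicit code path rather than something the user discovers.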

For input guardrails, content classifiers (many model providers expose these as a cheap API call) handle obvious abuse patterns. Don't hand-roll this unless your use case is very specific.

Degrade Gracefully, Not Silently

LLM APIs go down. Rate limits get hit. Latency spikes. Your system needs to handle this without presenting the user with a broken experience and without your team finding out from a tweet.

Two things that cost almost nothing to implement and save you every time:

Circuit breakers: if a downstream LLM call fails three times in a row, stop hammering it and return a clean fallback response. Service mesh tooling and off-the-shelf resilience libraries often handle this for you.
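If you do want to see the mechanics, a circuit breaker is a small amount of state. The sketch below is a hypothetical, single-threaded `CircuitBreaker` class. The three-failure threshold comes from the text, while the 30-second cool-down is an assumed default. A production version would also need locking and metrics.

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so tests can control time
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None       # time the breaker opened, or None if closed

    def call(self, fn, fallback):
        """Run fn(); on failure, or while the breaker is open, return fallback()."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()   # open: do not touch the failing service
            self.opened_at = None   # cool-down elapsed: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0           # any success closes the breaker fully
        return result
```

Wrap the LLM call as `breaker.call(lambda: traced_call(), lambda: canned_response)`, where both names are placeholders for your own functions.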

Fallback chains — for critical paths, have a cheaper or cached fallback. If your primary model is GPT-4o and it's slow, can you fall back to a smaller model for a degraded but functional response? Often yes.
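A fallback chain can be as simple as an ordered list of callables. In this sketch, `call_with_fallbacks` and the provider names are hypothetical. Each provider function is assumed to take a prompt and either return text or raise.

```python
def call_with_fallbacks(prompt, providers):
    """Try each (name, fn) in order; return (name, result) from the first success."""
    errors = []
    for name, fn in providers:
        try:
            return name, fn(prompt)
        except Exception as exc:
            # Remember why this provider failed, then try the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the result lets you log when a degraded path served the request, so the fallback itself never becomes a silent failure.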

The point isn't perfection. It's that your system tells you when something is wrong and doesn't pretend to the user that everything is fine when it isn't.

Deployment Is the Boring Part — Governance Is the Slow Part

I want to be direct about where time actually goes when shipping AI products in the real world.

The engineering is fast. A solid stability stack like this adds maybe two to three days to a greenfield build. What slows teams down is:

  • Data access — getting the right data into context, with the right permissions, through the right pipelines
  • Compliance sign-off — especially in regulated sectors like banking or healthcare in the UAE and broader MENA region
  • Internal alignment — agreeing on what the product actually does and who owns edge cases

None of that is an engineering problem. Don't let it masquerade as one. The stability stack above can be in place before those conversations are finished. Wire it up early, then use the time you'd normally spend firefighting to push those blockers forward instead.

The Actual Principle

Shipping fast and shipping stably aren't in tension — they become opposed only when you skip the observability layer. With traces, evals, boundary guardrails, and graceful degradation in place, you can push changes confidently because you'll know immediately if something moves in the wrong direction.

That's not caution. That's how you stay fast at week twelve, not just week one.

