James Booth · Jun 10, 2026 · 4 min read

Why we push complexity into deterministic code

An AI step that is right 90% of the time sounds production-ready. Chain five of those steps and, if each one fails independently of the others, the workflow succeeds about 59% of the time. That is not an opinion about AI. It is arithmetic: 0.9 raised to the fifth power is roughly 0.59. Almost every "AI agent" demo you have seen ignores this math, and almost every stalled AI project we audit died of it.

59%: success rate of five chained 90% steps (0.9 to the fifth power)
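
The arithmetic is easy to check. The snippet below is a minimal sketch that assumes every step succeeds or fails independently; the function name `chain_success_rate` is ours for illustration and is not taken from any real system.

```python
import math


def chain_success_rate(step_rates):
    """Probability that every step in a chain succeeds.

    Assumes the steps fail independently of one another.
    """
    return math.prod(step_rates)


five_step_chain = chain_success_rate([0.9] * 5)
print(f"{five_step_chain:.1%}")  # prints 59.0%
```

Swap in your own per-step estimates to see how quickly a longer chain degrades.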

This post explains the architecture we use to beat that math. It is the same architecture running inside every system we ship, including the Breez outbound engine, and it is documented in plain text in every client repo. No secrets, just discipline.

The core conflict

Language models are probabilistic. Ask the same question twice and you can get two slightly different answers. Business operations are the opposite: same input, same output, every time, or someone gets a wrong invoice.

Most teams resolve this conflict by hoping. They put a model in the middle of a workflow, watch it behave in a demo, and ship. The model then does what probabilistic systems do, which is drift, and the failure shows up three steps downstream where nobody can trace it.

We resolve the conflict structurally instead. Anything that can be deterministic must be deterministic. The model only gets the work that genuinely requires judgment.

The three layers

Every system we build separates into three layers. We call it DOE: directives, orchestration, execution.

Layer 1: directives. Plain-text standard operating procedures, one file per workflow. Each one defines the goal, the inputs, the step-by-step process, the expected outputs, the known edge cases, and a learnings section that grows over time. A directive is readable by the client's team, not just by engineers, because plain language is the point.

Layer 2: orchestration. The model. It reads the directive, decides which tools to call and in what order, handles errors, and escalates when something falls outside the directive. This is the judgment layer, and it is deliberately thin.

Layer 3: execution. Deterministic scripts. They call the API, process the data, write the record, and return a structured result. There is no model inside an execution script. It works or it errors, loudly, and an error is a feature: it is the system telling you exactly which step failed instead of quietly producing something plausible and wrong.
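
To make the execution layer concrete, here is a hedged sketch of what one such script could look like. It is an illustration, not code from a client repo: `ExecutionError`, `StepResult`, `write_record`, the required fields, and the in-memory `store` are all hypothetical stand-ins.

```python
from dataclasses import dataclass


class ExecutionError(Exception):
    """Raised when a step cannot complete; names the step so the failure is traceable."""

    def __init__(self, step: str, reason: str):
        super().__init__(f"[{step}] {reason}")
        self.step = step
        self.reason = reason


@dataclass(frozen=True)
class StepResult:
    """Structured result handed back to the orchestration layer."""

    step: str
    data: dict


def write_record(record: dict, store: dict) -> StepResult:
    """Deterministic execution step: validate, write, return a structured result."""
    step = "write_record"
    # Hypothetical required fields; a real script would take these from its directive.
    for field_name in ("id", "email"):
        if not record.get(field_name):
            raise ExecutionError(step, f"missing required field: {field_name}")
    store[record["id"]] = record
    return StepResult(step=step, data={"id": record["id"], "written": True})
```

There is no model call anywhere in the function. It either returns a `StepResult` or raises an error that names the step that failed.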

Where judgment lives, where control lives

The dividing rule is simple. The model gets tasks where the right answer requires reading and weighing context: is this email reply genuine interest or a polite brush-off, what does this company's situation suggest they need, how should this message be phrased. Code gets everything else: moving data, deduplicating records, calling external services, tracking state, retrying failures, touching anything that involves money.

This is why the 59% math stops applying. The chain is no longer five probabilistic steps. It is one or two judgment calls surrounded by deterministic stages that either succeed or halt. Errors stop compounding because the stages that used to silently absorb them now refuse to run on bad input.
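
One way to picture a stage that refuses to run on bad input is a deterministic gate placed directly after the judgment call. The sketch below is an assumption-laden illustration: `classify` stands in for the model call, and the labels, the action names, and `StageHalted` are hypothetical.

```python
ALLOWED_LABELS = {"interested", "not_interested"}  # hypothetical label set

NEXT_ACTION = {
    "interested": "start_research",
    "not_interested": "close_thread",
}


class StageHalted(Exception):
    """Raised when a deterministic stage refuses to continue on bad input."""


def validate_label(label: str) -> str:
    """Deterministic gate: accept only known labels, halt on anything else."""
    cleaned = label.strip().lower()
    if cleaned not in ALLOWED_LABELS:
        raise StageHalted(f"classifier returned unexpected label: {label!r}")
    return cleaned


def run_pipeline(reply_text: str, classify) -> dict:
    """One judgment call (classify) followed by deterministic stages."""
    label = validate_label(classify(reply_text))
    return {"label": label, "next_action": NEXT_ACTION[label]}
```

If the model returns something outside the allowed set, the pipeline halts at the gate instead of passing a plausible-looking value downstream.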

Humans hold the gates

There is a third participant besides the model and the code: a person, placed exactly where a mistake would be expensive or irreversible.

In our own outbound pipeline, everything from reply detection through research, strategy, and deck assembly runs automatically. Then it stops. The finished message and deck land in Slack and wait for a human to review and send. The machine never negotiates with a live deal. The same pattern holds in every client build: automation runs to the edge of consequence, and a person owns the crossing.
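
The gate itself can be enforced in code rather than left to convention. This is a minimal sketch under stated assumptions: `Draft`, `Status`, `approve`, `send`, and the `transport` callable are hypothetical names, and the Slack review step described above is represented only by the `approve` call.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"


class NotApproved(Exception):
    """Raised when something tries to send a draft no human has approved."""


@dataclass
class Draft:
    recipient: str
    body: str
    status: Status = Status.PENDING
    reviewer: Optional[str] = None


def approve(draft: Draft, reviewer: str) -> None:
    """Record that a named person reviewed and approved the draft."""
    draft.status = Status.APPROVED
    draft.reviewer = reviewer


def send(draft: Draft, transport: Callable[[str, str], None]) -> None:
    """Send only approved drafts; automation alone cannot cross this line."""
    if draft.status is not Status.APPROVED:
        raise NotApproved(f"draft for {draft.recipient} is awaiting human review")
    transport(draft.recipient, draft.body)
```

The automated stages can create as many `Draft` objects as they like. Nothing leaves until a person calls `approve`.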

This is not a concession to nervous clients. It is what reliability engineering demands. The cheapest place to catch an error is the gate before it leaves the building.

The system gets stronger when it breaks

Deterministic scripts fail too. APIs change, rate limits appear, an edge case nobody predicted shows up. The difference is what happens next.

When a script errors, the fix follows a loop we call self-annealing: read the error, fix the script, test the fix, then write what was learned into the directive's learnings section. A real example from our research pipeline: scraped website content was blowing past token limits, so the script now truncates it, and the directive records the limit so no future build rediscovers it the hard way. Every failure becomes a permanent upgrade. A prompt cannot do that. A documented system can.
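
The truncation example can be sketched in a few lines. Everything here is an illustrative assumption: the 8,000-character cap is a made-up value standing in for the real limit (which this post does not state), character count is used as a rough proxy for tokens, and `record_learning` assumes the learnings section is the last section of a plain-text directive.

```python
MAX_CONTENT_CHARS = 8_000  # hypothetical cap; the real limit is not stated here


def truncate_content(text: str, max_chars: int = MAX_CONTENT_CHARS) -> str:
    """Cut scraped content to a fixed size so downstream calls stay under the limit."""
    return text[:max_chars]


def record_learning(directive_text: str, learning: str) -> str:
    """Append a learning to the directive, adding a Learnings heading if missing.

    Assumes the learnings section is the last section of the file.
    """
    lines = directive_text.rstrip("\n").split("\n")
    if "Learnings" not in (line.strip() for line in lines):
        lines.extend(["", "Learnings"])
    lines.append(f"- {learning}")
    return "\n".join(lines) + "\n"
```

The fix lands in the script and the reason for it lands in the directive, so the next person to touch the workflow sees both.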

The side effect clients care about most: because behavior lives in plain-text directives, the client's team can change it. At Breez, tone, targeting, and follow-up rules are text edits the team makes themselves, no engineering ticket required.

What to ask any vendor

If someone proposes an AI workflow for your business, three questions expose the architecture in about two minutes. Where exactly does the model make decisions, and what happens around those points? What happens when a step fails, specifically, and where would you look? And can my team read and change the system's operating rules without you?

Vague answers to those questions are the 59% math waiting to happen. The insights are free. If you want this level of engineering pointed at your operation, start with the free audit. The plan is yours to keep either way.