Chaotic Flows: LLMs in Low‑Tolerance Workflows
When you add an LLM to a workflow, you change both the upside and the downside. The net effect is easiest to express as a shift in the workflow's value:
$$ \Delta \text{Value} = \Delta \text{Benefit} - N \sum_i f_i \cdot \Delta p_i \cdot L_i - \Delta \text{Cost of Controls} $$
where the workflow runs $N$ times, step $i$ fires $f_i$ times per run, failures at that step cost $L_i$, and $\Delta p_i$ is the change in step $i$'s failure probability. The LLM changes the benefit term (richer interpretations, smoother CX), increases the failure probabilities $p_i$ (and adds failure modes that didn't exist before), and increases the cost of controls (validation layers, fallbacks, monitoring).
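As a purely illustrative calculation with made-up numbers: suppose the LLM adds \$0.30 of benefit per run over $N = 100{,}000$ runs, one refund step fires once per run ($f = 1$) with its failure probability up by $\Delta p = 0.5\%$ at a loss of \$40 per failure, and the added controls cost \$10{,}000:
$$ \Delta \text{Value} = 100{,}000 \cdot 0.30 \;-\; 100{,}000 \cdot 1 \cdot 0.005 \cdot 40 \;-\; 10{,}000 = 30{,}000 - 20{,}000 - 10{,}000 = 0 $$
The headline benefit is real, yet failure losses and the control tax can consume all of it; that is the tension the rest of this section unpacks.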
To see why the $\Delta p_i$ show up—and why they often dominate—it helps to treat a workflow as a finite-state machine. A deterministic workflow can be modeled as a Mealy machine.
$$ M = (S, \Sigma, \Lambda, \delta, \lambda, s_0) $$
where $S$ is the set of states (progress markers such as Awaiting Evidence, Approved, Committed), $\Sigma$ is the input alphabet of events the system accepts (e.g., claim_received, payment_confirmed), and $\Lambda$ is the output alphabet of actions (e.g., issue_refund, send_notice). The transition function $\delta: S \times \Sigma \to S$ determines the next state, the output function $\lambda: S \times \Sigma \to \Lambda$ selects the action emitted on each transition, and $s_0$ is the initial state. Commit edges are transitions with irreversible real-world effects.
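As a minimal sketch (the states, events, and actions below are invented for illustration), the deterministic core is just a lookup table from $(s, \sigma)$ to a next state and an action, with commit edges flagged explicitly:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transition:
    next_state: str   # delta(s, sigma)
    action: str       # lambda(s, sigma)
    commit: bool      # irreversible real-world effect?

# Hypothetical claims workflow: S, Sigma, and Lambda are implied by the table's keys/values.
TRANSITIONS = {
    ("AwaitingEvidence", "claim_received"):    Transition("UnderReview", "send_notice",  commit=False),
    ("UnderReview",      "claim_approved"):    Transition("Approved",    "issue_refund", commit=True),
    ("UnderReview",      "claim_denied"):      Transition("Committed",   "send_notice",  commit=False),
    ("Approved",         "payment_confirmed"): Transition("Committed",   "send_notice",  commit=False),
}

def step(state: str, event: str) -> Transition:
    """Deterministic Mealy step: exactly one (next_state, action) per (s, sigma)."""
    key = (state, event)
    if key not in TRANSITIONS:
        raise ValueError(f"undefined transition from {state!r} on {event!r}")
    return TRANSITIONS[key]

# Example run
t = step("UnderReview", "claim_approved")
print(t.next_state, t.action, t.commit)  # Approved issue_refund True
```

Everything about this table can be tested exhaustively; the trouble starts when its inputs, routing, or outputs are produced by a model.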
When we introduce a large language model into a deterministic workflow, we fundamentally change the nature of the system. The FSM, which was previously predictable and verifiable at every step, now contains stochastic components. Let me illustrate the potential set of issues with three kinds of failures:
- Input. Traditional workflows receive structured inputs with clear semantics, e.g., selecting claim_reason from a fixed set via a dropdown menu. With natural language inputs, the model must map text $x$ to a symbol $\sigma \in \Sigma$. Consider a customer saying "my package arrived late and damaged." The model must map this to the appropriate system symbols, perhaps claim_damaged, claim_late, and item=A. But natural language is inherently ambiguous. Does "late" mean past the promised delivery date or later than the customer expected? Does "damaged" map to claim_damaged (70% confidence) or claim_defective (30%)? If the LLM maps a complaint that an item arrived "used" to claim_defective instead of claim_damaged, the entire workflow routes to technical support instead of replacements. Worse, adversarial users may exploit this ambiguity (a minimal confidence-gated mapping is sketched after this list).
- Control Flow. Deterministic routing is a crisp function $(s,\sigma) \mapsto s'$: given a state and input, there is exactly one next state. When an LLM controls routing, it becomes probabilistic. From the same $(s,\sigma)$, execution can diverge based on how the model interprets context and intent, and the model may route in ways that violate your system's assumptions. For instance, an LLM may receive a damage claim and simultaneously route it to the replacement team, quality assurance, and executive escalation because it "sensed urgency" in the customer's tone. It might retry failed operations repeatedly with minor variations.
- Output. The FSM's output function $\lambda(s, \sigma)$ produces a symbol from the output alphabet, perhaps refund_approved or claim_denied. But the LLM must verbalize this in natural language, and in doing so it can introduce promises the system never made or reveal information it shouldn't. An LLM told to "send a refund confirmation" might compose an email promising the refund will arrive tomorrow (when processing takes 3-5 days), include a 20% discount code (unauthorized), and apologize on behalf of the CEO (inappropriate).
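One common containment for the input failure mode, sketched below with a placeholder classifier, symbol set, and threshold: force the model's interpretation into the fixed alphabet $\Sigma$ and route anything outside it, or below a confidence threshold, to clarification or human review instead of into the FSM.

```python
from typing import Callable, Optional, Tuple

ALLOWED_SYMBOLS = {"claim_damaged", "claim_defective", "claim_late"}
CONFIDENCE_THRESHOLD = 0.85  # placeholder; tune on labeled transcripts

def map_text_to_symbol(text: str,
                       classify: Callable[[str], Tuple[str, float]]) -> Optional[str]:
    """Map free text to a symbol in Sigma, or return None to trigger a
    clarifying question / human review instead of a state transition.
    `classify` is a stand-in for whatever model call you use; it is
    assumed to return (symbol, confidence)."""
    symbol, confidence = classify(text)
    if symbol not in ALLOWED_SYMBOLS:
        return None  # the model proposed something outside Sigma
    if confidence < CONFIDENCE_THRESHOLD:
        return None  # ambiguous ("late"? "damaged"? "used"?) -- ask, don't guess
    return symbol

# Example with a stubbed, low-confidence classifier
stub = lambda text: ("claim_damaged", 0.70)
print(map_text_to_symbol("my package arrived late and damaged", stub))  # None -> clarify
```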
Anti-Patterns
These points lead to two common anti-patterns. The first is treating LLM guardrails as guarantees. We take a guardrail to mean something that prevents bad things from happening, but LLM guardrails are probabilistic and depend on distributional assumptions: they lower the rate of bad outcomes without eliminating them. The second is treating schema validation as correctness: getting output in the right format is not the same as getting the right information, as the sketch below illustrates.
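A small illustration of the second anti-pattern (field names and numbers are hypothetical): the payload below passes a structural check but fails the business rule that actually matters.

```python
def schema_valid(payload: dict) -> bool:
    """Structural check: right keys, right types, allowed enum values."""
    return (isinstance(payload.get("claim_id"), str)
            and isinstance(payload.get("refund_amount"), (int, float))
            and payload.get("decision") in {"refund_approved", "claim_denied"})

def semantically_valid(payload: dict, order_total: float) -> bool:
    """Business check: the *information* must also be right."""
    return payload["refund_amount"] <= order_total

payload = {"claim_id": "C-123", "decision": "refund_approved", "refund_amount": 250.0}
print(schema_valid(payload))              # True  -- well-formed output
print(semantically_valid(payload, 80.0))  # False -- refund exceeds what was paid
```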
Making LLMs Less Risky in an FSM
The goal is not to make the LLM deterministic—that is impossible—but to manage its impact. Effective patterns include:
- Keep stochastic components off commit edges. Use a two-phase commit with a verification step, typically human-in-the-loop (HITL), before expensive actions like issuing a refund.
- Don't use LLMs where you don't need them. For instance, an LLM may invent new reasons or promises when explaining a decision. Instead, render reasons from the decision artifact (facts, rules fired) using a template: $\lambda'(s, \sigma) = \text{Template}(\text{facts}, \text{rules})$.
- Blast radius constraints and spending limits. Add invariants like $\phi$ (e.g., "total refunded ≤ total paid") with a runtime checker. If $\phi$ would be violated, block the transition regardless of what the agent proposes (a minimal checker is sketched after this list).
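A minimal sketch of such a runtime checker, with an illustrative ledger and guard: $\phi$ is evaluated against authoritative records immediately before the commit edge, independent of whatever the agent proposed.

```python
from dataclasses import dataclass

@dataclass
class Ledger:
    total_paid: float
    total_refunded: float

def phi_holds(ledger: Ledger, proposed_refund: float) -> bool:
    """phi: total refunded (including this proposal) <= total paid."""
    return ledger.total_refunded + proposed_refund <= ledger.total_paid

def commit_refund(ledger: Ledger, proposed_refund: float) -> bool:
    """Gate on the commit edge: the checker, not the agent, has the final say."""
    if not phi_holds(ledger, proposed_refund):
        return False  # block the transition and escalate; no real-world mutation
    ledger.total_refunded += proposed_refund  # the irreversible step would go here
    return True

ledger = Ledger(total_paid=120.0, total_refunded=100.0)
print(commit_refund(ledger, 50.0))  # False -- would breach phi, blocked
print(commit_refund(ledger, 20.0))  # True  -- within the spending limit
```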
- Monitoring, Rollback, Release
- Measure money. Measure refund leakage, inappropriate denials, out-of-policy approvals, avoidable escalations, etc.
- Kill-switches and selective disablement. You need the ability to turn off specific model behaviors without shutting down the entire system. That means you can disable a tool call, remove a branch, bypass the model on certain states, or route an entire class of cases to deterministic logic or HITL.
- Controlled, intelligent release. LLM-driven components should never go from zero to 100% of traffic. Release changes gradually: start with low-stakes states, narrow slices of traffic, or non-financial actions. Expand only after observing the dollar impact and analyzing the data (a minimal flag-gated router is sketched after this list).
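A sketch of flag-gated routing (capability names, percentages, and the bucketing scheme are placeholders): every LLM-driven branch sits behind a switch and a rollout percentage, so it can be dialed down or bypassed to the deterministic/HITL path without a redeploy.

```python
import hashlib

# Hypothetical per-capability controls; in practice these live in a config service.
FLAGS = {
    "llm_route_claims":   {"enabled": True,  "rollout_pct": 10},  # narrow slice of traffic
    "llm_draft_emails":   {"enabled": True,  "rollout_pct": 50},
    "llm_approve_refund": {"enabled": False, "rollout_pct": 0},   # kill-switched
}

def use_llm(capability: str, case_id: str) -> bool:
    """Decide whether this case takes the LLM path or the deterministic fallback."""
    flag = FLAGS.get(capability)
    if not flag or not flag["enabled"]:
        return False
    # Stable bucketing: the same case always lands in the same bucket.
    bucket = int(hashlib.sha256(case_id.encode()).hexdigest(), 16) % 100
    return bucket < flag["rollout_pct"]

print(use_llm("llm_route_claims", "case-001"))    # True only if this case falls in the 10% slice
print(use_llm("llm_approve_refund", "case-001"))  # always False: kill-switched -> deterministic/HITL
```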
Task Granularity: The Goldilocks Problem
Breaking a workflow into LLM tasks involves a fundamental tradeoff. Large tasks cause reasoning failures. Small tasks cause orchestration failures. There's an optimal middle ground—but finding it is more subtle than it first appears.
For total work $L$ divided into steps of size $s$, we get $n = L/s$ steps. The simplest model assumes per-step error has two components:
$$ e(s) = c \cdot s^{\alpha} + \varepsilon_0, \quad \alpha > 1 $$
Where:
- $c \cdot s^{\alpha}$: complexity-driven error that grows superlinearly with task size
- $\varepsilon_0$: fixed overhead error per step (parsing, tool calls, handoffs)
Total error across the workflow:
$$ \text{Total Error} = \frac{L}{s} \cdot (c \cdot s^{\alpha} + \varepsilon_0) = L \left( c \cdot s^{\alpha - 1} + \frac{\varepsilon_0}{s} \right) $$
To minimize total error, we differentiate with respect to $s$ and set to zero:
$$ \frac{d}{ds}\left[ c \cdot s^{\alpha - 1} + \frac{\varepsilon_0}{s} \right] = c(\alpha - 1) s^{\alpha - 2} - \frac{\varepsilon_0}{s^2} = 0 $$
Solving yields:
$$ s_{\text{simple}}^* = \left( \frac{\varepsilon_0}{c(\alpha - 1)} \right)^{1/\alpha} $$
The formula suggests:
- Better models (lower $c$) → larger optimal steps
- Better tooling (lower $\varepsilon_0$) → smaller optimal steps viable
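A quick numeric check of the closed form, using made-up parameters: with quadratic complexity growth ($\alpha = 2$), $c = 0.01$, and $\varepsilon_0 = 0.04$, the optimum lands at $s^* = 2$ work units, and step sizes well above or below it do measurably worse.

```python
def simple_optimum(c: float, alpha: float, eps0: float) -> float:
    """s* = (eps0 / (c * (alpha - 1)))**(1/alpha) from the simple model."""
    return (eps0 / (c * (alpha - 1))) ** (1.0 / alpha)

def total_error(L: float, s: float, c: float, alpha: float, eps0: float) -> float:
    """Total error = L * (c * s**(alpha - 1) + eps0 / s)."""
    return L * (c * s ** (alpha - 1) + eps0 / s)

# Made-up parameters: quadratic complexity growth, modest per-step overhead.
c, alpha, eps0, L = 0.01, 2.0, 0.04, 100.0
s_star = simple_optimum(c, alpha, eps0)
print(s_star)                                  # 2.0
print(total_error(L, s_star, c, alpha, eps0))  # 4.0 at the optimum
print(total_error(L, 0.5, c, alpha, eps0),
      total_error(L, 8.0, c, alpha, eps0))     # both worse (8.5 each)
```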
The simple model assumes the overhead $\varepsilon_0$ is fixed regardless of what's being passed between steps. In practice, each handoff must serialize and re-establish context, and lost context compounds errors downstream. A more accurate model accounts for context-dependent overhead:
$$ e(s) = c \cdot s^{\alpha} + \varepsilon_{\text{fixed}} + \varepsilon_{\text{context}} \cdot s^{\beta}, \quad 0 < \beta < 1 $$
Where:
- $\varepsilon_{\text{fixed}}$: truly fixed overhead (API calls, parsing)
- $\varepsilon_{\text{context}} \cdot s^{\beta}$: context serialization overhead that grows with complexity
The choice $\beta < 1$ reflects that context overhead grows slower than reasoning complexity—passing complex context is hard, but not as hard as reasoning about it.
Total error becomes:
$$ \text{Total Error} = L \left( c \cdot s^{\alpha - 1} + \frac{\varepsilon_{\text{fixed}}}{s} + \varepsilon_{\text{context}} \cdot s^{\beta - 1} \right) $$
Taking the derivative and setting to zero:
$$ c(\alpha - 1) s^{\alpha - 2} - \frac{\varepsilon_{\text{fixed}}}{s^2} + \varepsilon_{\text{context}}(\beta - 1) s^{\beta - 2} = 0 $$
Multiplying by $s^2$ gives the optimality condition:
$$ \boxed{c(\alpha - 1) s^{\alpha} + \varepsilon_{\text{context}}(\beta - 1) s^{\beta} = \varepsilon_{\text{fixed}}} $$
Since $\beta < 1$, the term $\varepsilon_{\text{context}}(\beta - 1) s^{\beta}$ is negative. At the simple model's optimum $s_{\text{simple}}^*$, where $c(\alpha - 1) s^{\alpha} = \varepsilon_{\text{fixed}}$, the left side therefore falls short of $\varepsilon_{\text{fixed}}$. Because $\alpha > 1 > \beta$, the $s^{\alpha}$ term dominates and the left side grows with $s$ in the relevant range, so restoring equality requires a larger step size:
$$ s^* > s_{\text{simple}}^* $$
Context overhead pushes the optimal step size larger than the simple model suggests. Unlike the simple model, this optimality condition has no closed-form solution; in practice you solve it numerically, with parameters calibrated from observed error rates.
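A minimal sketch of that numerical solve, reusing the illustrative (uncalibrated) parameters from above and SciPy's bracketing root finder on the boxed condition:

```python
from scipy.optimize import brentq

# Illustrative, uncalibrated parameters.
c, alpha = 0.01, 2.0
eps_fixed, eps_context, beta = 0.04, 0.02, 0.5

def optimality_gap(s: float) -> float:
    """Left side minus right side of the boxed condition; its root is the optimal s."""
    return c * (alpha - 1) * s**alpha + eps_context * (beta - 1) * s**beta - eps_fixed

s_simple = (eps_fixed / (c * (alpha - 1))) ** (1.0 / alpha)  # closed form: 2.0
s_star = brentq(optimality_gap, s_simple, 100.0)             # root lies above s_simple
print(s_simple, s_star)  # 2.0 vs roughly 2.35: the optimum moves up, as argued above
```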
Other Operational Realities
- Models drift and change. Temperature $0$ does not guarantee determinism, and vendor updates can shift behavior. Track versioned prompts and models; use canaries and rollbacks.
- Errors hide in plain sight. Long stretches of apparent success conceal catastrophic outliers. Track $\sum_i p_i L_i$ and tail metrics (e.g., $\text{CVaR}_\alpha$) at the business level (a minimal tail-metric computation is sketched after this list).
- Adversaries adapt. People learn "magic phrases" that bypass heuristics or trigger generous interpretations. Assume red‑teaming is continuous.
- The integration tax is real. Monitoring, fallbacks, rate limits, and escape hatches add complexity. Budget for it explicitly.
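As referenced above, a minimal tail-metric computation on synthetic per-case losses: empirical $\text{CVaR}_\alpha$ is the mean loss in the worst $(1-\alpha)$ tail, exactly the quantity that a healthy-looking average conceals.

```python
import numpy as np

def cvar(losses: np.ndarray, alpha: float = 0.95) -> float:
    """Empirical CVaR_alpha: mean of losses at or above the alpha-quantile (VaR)."""
    var = np.quantile(losses, alpha)
    return float(losses[losses >= var].mean())

# Synthetic per-case dollar losses: mostly zero, a few catastrophic outliers.
rng = np.random.default_rng(0)
losses = np.where(rng.random(10_000) < 0.002, rng.uniform(500, 5_000, 10_000), 0.0)

print(losses.mean())        # small average loss -- looks fine
print(cvar(losses, 0.999))  # mean loss in the worst 0.1% of cases -- far larger
```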
Engineering with Uncertainty
Moving to LLMs shifts the entire validation strategy. Deterministic components can be proven safe with unit tests: given the same state and input, they behave the same way every time. LLMs don’t offer that. They inject variance, and variance doesn’t break in predictable places. So correctness becomes a matter of risk budgeting rather than test coverage. Keep the model away from high-impact actions, let deterministic checks own the final say on anything expensive, and monitor behavior as if it were a live system rather than a piece of code. In the end, what matters is not how often the model parses JSON correctly but how much value leaks through policy violations, how much friction customers experience, and whether the workflow stays inside the tolerances the business can afford.