Testing LLMs: Engineering Confidence Without Certainty

Software engineers have long leaned on determinism for confidence. Given a function and a specification, we wrote unit tests, fixed the edge cases those tests revealed, and expected tomorrow to look like today. That was never fully true. Classical systems also depend on assumptions about their environment. A ranking function such as BM25 can drift as content and user behavior change. Heuristics degrade when traffic mixes evolve. Data pipelines wobble when upstream schemas or partner APIs shift. The old playbook worked best when the world stayed close to the distribution we implicitly assumed.

Large language model applications surface the same fragility and add two structural challenges. First, non-determinism: the same input can yield different outputs. Second, unbounded inputs: natural language opens an effectively infinite input space, across tasks, languages, lengths, and intentions. Distribution shift threads through both. When inputs, models, or data change, yesterday's evidence weakens.

There is also a third challenge that matters in practice: the ability to "fix" errors is limited because the core model largely sits outside the usual edit-compile-test loop. One cannot simply patch a function and be done. There are levers to pull, but often no precise patches to apply.

To deploy ML systems, we need to move from a brittle notion of correctness to engineered bounds on risk. That shift rests on three pillars: constrain what the system is allowed to do, gather probabilistic evidence over representative scenarios, and analyze the space of fixes with a realistic view of what each lever can and cannot achieve.

What the two core changes imply

Non-determinism. Reproducibility weakens, and a single golden output is often the wrong oracle. Two consequences follow. First, evaluation benefits from properties that any acceptable answer should satisfy, rather than from one target string. Second, decision procedures can incorporate aggregation and checks: for example, agreement among multiple samples, or agreement between an answer and evidence that was retrieved for it. These mechanisms do not remove randomness; they use it to expose instability.
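
To make the aggregation idea concrete, here is a minimal sketch of an agreement check across repeated samples. The `generate` callable is a stand-in for whatever sampling interface a given stack exposes, and the threshold is illustrative rather than a recommendation.

```python
from collections import Counter
from typing import Callable, List

def agreement_check(generate: Callable[[str], str], prompt: str,
                    k: int = 5, min_agreement: float = 0.8) -> dict:
    """Sample k outputs and measure how often the modal answer appears.

    Low agreement does not prove the modal answer is wrong; it flags
    instability worth routing to a fallback, a verifier, or a human.
    """
    samples: List[str] = [generate(prompt).strip().lower() for _ in range(k)]
    top_answer, top_count = Counter(samples).most_common(1)[0]
    agreement = top_count / k
    return {"answer": top_answer,
            "agreement": agreement,
            "stable": agreement >= min_agreement}
```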

Unbounded inputs. Enumeration is impossible. Coverage becomes a statement about scenarios rather than about individual examples. A scenario is defined along axes such as task type, domain, language, input length, retrieval quality, and adversarial pressure. Evidence then concerns how well the system behaved in each scenario and how often real traffic occupies those scenarios. When the mix shifts, the original claims degrade in a known way, not an invisible way.

Both properties explain the kinds of drift that appear in production. The response is not "more tests" in the abstract. The response is to declare scope precisely and to detect when traffic leaves that scope.

Constraints and properties: what counts as correct

When correct outputs cannot be enumerated for unbounded inputs, correctness shifts from point specifications to property classes. This requires a compositional approach where different property types guard against different failure modes.

The specification hierarchy. Properties form a natural hierarchy from syntactic to semantic:

  1. Structural constraints (contracts). These enforce output well-formedness independent of input content. Schema validation, type constraints, and budget limits provide the narrowest but most reliable bounds. They catch malformed JSON, type violations, and resource abuse, but say nothing about semantic correctness.
  2. Invariance properties (metamorphic relations). These specify how outputs should change, or stay the same, under input transformations. Key classes include:
    • Equivalence invariants: Paraphrases should yield identical classifications
    • Monotonicity invariants: Adding gold evidence should not degrade answer quality
    • Consistency invariants: Formatting changes should not alter semantic outputs
    Together, these properties test semantic stability without requiring ground truth for every input.
  3. Robustness properties (placebo/ablation tests). These verify that the model ignores irrelevant information and attends to relevant information:
    • Placebo resistance: Adding distractor passages should not change answers
    • Sufficiency tests: Removing irrelevant context should preserve correct outputs
    • Necessity tests: Removing critical context should trigger abstention
    These properties directly probe for spurious correlations and over-reliance on superficial cues. A minimal sketch of such property checks follows this list.
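
As a concrete illustration of the hierarchy, here is a minimal sketch of an equivalence invariant and a placebo-resistance check. The `classify` and `answer` callables are hypothetical wrappers around a model call, and exact string equality stands in for whatever semantic comparison a real harness would use.

```python
from typing import Callable, List

def check_equivalence_invariant(classify: Callable[[str], str],
                                text: str, paraphrase: str) -> bool:
    """Equivalence invariant: paraphrases should receive the same label."""
    return classify(text) == classify(paraphrase)

def check_placebo_resistance(answer: Callable[[str, List[str]], str],
                             question: str, evidence: List[str],
                             distractor: str) -> bool:
    """Placebo resistance: adding an irrelevant distractor passage to the
    same evidence should not change the answer."""
    baseline = answer(question, evidence)
    perturbed = answer(question, evidence + [distractor])
    # In practice, compare answers semantically rather than by exact equality.
    return baseline == perturbed
```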

Evidence under uncertainty: probabilistic claims and scenario coverage

With stochastic outputs, acceptance becomes probabilistic. A typical claim has the form: for a given scenario, the failure rate is at most $p$ with confidence $(1-\alpha)$. The required sample size follows from standard bounds. For instance, with zero observed failures, the so-called rule of three gives an approximate 95 percent upper bound of $3/N$. Targeting 1 percent at 95 percent confidence suggests on the order of 300 independent trials. Tighter targets require more samples. When failures are observed, interval estimates such as Wilson or Clopper–Pearson communicate uncertainty more honestly than point estimates.
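
A short sketch of the arithmetic above, assuming independent trials: the rule of three for the zero-failure case, and a Wilson score upper bound when failures are observed.

```python
import math

def rule_of_three_upper_bound(n_trials: int) -> float:
    """Approximate 95% upper bound on the failure rate after n_trials
    independent trials with zero observed failures."""
    return 3.0 / n_trials

def wilson_upper_bound(failures: int, n_trials: int, z: float = 1.96) -> float:
    """Upper limit of the Wilson score interval for the failure rate
    (z = 1.96 corresponds to a two-sided 95% interval)."""
    p_hat = failures / n_trials
    denom = 1 + z ** 2 / n_trials
    center = p_hat + z ** 2 / (2 * n_trials)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n_trials + z ** 2 / (4 * n_trials ** 2))
    return (center + margin) / denom

print(rule_of_three_upper_bound(300))  # ~0.01, i.e. the 1% target at ~95% confidence
print(wilson_upper_bound(2, 300))      # ~0.024 with 2 failures observed in 300 trials
```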

Independence and multiple comparisons. Two statistical issues threaten validity:

  1. Dependence between trials shrinks the effective sample size below the nominal count, which makes confidence statements look stronger than they are. Near-duplicate inputs, temporal correlation, and shared random seeds violate independence assumptions. Mitigate by varying prompts, seeds, time windows, and users.
  2. Multiple testing degrades system-level confidence. If each of $m$ scenario claims holds at individual confidence $(1-\alpha)$, the guarantee that all hold simultaneously falls to roughly $(1-\alpha)^m$, or about $1 - m\alpha$ for small $\alpha$. Apply Bonferroni correction, false discovery rate control, or hierarchical testing to maintain the overall error rate, as sketched below.
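
A minimal sketch of the correction logic, assuming scenario claims are tested with zero-failure acceptance: split the family-wise error budget across scenarios, then size each scenario's sample from the exact binomial bound.

```python
import math

def per_scenario_alpha(family_alpha: float, n_scenarios: int) -> float:
    """Bonferroni: spend alpha/m on each of m scenario-level claims."""
    return family_alpha / n_scenarios

def trials_for_zero_failure_bound(target_rate: float, alpha: float) -> int:
    """Trials needed so that zero observed failures implies the failure rate
    is below target_rate at confidence 1 - alpha: (1 - target_rate)^n <= alpha."""
    return math.ceil(math.log(alpha) / math.log(1 - target_rate))

alpha_i = per_scenario_alpha(0.05, 20)               # 20 scenarios -> 0.0025 each
print(trials_for_zero_failure_bound(0.01, alpha_i))  # ~597 trials per scenario, vs ~299 at 0.05
```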

Scenario coverage through grids. Since exhaustive testing is impossible, partition the input space into scenarios along interpretable axes: task type, domain, language, input length, retrieval quality, and adversarial pressure. This scenario grid serves three purposes:

  1. Targeted sampling: Populate each cell with diverse test cases including paraphrases and adversarial variants
  2. Production alignment: Track what fraction of real traffic maps to each scenario
  3. Drift detection: Monitor when traffic distribution shifts across scenarios
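
To illustrate, here is a minimal sketch of drift detection over a scenario grid, using total variation distance between the baseline traffic mix and a current window. The scenario keys are illustrative tuples; a real grid would use whatever axes the team has declared.

```python
from collections import Counter
from typing import Dict, List, Tuple

def scenario_mix(events: List[Tuple]) -> Dict[Tuple, float]:
    """Normalize scenario-cell counts into a traffic distribution."""
    counts = Counter(events)
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}

def total_variation(p: Dict[Tuple, float], q: Dict[Tuple, float]) -> float:
    """0.5 * sum |p - q| over the union of cells; 0 means identical mixes."""
    cells = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cells)

baseline = scenario_mix([("qa", "en", "short")] * 80 + [("summarize", "en", "long")] * 20)
current = scenario_mix([("qa", "en", "short")] * 50 + [("qa", "de", "long")] * 50)
print(total_variation(baseline, current))  # 0.5: a large shift worth alerting on
```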

Maintain a repository of production incidents as ground truth. Include red team inputs proportional to their expected occurrence: jailbreaks, prompt injections, confusable characters, tool-abuse patterns, and long-context stressors.

Quantifying and pricing risk

Risk is the product of frequency and cost. The analytical move is to separate routine loss from tail risk and to make both explicit.

  1. Taxonomy and costs. A failure mode can be described by severity, detectability, and expected remediation. Examples include unsupported factual claims that slip past verifiers, harmful or non-compliant content, correct-looking but wrong summaries, tool misuse by an agent, and excessive abstention that degrades experience. Catastrophic modes invite strong constraints such as human approval, reduced privileges, deterministic decision rules, or prohibition.
  2. Targets from tolerance. If a failure mode has average cost $C$ and acceptable expected loss over a period is $L$, then a target failure probability for that period is $L/C$. The evaluation plan then seeks to show, per scenario, that the upper bound on the failure rate is below that target at a chosen confidence level.
  3. Tail constraints. Expected loss is appropriate for routine error. Rare, high-severity events are better handled by hard constraints and fences: scoped credentials, limited budgets, and multi-person approval for irreversible actions.
  4. Abstention as a priced action. Abstaining is neither free nor a last resort. If $C_{\text{error}}$ is the cost of a wrong answer and $C_{\text{abstain}}$ is the cost of declining to answer, then a confidence threshold can be chosen to minimize $p_{\text{error}}(\tau) \cdot C_{\text{error}} + p_{\text{abstain}}(\tau) \cdot C_{\text{abstain}}$. The optimal threshold depends on the scenario mix and should be revisited as that mix changes. A minimal sketch of this threshold choice follows the list.
  5. Confidence and calibration. Raw model "confidence" is not reliably calibrated. Practical proxies include agreement across multiple samples, agreement with an independent verifier, the presence and quality of grounded citations, entropy where available, and similarity between retrieved evidence and the answer. A simple calibrator, such as logistic or isotonic regression, can map these signals to empirical probabilities within each scenario. Post-deployment calibration curves then indicate whether the system's probability statements match observed frequencies.
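
A minimal sketch tying items 2 and 4 together: derive a target failure probability from tolerated loss, and choose the abstention threshold that minimizes expected cost on a labeled validation set. The `confidence` scores stand in for whatever calibrated proxy signals are available; both arrays are assumptions of the sketch.

```python
import numpy as np

def target_failure_probability(acceptable_loss: float, cost_per_failure: float) -> float:
    """Item 2: acceptable expected loss L over a period, divided by cost C."""
    return acceptable_loss / cost_per_failure

def choose_threshold(confidence: np.ndarray, correct: np.ndarray,
                     c_error: float, c_abstain: float) -> float:
    """Item 4: the tau minimizing p_error(tau)*C_error + p_abstain(tau)*C_abstain,
    where both probabilities are unconditional rates over the validation set."""
    best_tau, best_cost = 0.0, float("inf")
    for tau in np.unique(confidence):
        answered = confidence >= tau
        p_abstain = 1.0 - answered.mean()
        p_error = (answered & ~correct.astype(bool)).mean()
        cost = p_error * c_error + p_abstain * c_abstain
        if cost < best_cost:
            best_tau, best_cost = tau, cost
    return best_tau
```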

For high-risk features, some teams adopt a safety case: a short argument that states the claims, links them to evidence, and lists the controls and playbooks that apply in production. The value is not ceremony but clarity about what is being claimed, about where the evidence lives, and about how the system behaves when assumptions fail.

Intervention strategies when the model is largely immutable

The immutability of the base model forces a fundamental question: where in the system can we intervene to address different error types? The answer depends first on understanding the causal mechanisms behind failures, then mapping those mechanisms to appropriate intervention points.

Errors in LLM systems stem from distinct causes. Some are merely structural—correct content in wrong formats, malformed JSON, inconsistent schemas. Others reflect missing context: the model lacks information that exists but wasn't retrieved or was poorly framed. A third class involves task misspecification where the model misunderstands what's being asked. Deeper still are knowledge errors where the model's parametric knowledge is wrong or absent, reasoning errors where it makes invalid inferences despite having correct information, and calibration errors where confidence systematically misaligns with accuracy. Finally, alignment errors violate intended policies or safety constraints.

These error classes map naturally to a hierarchy of intervention points, with a key trade-off: interventions closer to the output are safer but less powerful, while those closer to the model are more powerful but riskier. At the surface, output constraints (schemas, grammars) can enforce structural correctness but cannot touch semantic errors. Moving inward, context manipulation (retrieval, prompts) addresses information gaps and task misunderstanding, but these fixes prove brittle—what works for today's prompt distribution may fail tomorrow. External validators can detect and reject any error type they're programmed to recognize, but they cannot generate correct answers, only identify wrong ones. At the core, model adaptation (fine-tuning, RLHF) can shift fundamental behavior, but with risks of regression, reward hacking, and lost calibration.

This structure suggests a selection principle: use the minimal intervention that can address the error's causal mechanism. Format violations need only output constraints. Hallucinations in the presence of available evidence point to retrieval or verification fixes. Systematic overconfidence may require fine-tuning with calibration data. The principle is conservative by design—deeper interventions carry compound risks that accumulate across the system.

The practical tool is a fixability matrix mapping error types to intervention points, scored by effectiveness, cost, and risk. This matrix reveals an uncomfortable truth: many errors have no clean fix at any layer. The immutability constraint forces a separation between detecting errors (possible through validation for any error type) and correcting them (requires matching intervention to cause). We're not debugging in the traditional sense but building compensating controls around a black box we cannot directly modify.
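
Here is a minimal sketch of such a matrix, with placeholder effectiveness scores rather than empirical claims; a team would fill in effectiveness, and parallel tables for cost and risk, from its own incident history and ablations.

```python
from typing import Dict, Optional

# Interventions ordered from shallow (safer, weaker) to deep (riskier, stronger).
INTERVENTIONS = ["output_constraints", "context_manipulation",
                 "external_validator", "model_adaptation"]

# Placeholder effectiveness scores (0-3); omitted entries default to 0.
FIXABILITY: Dict[str, Dict[str, int]] = {
    "format_violation":  {"output_constraints": 3, "external_validator": 2},
    "missing_context":   {"context_manipulation": 3, "external_validator": 1},
    "unsupported_claim": {"context_manipulation": 2, "external_validator": 2},
    "miscalibration":    {"external_validator": 1, "model_adaptation": 2},
}

def minimal_adequate_intervention(error_type: str, min_score: int = 2) -> Optional[str]:
    """Selection principle: the shallowest intervention whose effectiveness
    clears the bar; None means the error can only be detected and accepted."""
    for intervention in INTERVENTIONS:
        if FIXABILITY.get(error_type, {}).get(intervention, 0) >= min_score:
            return intervention
    return None

print(minimal_adequate_intervention("format_violation"))  # output_constraints
print(minimal_adequate_intervention("miscalibration"))    # model_adaptation
```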

What emerges is a pragmatic philosophy: treat the model as a powerful but flawed component, surround it with the minimal machinery needed to bound its failures, and maintain clear documentation about which errors can be fixed, which can be detected, and which must be accepted as inherent limitations.
