Not so Prompt: Prompt Optimization as Model Selection

Prompt optimization is a budgeted black-box optimization problem. Two things make it harder: 1) the mapping from prompt to performance is opaque, and 2) measurements are noisy because of sampling (which rows you draw), decoding randomness (temperature/top-p), retrieval/context variability, serving load, and judge variability. Under these conditions, progress comes from careful definition of the search space, sample-efficient exploration, and evaluation that reflects reality.

Objective and Constraints

With multiple metrics, each decision can feel like defending an arbitrary tradeoff (is 1.1 ms worth 0.2 percentage points?). This can lead to decision paralysis and endless parameter tuning. There are two clean ways out. If you can price everything credibly (often possible only after release, e.g., once you can estimate what latency costs the product), collapse everything into a single monetary objective. The other is constrained optimization: pick a single primary metric (precision@K for retrieval, field-level accuracy for extraction) with hard boundaries on secondary factors (latency < X ms, cost < $Y, valid JSON, etc.).

A crucial refinement is to express those constraints statistically and measure them under production-like load. A prompt tuned with cached responses and unlimited time may look great offline but fail under real conditions because its optimization surface was different. End-to-end evaluation—in the actual serving environment with the same retrieval stack, tokenization, decoding parameters, and concurrency limits—ensures that measured gains survive deployment.

Hence, instead of “latency < X ms,” maintain an upper confidence bound on the P95/P99 measured under production‑like load and require that bound to sit under your limit; instead of “violation rate < ε,” maintain an upper bound on that rate and only compare quality among prompts whose bounds sit inside the safe region. Count both prompt and output tokens because cost scales with total usage, not just input length. And calibrate any LLM judge and propagate its noise into your intervals. Like with the secondary metrics, we want a large random sample of real (or at least realistic) data.
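A rough sketch of what checking such statistical constraints might look like. The thresholds and data are invented; a one-sided Wilson bound for the violation rate and a percentile bootstrap for tail latency are one reasonable choice among several:

```python
import numpy as np
from scipy import stats

def violation_rate_upper_bound(violations: int, n: int, alpha: float = 0.05) -> float:
    """One-sided Wilson score upper bound on the true violation rate."""
    z = stats.norm.ppf(1 - alpha)
    p = violations / n
    denom = 1 + z**2 / n
    center = p + z**2 / (2 * n)
    margin = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center + margin) / denom

def p95_upper_bound(latencies_ms: np.ndarray, alpha: float = 0.05, n_boot: int = 2000) -> float:
    """Percentile-bootstrap upper bound on the P95 latency."""
    rng = np.random.default_rng(0)
    boot = [np.percentile(rng.choice(latencies_ms, size=len(latencies_ms), replace=True), 95)
            for _ in range(n_boot)]
    return float(np.percentile(boot, 100 * (1 - alpha)))

# A prompt is "feasible" only if both upper bounds sit inside the limits.
latencies = np.array([420, 510, 480, 900, 450, 475, 610, 530, 495, 700])  # toy measurements
feasible = (violation_rate_upper_bound(violations=3, n=200) < 0.02
            and p95_upper_bound(latencies) < 1000)
```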

Evaluation Design

To ensure generalizability, split the development pool into K stratified folds (respecting groups and time where needed). Keep paired comparisons within each fold to cut variance. And prevent leakage: items used as demonstrations never appear in evaluation folds; repaired failures move to a regression slice, so they aren’t counted as “unseen” again.
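A minimal sketch of stratified, grouped folds with paired scoring, assuming toy strata and groups and a stand-in score_prompt() in place of the real evaluation harness:

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

# One row per evaluation item, with a stratification label (e.g., task type)
# and a group id (e.g., source document) so related items stay in one fold.
items = np.arange(500)
strata = np.random.default_rng(1).integers(0, 4, size=500)   # toy strata
groups = items // 5                                          # toy grouping

def score_prompt(prompt_name: str, idx: np.ndarray) -> np.ndarray:
    """Stand-in for the real harness; returns a per-item score for each index."""
    rng = np.random.default_rng(abs(hash(prompt_name)) % 2**32)
    return rng.random(len(idx))

folds = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
paired_diffs = []
for _, eval_idx in folds.split(items.reshape(-1, 1), strata, groups):
    # Both prompts are scored on exactly the same items, so the comparison is paired.
    a = score_prompt("candidate", items[eval_idx])
    b = score_prompt("baseline", items[eval_idx])
    paired_diffs.append(float(np.mean(a - b)))
```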

Candidate Generation

Two techniques help: structuring the prompt into components, and mining failures for targeted repairs.

Structured Prompts

Treating prompts as monolithic strings creates an exploration problem: each edit potentially affects multiple behaviors simultaneously, making it impossible to isolate which changes drive improvement. Decomposing the prompt into components such as task instruction, constraints, reasoning scaffolding, output schema, demonstrations, etc., transforms the search from unstructured text editing into combinatorial optimization where:

    1. Attribution becomes tractable. When a variant combining instruction A, constraint set B, and demos C outperforms the baseline, you can run ablations to identify whether the gain came from clearer instructions, tighter constraints, or better examples. This knowledge accumulates: winning instructions can be adapted to related tasks, proven constraints can become standard guardrails.
    2. Search efficiency improves through factorization. With 5 components, each having 4 variants, the structured space spans 4^5 = 1,024 combinations, while unstructured rewrites typically generate dozens of variations that differ in unpredictable ways. The structured space provides better coverage per evaluation because each test point represents a deliberate hypothesis about component interactions.

The trade-offs are real. Components can interact non-linearly: constraints can nullify the value of demonstrations when they directly encode the same behavior; reasoning scaffolds can reduce schema compliance when they encourage verbose exploration. Over-modularization can also fragment natural language flow and multiply maintenance points. Ablations (remove one component, swap another) and selective coupled edits (jointly modify instruction+schema when they reference shared concepts) can help validate whether structural complexity delivers commensurate gains.
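A minimal sketch of the factorized search space in code; the component names and variant strings are invented for illustration, and leave-one-out ablations fall out naturally by swapping a single slot back to the baseline:

```python
from itertools import product

# Hypothetical component variants; in practice each string would be much longer.
components = {
    "instruction": ["Extract the fields below.", "You are a careful data-entry clerk. Extract the fields below."],
    "constraints": ["Return null for missing fields.", "Never guess; return null if unsure."],
    "schema":      ["Respond with JSON matching this schema: {...}", "Respond with JSON, keys in this exact order: {...}"],
    "demos":       ["<demo set A>", "<demo set B>"],
}

def assemble(choice: dict) -> str:
    """Join the chosen variant of each component into one prompt string."""
    return "\n\n".join(choice[name] for name in components)

# Full factorized grid: every combination of one variant per component.
grid = [dict(zip(components, combo)) for combo in product(*components.values())]

# Leave-one-out ablation of the current best: swap one component back to the baseline.
best, baseline = grid[5], grid[0]
ablations = [{**best, name: baseline[name]} for name in components]
```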

Learning from Failures

Progress depends on turning failures into structured hypotheses. Collect examples where the model fails, study them for recurring patterns, and organize them into a small, mutually exclusive and collectively exhaustive taxonomy—a MECE map of what can go wrong. Each category should point to a plausible cause and a corresponding repair: a schema hint to prevent misformatted outputs, a counterexample to steady a brittle instruction, or a clarifying phrase to resolve ambiguity.

For each repair, make the hypothesis explicit: “This edit fixes this error type and should improve examples of kind X without hurting Y.” Then test it empirically. Compare paired prompts—identical except for the edit—on the subset of examples that expose the issue. Because evaluations are noisy, treat outcomes probabilistically: a fix that consistently trends upward across replications is more trustworthy than a single strong result. Retain only those repairs that show dependable gains without collateral regressions.
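One way to make that test concrete is a paired bootstrap on the failure slice. The per-item scores below are toy numbers; in practice they come from your evaluation harness:

```python
import numpy as np

def paired_bootstrap_gain(scores_before: np.ndarray, scores_after: np.ndarray,
                          n_boot: int = 5000, seed: int = 0) -> float:
    """Probability, over bootstrap resamples of the same items, that the
    repaired prompt scores higher on average than the original."""
    rng = np.random.default_rng(seed)
    diffs = scores_after - scores_before      # paired: same items, only the edit differs
    n = len(diffs)
    wins = sum(rng.choice(diffs, size=n, replace=True).mean() > 0 for _ in range(n_boot))
    return wins / n_boot

# Toy per-item correctness on the slice that exposed the failure.
before = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0], dtype=float)
after  = np.array([1, 0, 1, 1, 1, 0, 1, 1, 0, 1], dtype=float)
print(paired_bootstrap_gain(before, after))   # close to 1.0 -> the repair trends upward
```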

To guard against overfitting, every proposed fix must clear four gates before adoption: (1) a clear lift on the slice that revealed the failure, (2) no regressions elsewhere, (3) robustness to light perturbations (synonym swaps, order changes, small temperature shifts), and (4) a leakage check—swap the demonstrations for equivalent ones and confirm the improvement holds.

Repairs that pass become part of the prompt. The once-problematic examples join a permanent regression slice, ensuring no old failure returns. Over time, that slice becomes a living test suite—a memory of past errors that keeps progress cumulative rather than circular.

Exploration Under Budget

With evaluation budget B (queries × tokens × human time), the allocation problem is: given N candidates and noisy measurements, how do you distribute trials to maximize information gain about which variant performs best?

Racing (successive halving/Hyperband) exploits the typical power-law distribution of candidate quality. Since most variants underperform the baseline, investing equally in all candidates wastes budget on obvious failures. By evaluating all candidates on a small shared sample, then pruning those whose upper confidence bounds fall below the leader's lower bound, racing redirects saved budget toward discriminating among competitive variants. This strategy excels when initial candidate pools are large (>20) and performance distributions are heavy-tailed.
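A minimal successive-halving sketch under simplifying assumptions: a binary per-item metric, classic keep-the-top-half pruning rather than the confidence-bound pruning described above, and an invented evaluate() stand-in for the real harness:

```python
import numpy as np

def evaluate(candidate: str, n_items: int, rng) -> np.ndarray:
    """Stand-in for the real harness: returns n_items binary outcomes.
    Each candidate has a fixed hidden quality derived from its name."""
    p = 0.5 + 0.4 * (hash(candidate) % 100) / 100
    return rng.random(n_items) < p

def successive_halving(candidates, rounds=3, base_items=20, seed=0):
    rng = np.random.default_rng(seed)
    scores = {c: [] for c in candidates}
    survivors = list(candidates)
    for r in range(rounds):
        n_items = base_items * 2**r           # spend more on each surviving candidate
        for c in survivors:
            scores[c].extend(evaluate(c, n_items, rng))
        # Keep the top half by mean score on the shared items seen so far.
        survivors.sort(key=lambda c: np.mean(scores[c]), reverse=True)
        survivors = survivors[:max(1, len(survivors) // 2)]
    return survivors[0], scores

best, _ = successive_halving([f"prompt_{i}" for i in range(16)])
```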

Bandits (UCB/Thompson) solve the exploration-exploitation trade-off when choosing where to allocate the next batch among survivors. By maintaining posterior distributions over each candidate's performance (Beta for binary outcomes, Gaussian for continuous scores), bandit algorithms allocate trials where uncertainty × potential gain is highest. Thompson sampling adds stochasticity that prevents premature convergence, while UCB provides worst-case guarantees. For pairwise preference data, dueling bandits with Bradley-Terry models handle intransitive preferences better than scalar scoring.
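A minimal Thompson sampling sketch with Beta posteriors over a binary per-item outcome; the candidate names and their "true" success rates are invented, and run_item() stands in for the real evaluation:

```python
import numpy as np

rng = np.random.default_rng(0)
candidates = ["prompt_A", "prompt_B", "prompt_C"]
wins = {c: 1 for c in candidates}      # Beta(1, 1) uniform priors
fails = {c: 1 for c in candidates}

def run_item(candidate: str) -> bool:
    """Stand-in for one evaluation: did the candidate get this item right?"""
    true_p = {"prompt_A": 0.55, "prompt_B": 0.70, "prompt_C": 0.60}[candidate]
    return rng.random() < true_p

for _ in range(300):                   # evaluation budget, in items
    # Thompson step: sample a plausible success rate for each candidate,
    # then spend the next item on whichever sample is highest.
    samples = {c: rng.beta(wins[c], fails[c]) for c in candidates}
    chosen = max(samples, key=samples.get)
    if run_item(chosen):
        wins[chosen] += 1
    else:
        fails[chosen] += 1

print({c: wins[c] / (wins[c] + fails[c]) for c in candidates})
```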

Surrogate-guided proposals (Bayesian optimization/TPE/BOHB) become valuable when candidate features predict performance. By fitting a regression model (Gaussian process, random forest) mapping prompt features to observed scores, surrogates suggest new candidates likely to exceed the current best while maintaining diversity through acquisition functions like Expected Improvement. This works when structural patterns exist (e.g., "constraints mentioning safety improve robustness") but fails when performance is idiosyncratic to exact wording.
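A crude sketch of a surrogate using a random forest over binary prompt features, with the spread across trees standing in for predictive uncertainty in an Expected Improvement score. The features and observed scores are invented:

```python
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor

# Hypothetical binary features per candidate (e.g., "has safety constraint",
# "uses step-by-step scaffold", "includes two demos") and observed scores.
X_seen = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 1]])
y_seen = np.array([0.62, 0.71, 0.58, 0.55, 0.66])

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_seen, y_seen)

def expected_improvement(X_new: np.ndarray, best: float) -> np.ndarray:
    """EI using the per-tree spread as a rough uncertainty estimate."""
    per_tree = np.stack([t.predict(X_new) for t in forest.estimators_])
    mu, sigma = per_tree.mean(axis=0), per_tree.std(axis=0) + 1e-9
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

# Score every feature combination and propose the most promising one to try next.
X_pool = np.array([[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)])
ei = expected_improvement(X_pool, best=y_seen.max())
print(X_pool[np.argmax(ei)])
```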

No single method dominates because each exploits different problem structures. Racing leverages heavy-tailed distributions, bandits balance exploration with exploitation, surrogates exploit feature-performance correlations, and failure mining targets specific weaknesses. Combining them—racing for initial filtering, bandits for refined allocation, surrogates for directed search, failure mining for robustness—captures their complementary strengths.

A Pragmatic Optimization Loop

The following routine balances theoretical efficiency with practical constraints:

  1. Feasibility screen. Generate diverse candidates through structured edits. Eliminate variants failing hard constraints (unparseable output, safety violations, excessive latency) before expensive evaluation. Cheap deduplication via embedding similarity prevents evaluating near-clones that provide redundant information (a minimal dedup sketch appears after this list).
  2. Racing pass. Evaluate all feasible candidates on a small shared subset. Prune those whose upper confidence bounds fall below the incumbent's lower bound. This early filtering prevents budget waste on clear underperformers.
  3. Adaptive allocation. Among survivors, use Thompson sampling or UCB to allocate additional examples based on uncertainty and potential improvement. Stratified sampling maintains coverage across data slices. Continuously prune candidates whose credible intervals separate from the leader.
  4. Surrogate-guided proposal. Fit a lightweight model on component features and accumulated outcomes. Generate high-expected-improvement candidates that maintain diversity through explicit diversity penalties or batch selection methods.
  5. Failure mining. Extract patterns from error cases, convert to explicit constraints or counter-examples, and require future candidates to handle them. This transforms the long tail of errors into regression tests.
  6. Final validation. When marginal improvements fall below practical significance thresholds, run the held-out test set once. Avoid repeatedly testing against the same held-out data, which converts it into a training set through researcher degrees of freedom.
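The near-duplicate check from step 1 might look like this minimal sketch, where embed() is a stand-in for a real embedding model and the similarity threshold is an assumption to tune:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model (e.g., a sentence embedder)."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(64)

def deduplicate(candidates: list[str], threshold: float = 0.95) -> list[str]:
    """Drop candidates whose cosine similarity to an already-kept candidate
    exceeds the threshold; they would add little new information per eval."""
    kept, kept_vecs = [], []
    for text in candidates:
        v = embed(text)
        v = v / np.linalg.norm(v)
        if all(float(v @ u) < threshold for u in kept_vecs):
            kept.append(text)
            kept_vecs.append(v)
    return kept
```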

This isn't theoretically optimal—true optimality would use value-of-information calculations with knowledge gradients or Gittins indices. But it captures most available gains through a combination of breadth (racing), depth (bandits), learning (surrogates), and robustness (failure mining) while remaining simple enough to implement correctly. The explicit budget framing forces confrontation with the fundamental trade-off: more candidates explored shallowly versus fewer candidates evaluated deeply. Making this trade-off conscious rather than accidental is half the battle.
