Not so Prompt: Prompt Optimization as Model Selection
Here's a framework for prompt optimization:
Defining Success: Metrics and Evaluation Criteria
Before collecting any data, establish what success looks like for your specific use case. Choose a primary metric that directly reflects business value—accuracy for classification, F1 for imbalanced datasets, BLEU/ROUGE for generation tasks, or custom domain-specific measures like "percentage of correctly extracted invoice fields" or "customer issue resolution rate." This primary metric drives optimization decisions.
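For instance, a primary-metric function for an imbalanced classification task can be as small as this (a sketch assuming scikit-learn; the metric choice is illustrative):

```python
from sklearn.metrics import f1_score

def primary_metric(predictions, labels):
    """Primary metric for an (assumed) imbalanced classification task.

    Macro-F1 weights every class equally, so rare-class failures are not
    masked by majority-class accuracy.
    """
    return f1_score(labels, predictions, average="macro")
```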
Alongside your primary metric, define auxiliary constraints that you won't compromise on. These include output format compliance (does the JSON parse?), latency requirements (under 2 seconds per request), cost bounds ($0.01 per query), and safety requirements (no PII leakage, no harmful content). Treat these as pass/fail gates rather than metrics to optimize.
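In code, these gates can be a single pass/fail check that runs before any score is recorded; the thresholds below are placeholders, not recommendations:

```python
import json

# Illustrative thresholds; tune them to your own SLA and budget.
MAX_LATENCY_S = 2.0
MAX_COST_USD = 0.01

def passes_gates(raw_output: str, latency_s: float, cost_usd: float) -> bool:
    """Hard constraints: any failure disqualifies the candidate outright."""
    try:
        json.loads(raw_output)          # format compliance: does the JSON parse?
    except json.JSONDecodeError:
        return False
    if latency_s > MAX_LATENCY_S:       # latency budget
        return False
    if cost_usd > MAX_COST_USD:         # cost budget
        return False
    return True                         # safety/PII checks would be chained here too
```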
If you're using LLM judges for evaluation—common for subjective tasks like writing quality or helpfulness—implement careful controls. Randomize the order of responses being compared, normalize for length biases, use structured rubrics rather than open-ended judgments, and periodically validate against human evaluation. Remember that LLM judges can be gamed, so never use them as the sole evaluation method for high-stakes deployments.
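A pairwise-judge harness with the order control might look roughly like this; `call_judge` and the rubric are stand-ins for whatever judge model and prompt you actually use:

```python
import random

RUBRIC = "Score each response 1-5 for factual accuracy, completeness, and clarity."

def judge_pair(call_judge, prompt, response_a, response_b):
    """Compare two responses with randomized presentation order.

    `call_judge` is a placeholder for your judge-model call; it should return
    "first" or "second". Randomizing the order cancels out position bias when
    results are aggregated over many examples.
    """
    swapped = random.random() < 0.5
    first, second = (response_b, response_a) if swapped else (response_a, response_b)
    verdict = call_judge(rubric=RUBRIC, prompt=prompt, first=first, second=second)
    winner_is_first = (verdict == "first")
    return "b" if (winner_is_first == swapped) else "a"
```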
Data for Evaluation
Once metrics are defined, determine how much data you need for statistically valid comparisons. To estimate a metric to within ±3 percentage points at 95% confidence, you'll need approximately 1,000 labeled examples; for a ±5 point margin of error, around 400 suffice. Detecting a gap of that size between two prompts takes more care: score both on the same examples and use a paired test rather than comparing independent samples.
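These figures come from the standard margin-of-error formula for a proportion, n = z^2 * p(1-p) / e^2 with worst-case p = 0.5; a few lines reproduce them:

```python
import math

def sample_size(margin: float, z: float = 1.96, p: float = 0.5) -> int:
    """Examples needed for a +/-margin estimate at the given z (95% -> 1.96)."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

print(sample_size(0.03))  # 1068 -- roughly the 1,000 figure above
print(sample_size(0.05))  # 385  -- roughly the 400 figure above
```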
Random sampling is critical: your evaluation data must represent the true distribution of inputs your system will face in production. Stratified sampling is cheap to add and typically tightens standard errors; with proportional allocation it never makes them worse. Split your data thoughtfully: for small datasets (<1k examples), use K-fold cross-validation on the combined train and dev sets, reserving a single test set for final evaluation. For larger datasets, use traditional train/dev/test splits, but consider a reusable holdout set if you'll be making many optimization iterations. The cardinal rule: one final test evaluation only, no exceptions.
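A sketch of the small-data protocol, assuming scikit-learn and stand-in data (all names are illustrative):

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

examples = [f"input {i}" for i in range(1000)]   # stand-in for real inputs
labels = [i % 3 for i in range(1000)]            # stand-in for real labels

# One-time split: the held-out test set is touched exactly once, at the end.
dev_x, test_x, dev_y, test_y = train_test_split(
    examples, labels, test_size=0.2, stratify=labels, random_state=0
)

# All prompt selection happens under cross-validation on the dev portion.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in cv.split(dev_x, dev_y):
    pass  # score each candidate prompt on the validation fold here
```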
Structured Search Space
Rather than treating prompts as monolithic blocks of text, decompose them into modular components that can be independently modified and recombined:
- Instruction: The core task description
- Constraints: Guardrails and requirements
- Reasoning: Chain-of-thought scaffolding or step-by-step guidance
- Schema: Output format specifications
- Demonstrations: Few-shot examples
Define bounded edit operators that modify these components systematically: rephrasing instructions for clarity, adding or removing constraints, reordering reasoning steps, swapping demonstration examples. This decomposition transforms the infinite space of possible prompts into a tractable search problem.
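A minimal sketch of what that decomposition can look like in code, with component names and edit operators chosen purely for illustration:

```python
import random
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PromptCandidate:
    instruction: str
    constraints: tuple[str, ...]
    reasoning: str
    schema: str
    demonstrations: tuple[str, ...]

    def render(self) -> str:
        """Assemble the components into the final prompt text."""
        parts = [self.instruction, *self.constraints, self.reasoning,
                 self.schema, *self.demonstrations]
        return "\n\n".join(p for p in parts if p)

# Bounded edit operators: each returns a new candidate differing by one edit.
def drop_constraint(c: PromptCandidate) -> PromptCandidate:
    if not c.constraints:
        return c
    kept = list(c.constraints)
    kept.pop(random.randrange(len(kept)))
    return replace(c, constraints=tuple(kept))

def swap_demonstration(c: PromptCandidate, pool: list[str]) -> PromptCandidate:
    demos = list(c.demonstrations)
    if demos and pool:
        demos[random.randrange(len(demos))] = random.choice(pool)
    return replace(c, demonstrations=tuple(demos))
```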
Candidate Generation Methods
Options include:
- Meta-prompting (OPRO, Yang et al. 2023) uses an LLM to propose new prompts given the history of previous attempts and their scores. It's simple to implement but can get stuck in local optima without careful temperature control and diversity mechanisms; a minimal loop is sketched after this list.
- Evolutionary search maintains a population of prompts that evolve through mutation (single edits) and crossover (combining components from different prompts). This population-based approach naturally explores more of the search space than single-trajectory methods and helps guard against premature convergence.
- Failure-aware refinement specifically targets weaknesses in the current best prompt. After each round, mine the failures, generate counter-examples and new constraints, then require the next iteration to address these specific cases. This approach treats each failure as a unit test that future candidates must pass.
- RL-based optimization trains a small policy network to propose edits, using task performance as the reward signal. However, this approach suffers from high variance and should only be considered when simpler methods have plateaued.
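To make the meta-prompting option concrete, here is a rough OPRO-style loop. `generate` and `evaluate` are placeholders for your LLM call and evaluation harness, and the meta-prompt wording is illustrative rather than taken from the paper:

```python
def optimize(generate, evaluate, seed_prompt: str, rounds: int = 10) -> str:
    """OPRO-style loop (sketch): show scored history, ask for a better prompt.

    `generate(text) -> str` and `evaluate(prompt) -> float` are placeholders
    for your LLM call and your evaluation harness.
    """
    history = [(seed_prompt, evaluate(seed_prompt))]
    for _ in range(rounds):
        history.sort(key=lambda pair: pair[1])          # ascending, best last
        shown = "\n".join(f"score={s:.3f}\nprompt:\n{p}" for p, s in history[-5:])
        meta_prompt = (
            "Below are prompts for a task and their scores, best last.\n"
            f"{shown}\n\n"
            "Write a new prompt that is different from all of the above "
            "and likely to score higher. Return only the prompt."
        )
        candidate = generate(meta_prompt)
        history.append((candidate, evaluate(candidate)))
    return max(history, key=lambda pair: pair[1])[0]
```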
Efficient Evaluation Strategy
Before expensive evaluation, apply diversity filters to avoid wasting compute on near-duplicates. Reject candidates whose normalized edit-distance similarity to an existing prompt exceeds 0.85 or whose embedding cosine similarity exceeds 0.95. These simple filters prevent population collapse without complex novelty mechanisms.
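Both filters fit in a few lines; this sketch uses the standard library's difflib as a stand-in for normalized edit distance and assumes embeddings are computed elsewhere:

```python
import difflib
import numpy as np

EDIT_SIM_MAX = 0.85     # thresholds from the text; treat them as starting points
COSINE_SIM_MAX = 0.95

def is_novel(candidate: str, candidate_vec: np.ndarray,
             existing: list[str], existing_vecs: list[np.ndarray]) -> bool:
    """Reject near-duplicates before spending any evaluation budget."""
    for prompt, vec in zip(existing, existing_vecs):
        # Character-level similarity (a stand-in for normalized edit distance).
        if difflib.SequenceMatcher(None, candidate, prompt).ratio() > EDIT_SIM_MAX:
            return False
        cosine = float(candidate_vec @ vec /
                       (np.linalg.norm(candidate_vec) * np.linalg.norm(vec)))
        if cosine > COSINE_SIM_MAX:
            return False
    return True
```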
During evaluation, use racing algorithms (Jamieson & Talwalkar, 2016) to identify winners efficiently. Evaluate candidates in blocks, progressively pruning those whose upper confidence bound cannot beat the incumbent's lower bound. This approach can reduce evaluation costs by 50-80% compared to exhaustive testing.
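A simplified racing loop along these lines might look like the following, assuming per-example scores in [0, 1] and Hoeffding-style confidence bounds; block size and confidence level are illustrative:

```python
import math

def race(candidates, score_example, examples, block_size=50, delta=0.05):
    """Evaluate in blocks, dropping candidates whose upper confidence bound
    falls below the best candidate's lower bound.

    `score_example(candidate, example) -> float in [0, 1]` is a placeholder
    for your per-example scoring function.
    """
    totals = {c: 0.0 for c in candidates}
    alive = list(candidates)
    n = 0
    for start in range(0, len(examples), block_size):
        block = examples[start:start + block_size]
        for c in alive:
            totals[c] += sum(score_example(c, ex) for ex in block)
        n += len(block)
        radius = math.sqrt(math.log(2 * len(candidates) / delta) / (2 * n))
        best_lower = max(totals[c] / n for c in alive) - radius
        alive = [c for c in alive if totals[c] / n + radius >= best_lower]
        if len(alive) == 1:
            break
    return alive
```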
Constraints and Governance
Certain requirements are non-negotiable and should be enforced as hard constraints rather than soft objectives:
- Format compliance: Outputs must parse correctly
- Latency/cost bounds: Stay within operational budgets
- Safety requirements: No harmful or biased outputs
- Honesty: No claiming capabilities that the system doesn't have
Before any prompt reaches production, conduct a human audit focusing on these constraints, plus maintainability. This isn't about second-guessing the metrics but catching failure modes that automated evaluation might miss.