What Problem is DSPy Solving With What Assumptions

Follow-up to: Not so Prompt: Prompt Optimization as Model Selection

The practical problem is to choose a prompt that performs best on a task while respecting real‑world constraints: the output must parse; latency and per‑request cost must stay within budget; safety rules must hold. That decision must also be made with a fixed optimization budget—some cap on how many prompt×example evaluations we can afford. Under those conditions, the task looks like fixed‑budget best‑arm identification: there is a set (or stream) of candidates; each evaluation is noisy; and we must pick the arm (prompt) with the highest expected payoff without overspending the budget.

Any serious solution has to address three things at once: how the search space is represented, how candidates are proposed, and how the limited evaluation budget is allocated under noise so that the final choice is reliable in deployment, not just in a lab notebook.

How DSPy Thinks About the Problem

DSPy begins with a strong programming model. You define modules with typed inputs and outputs (Signatures). You compose modules into pipelines. You get built‑in facilities to construct prompts, parse structured outputs, and keep configurations under version control. For optimization, DSPy provides components that generate instruction variants and curate few‑shot examples (e.g., bootstrapping from previous successes), and a Bayesian‑optimization‑style search (e.g., MIPROv2) to propose promising combinations. The system encourages iterative refinement: propose, evaluate on a metric, keep what helps.
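
To make the workflow concrete, here is a minimal sketch. It assumes a recent DSPy release; the exact class and parameter names (dspy.LM, MIPROv2's auto setting, the compile signature) vary somewhat across versions, and the task, model name, metric, and tiny trainset are illustrative.

```python
import dspy

# Configure a language model (model name is illustrative).
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class TriageTicket(dspy.Signature):
    """Summarize a support ticket and assign a severity."""
    ticket_text: str = dspy.InputField()
    summary: str = dspy.OutputField(desc="one sentence")
    severity: str = dspy.OutputField(desc="one of: low, medium, high")

program = dspy.Predict(TriageTicket)

def metric(example, prediction, trace=None):
    # Task-specific scalar score; a real metric would also check the summary.
    return float(prediction.severity == example.severity)

trainset = [
    dspy.Example(ticket_text="App crashes on login.", severity="high").with_inputs("ticket_text"),
    dspy.Example(ticket_text="Typo on the pricing page.", severity="low").with_inputs("ticket_text"),
]

# MIPROv2 proposes instruction variants and bootstrapped demos, then searches
# over combinations with a Bayesian-optimization-style proposer.
optimizer = dspy.MIPROv2(metric=metric, auto="light")
compiled = optimizer.compile(program, trainset=trainset)
```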

The implied mental model behind this design is recognizable:

  • Prompts behave a bit like hyperparameters—small changes often produce small performance shifts, and “good” regions are common enough that broad sampling plus a reasonable surrogate will find them.
  • Components compose mostly independently—better instructions and better demonstrations tend to add up.
  • A uniform or loosely controlled allocation of evaluation effort is acceptable—the main work is to propose good candidates; the way we compare them matters less.

When these assumptions hold, DSPy’s approach is convenient and productive. The abstractions reduce boilerplate, the proposer often produces reasonable variants, and the pipeline is easy to reproduce and share.

Where the Fit Is Good

For tasks with wide basins of “good enough” prompts—summaries of generic text, light question‑answering, routine classification—LLM‑generated paraphrases of instructions and straightforward example selection do produce steady gains. Typed I/O and parsing helpers prevent many basic errors. The Bayesian proposer can recycle signals from earlier trials to avoid obviously weak regions of the space. In short: when the landscape is forgiving and budgets are comfortable, the DSPy workflow keeps you moving.

Where the Fit Breaks and Why

The cracks appear when we take the fixed‑budget, deployment‑faithful view seriously.

Budget isn’t first‑class. In fixed‑budget best‑arm identification, allocation is half the problem. Later rounds should get more depth on fewer arms because distinguishing “good” from “great” costs more samples than rejecting the obviously bad. DSPy, as commonly used, focuses on proposing candidates and tends to evaluate them in roughly uniform ways. That’s comfortable, but under a hard cap it leaves accuracy on the table because it spends too much on early losers and too little on close finalists.
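
To see what allocation alone is worth, here is a back-of-the-envelope sketch of a halving-style schedule under the same cap. The keep-the-top-half rule and the numbers are illustrative, not a description of DSPy's behavior.

```python
import math

def halving_schedule(n_arms: int, budget: int):
    """Split a fixed evaluation budget over successive-halving rounds.

    Returns (arms_in_round, evals_per_arm) pairs: early rounds skim many arms
    shallowly, later rounds go deep on the few survivors."""
    rounds = max(1, math.ceil(math.log2(n_arms)))
    per_round = budget // rounds              # equal spend per round
    schedule, arms = [], n_arms
    for _ in range(rounds):
        schedule.append((arms, max(1, per_round // arms)))
        arms = max(1, (arms + 1) // 2)        # keep the top half
    return schedule

# 24 candidate prompts, a cap of 1,200 prompt x example evaluations:
# uniform allocation gives 50 per arm; this schedule gives each of the two
# finalists 120 in the last round and the eventual winner 270 in total.
print(halving_schedule(24, 1200))             # [(24, 10), (12, 20), (6, 40), (3, 80), (2, 120)]
```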

Constraints become weights. Latency, per‑request cost, and safety are often folded into a composite score (“accuracy × α + 1/latency × β”). Weighted scores are easy to compute, but they entangle a product decision (would we accept a bit less quality for lower latency?) with offline model selection. In practice, a simple gate—“must parse, must stay under budget, must be safe”—prevents entire classes of offline wins that fail the moment traffic spikes.
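
A gate in this sense is just a pass/fail check run before any quality scoring. A minimal sketch, with thresholds that are placeholders to be set from the serving budget:

```python
import json

def passes_gates(output_text: str, latency_s: float, output_tokens: int,
                 safety_flagged: bool, required_keys: set,
                 latency_budget_s: float = 2.0, max_output_tokens: int = 512) -> bool:
    """Hard feasibility gate: pass/fail only, never traded off against quality."""
    try:
        parsed = json.loads(output_text)                    # must parse
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict) or not required_keys.issubset(parsed):
        return False                                        # must match the expected keys
    if safety_flagged:                                      # must pass safety checks
        return False
    if latency_s > latency_budget_s:                        # must meet the latency budget
        return False
    if output_tokens > max_output_tokens:                   # must stay within the cost cap
        return False
    return True
```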

Variance isn’t actively controlled. Under noise, how you evaluate matters. Paired comparisons (competing prompts run on the same examples in the same round), stratified batches that mirror production, fixed decoding and retrieval settings, and reporting P95/P99 latency—all of these reduce variance at the same budget. DSPy leaves most of this to user discipline. Without it, you need more samples to trust small differences, or you risk selecting a noisy winner that won’t hold up.
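
A small sketch of a paired comparison, assuming a score_fn that runs one prompt on one example (with decoding and retrieval settings held fixed) and returns a scalar:

```python
import random
import statistics

def paired_eval(score_fn, prompt_a, prompt_b, examples, batch_size=50, seed=0):
    """Both prompts see the same sampled batch, so per-example difficulty
    cancels out of the score difference; the fixed seed keeps rounds comparable."""
    rng = random.Random(seed)
    batch = rng.sample(examples, min(batch_size, len(examples)))
    diffs = [score_fn(prompt_a, ex) - score_fn(prompt_b, ex) for ex in batch]
    mean = statistics.mean(diffs)
    se = statistics.stdev(diffs) / len(diffs) ** 0.5 if len(diffs) > 1 else 0.0
    return mean, se                           # gap between prompts and its standard error
```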

The prompt isn’t fully structured. DSPy’s optimizers treat instruction text and demonstrations as the main knobs. In practice, other components carry as much weight: explicit constraints (“return valid JSON matching this schema”), reasoning scaffolds (“check each field before emitting”), and fallback behavior. When those parts are first‑class, the search space can be explored more systematically; when they’re fused into one blob of text, each change affects several behaviors at once, and the learning signal is muddy.
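
One way to make those parts first-class is a small structured representation, so an edit touches one component and its effect stays attributable. The component names here are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptSpec:
    """A prompt as separate components rather than one blob of text."""
    instruction: str
    constraints: tuple = ()    # e.g. "return valid JSON matching this schema"
    scaffold: str = ""         # e.g. "check each field before emitting"
    schema: str = ""           # typically held fixed across candidates
    demos: tuple = ()
    fallback: str = ""         # what to do when the input is out of scope

    def render(self) -> str:
        parts = [self.instruction, *self.constraints, self.scaffold,
                 self.schema, self.fallback, *self.demos]
        return "\n\n".join(p for p in parts if p)
```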

Generation drifts toward paraphrase space. Letting an LLM propose instruction variants pulls exploration toward “natural” wordings. That’s fine when the natural style overlaps the optimum, but many robust wins are a bit unnatural (explicit schemas, checklists, terse enumerations). Without novelty checks and explicit component toggles, proposals can converge prematurely on a narrow style and burn budget on near‑duplicates.
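
A cheap novelty check is often enough to stop paying twice for near-duplicate wordings. A sketch using character-level similarity; the 0.9 threshold is an assumption to tune, and an embedding distance would serve the same role:

```python
from difflib import SequenceMatcher

def is_novel(candidate: str, accepted: list, threshold: float = 0.9) -> bool:
    """Reject proposals that are close rewordings of ones already queued."""
    return all(SequenceMatcher(None, candidate, prev).ratio() < threshold
               for prev in accepted)

kept = []
for text in ["Summarize the ticket.",
             "Summarize this ticket.",           # near-duplicate, gets dropped
             "Return a JSON object with keys summary and severity."]:
    if is_novel(text, kept):
        kept.append(text)
print(kept)
```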

Governance is an afterthought. A sealed holdout for the final decision, simple novelty filters (to avoid paying twice for look‑alikes), and pairwise judging with dueling‑bandit selection when preferences—not scalar scores—drive evaluation: these aren’t central features. They matter when decisions are high‑stakes or budgets are tight.

None of these issues imply DSPy is “wrong.” They indicate a mismatch between its hyperparameter‑like mental model and the realities of program‑like prompts under a fixed budget.

An Alternative Framing—and How It Changes the Design

If we treat the problem as budgeted best‑arm identification under constraints, several design choices follow naturally.

Start with a structured prompt: separate instruction, constraints, reasoning scaffold, schema, and demonstrations. Hold schema and safety blocks fixed across most candidates so feasibility failures are rare. Generate diversity by bounded edits to the other parts and allow occasional coupled edits (e.g., instruction+schema together) to capture interactions. This makes gains attributable and lets you reuse pieces that work.
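
A sketch of that kind of bounded, component-wise candidate generation, with the schema held fixed; the component libraries are placeholders for a real task:

```python
import itertools

FIXED_SCHEMA = "Return JSON with exactly these keys: summary, severity."
INSTRUCTIONS = [
    "Summarize the support ticket and assign a severity.",
    "Triage the ticket: one-sentence summary, then a severity label.",
]
SCAFFOLDS = ["", "Check each field against the schema before emitting."]
DEMO_SETS = [(), ("<demo 1>",), ("<demo 1>", "<demo 2>")]

def candidates():
    """Enumerate bounded edits; the cross product includes coupled edits
    (a new instruction together with a new scaffold) by construction, and
    occasional instruction+schema couplings can be appended as extra entries."""
    for instruction, scaffold, demos in itertools.product(INSTRUCTIONS, SCAFFOLDS, DEMO_SETS):
        yield {"instruction": instruction, "scaffold": scaffold,
               "demos": demos, "schema": FIXED_SCHEMA}
```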

Allocate evaluation with a racing schedule (successive halving or a harmonic successive‑rejects variant). Early rounds skim broadly on small, paired, stratified batches; each later round gives more depth to fewer arms. Between rounds, add a handful of surrogate‑guided candidates if features correlate with outcomes; otherwise, stay with structured edits. Throughout, apply constraint gates first (parseability, token budgets, safety, gross latency), so you never invest depth in infeasible arms.
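
A compact sketch of that racing loop. It assumes a score_fn (prompt, example) -> float, a gate_fn (prompt, example) -> bool like the one above, and a plain keep-the-top-half rule; a harmonic successive-rejects variant would change only the round sizes:

```python
import random

def successive_halving(arms, score_fn, gate_fn, examples, budget, seed=0):
    """Gate first, then spend the evaluation budget in rounds, keeping the
    top half of survivors each time; every arm in a round sees the same batch."""
    rng = random.Random(seed)
    alive = list(range(len(arms)))
    score_sum = [0.0] * len(arms)
    n_evals = [0] * len(arms)
    rounds = max(1, (len(arms) - 1).bit_length())             # roughly log2(#arms)
    per_round = budget // rounds
    for _ in range(rounds):
        if len(alive) <= 1:
            break
        k = max(1, per_round // len(alive))
        batch = rng.sample(examples, min(k, len(examples)))   # paired across arms
        for i in alive:
            for ex in batch:
                if not gate_fn(arms[i], ex):                  # infeasible arms never get depth
                    score_sum[i] = float("-inf")
                    break
                score_sum[i] += score_fn(arms[i], ex)
                n_evals[i] += 1
        alive.sort(key=lambda i: score_sum[i] / max(n_evals[i], 1), reverse=True)
        alive = alive[: max(1, len(alive) // 2)]
    return arms[alive[0]]
```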

When finalists are close, bandit‑style allocation among survivors (e.g., Thompson sampling or LUCB rules) directs the next evaluations where uncertainty can change the decision. If evaluations are pairwise (LLM‑as‑judge or human preferences), use a dueling‑bandit model rather than forcing preferences into a scalar.
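
For the scalar-score case, here is a minimal Thompson-sampling sketch over the survivors with 0/1 outcomes; eval_fn stands in for one more gated, paired evaluation of a single arm, and a dueling-bandit variant would replace it with pairwise judgments:

```python
import random

def thompson_allocate(finalists, eval_fn, extra_budget, seed=0):
    """Each step, sample a win rate from every arm's Beta posterior and spend
    the next evaluation on the arm whose sample is highest, so effort flows to
    the finalists whose ordering is still uncertain."""
    rng = random.Random(seed)
    wins = [1] * len(finalists)      # Beta(1, 1) priors
    losses = [1] * len(finalists)
    for _ in range(extra_budget):
        draws = [rng.betavariate(wins[i], losses[i]) for i in range(len(finalists))]
        i = max(range(len(finalists)), key=lambda j: draws[j])
        if eval_fn(finalists[i]):    # one more 0/1 outcome for that arm
            wins[i] += 1
        else:
            losses[i] += 1
    best = max(range(len(finalists)), key=lambda i: wins[i] / (wins[i] + losses[i]))
    return finalists[best]
```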

Finally, mine failures and convert them into counter‑examples or explicit constraints. That turns one‑off surprises into regression tests and improves transfer to production.

The point of this framing isn’t to worship a particular algorithm; it is to commit to two budgets explicitly—serving and optimization—and to make the search, evaluation, and governance choices that preserve them.

What This Means for DSPy

DSPy’s programming model is a good base. The critique is about defaults and missing affordances, not its core idea.

  • Give budget a seat at the table: a selector that accepts a strict evaluation cap and implements a clear allocation policy (halving/rejects, then survivor bandits).
  • Treat constraints as gates, not weights: simple pass/fail checks for parsing, token budgets (prompt and output), P95/P99 latency, and safety flags, callable before deeper evaluation.
  • Bake in variance control: paired batches per round, stratified sampling plans, fixed decoding and retrieval settings, interval estimates by default.
  • Make the full prompt structure first‑class: constraints and reasoning scaffolds alongside instructions and demonstrations, with small utilities for ablations and coupled edits.
  • Add light governance: novelty filters, pairwise judges with dueling‑bandit selection, and a sealed holdout pattern for the once‑only final check.

Adopting these changes doesn’t fight DSPy’s design; it sharpens it. You keep the ergonomics that make teams productive, while aligning the optimization layer with the realities of fixed budgets and production constraints.
