Greedy is Good. Less Greedy May Be Better.

Forward stepwise regression, agglomerative hierarchical clustering, and CART rely on a simple principle: make the best local choice at each step. Greedy choices can even be globally optimal when a problem has the greedy-choice property (globally optimal solutions can be reached through locally optimal decisions), as in minimum spanning trees, activity selection, Huffman coding, and Dijkstra’s algorithm with non-negative weights. We often accept greedy algorithms because they are fast and “good enough,” but “good enough” is rarely quantified. One way to formalize when it pays to be less greedy is with a simple model.

Let $\mathcal{I}_t$ denote the information structure available at step $t$. The expected payoff from reducing greed is

$\mathbb{E}[\text{Value}_t \mid \mathcal{I}_t] \approx \frac{\text{Leverage}_t \times \text{Uncertainty}_t(\mathcal{I}_t)}{\text{Recoverability}_t(\mathcal{I}_t)} - \text{Cost}_t(\mathcal{I}_t)$

where:

  • Leverage: expected impact on final solution quality if the current choice is wrong (conditional on error)
  • Uncertainty: degree to which the ordering among competing options is unresolved given current evidence (e.g., misranking probability or rank instability under resampling)
  • Recoverability: ease of reversing or compensating for a wrong choice (higher means easier to undo)
  • Cost: effort or computation needed to evaluate and compare alternatives. This term governs the viability of less-greedy tactics: when the future impact of a choice can be cheaply approximated or bounded (via branch-and-bound, admissible heuristics, or efficient surrogates), more exhaustive search is affordable. Without such shortcuts, the cost of reducing greed may outweigh its benefit.
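
To make the model concrete, here is a minimal sketch of it as a scoring rule. This is illustrative only: the four inputs are problem-specific estimates that you must supply (the model does not tell you how to measure them), and the numbers below are hypothetical.

```python
def expected_value_of_less_greed(leverage, uncertainty, recoverability, cost):
    """Stylized value model: positive values suggest that spending extra
    effort at this step (wider search, more validation) is worth it."""
    return leverage * uncertainty / recoverability - cost

# Hypothetical numbers: an early, hard-to-undo choice with an unstable ranking.
value = expected_value_of_less_greed(
    leverage=0.10,       # expected quality loss if this choice is wrong
    uncertainty=0.40,    # estimated probability of misranking the candidates
    recoverability=0.20, # later steps rarely compensate for a bad commit here
    cost=0.05,           # relative cost of evaluating extra alternatives
)
print(value > 0)  # True: worth being less greedy at this step
```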

Information Structure

Different problems offer different amounts of information signal. In continuous optimization, the decision state often includes strong signals—gradients, curvature, constraint structure—that enable effective refinements. In discrete greedy algorithms, the state is often summarized only by noisy or discontinuous metrics, limiting the precision of refinements. Where informational richness is high, more sophisticated, less greedy strategies are likely to pay off. Gradient descent (GD) is the clearest example: gradients provide directional guidance, curvature information enables second-order methods, historical trajectories allow momentum-based smoothing, and so on. This rich information landscape explains why refinements of GD are so highly developed compared to discrete algorithms like hierarchical clustering or stepwise regression, which have access only to distance metrics or statistical significance measures.

Heuristic

In many problems, e.g., forward stepwise regression, agglomerative clustering, and decision trees, early decisions carry the highest opportunity costs because they constrain all future choices. Recognizing when opportunity costs are highest enables refinements that improve solution quality even under a fixed computational budget. For instance, adaptive beam sizes that start wide and narrow over time, more rigorous validation for early decisions in tree building, and extra computational investment in initialization procedures may all prove useful. (In gradient descent, the analogous practice is a declining learning-rate schedule: large exploratory steps early, careful refinement later.)
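
As a minimal illustration of the wide-then-narrow idea, here is one possible beam-width schedule; the geometric halving and the starting width of 16 are arbitrary choices, not recommendations.

```python
def beam_width(depth, initial_width=16, min_width=1):
    """Search widely at early, high-leverage depths; narrow geometrically
    as later decisions become cheaper to get wrong."""
    return max(min_width, initial_width // (2 ** depth))

print([beam_width(d) for d in range(6)])  # [16, 8, 4, 2, 1, 1]
```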

Estimating the Terms in Practice

CART

  • Leverage: For a candidate split, grow the remainder of the tree greedily and evaluate out-of-fold risk; repeat for the top $k$ alternative splits at the same node. Leverage ≈ best-alternative risk improvement − chosen-split improvement
  • Uncertainty: Bootstrap the node's data; compute rank stability (e.g., Kendall–$\tau$) of the top-$k$ splits by out-of-fold gain. Low stability ⇒ high uncertainty
  • Recoverability: Estimate how often later nodes can emulate the foregone split (e.g., frequency with which a split or its strong surrogates reappear downstream). Higher surrogate correlation ⇒ higher recoverability
  • Cost: Wall-clock per additional split evaluation; presence of cheap proxies (histogram pre-binning, cached statistics, admissible bounds for pruning) lowers Cost

Problem-Specific Information Signals:

    • Margin $\Delta$ between the best and second-best splits' out-of-fold gains (larger $\Delta$ ⇒ clearer guidance)
    • Consistency of impurity gradients across resamples (stable direction ⇒ clearer guidance)
    • Availability of credible side information (e.g., monotonic constraints, known feature orderings, or policy rules) that can prune candidates or shape scoring
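
A minimal sketch of the bootstrap rank-stability check for the Uncertainty term. Here `split_gain` is a stand-in for whatever out-of-fold gain computation your tree code exposes, and `candidate_splits` is the list of competing splits at the node; both are assumptions, not a fixed API.

```python
import numpy as np
from scipy.stats import kendalltau

def split_rank_stability(X, y, candidate_splits, split_gain, n_boot=50, seed=0):
    """Bootstrap the node's data and measure how stable the ranking of
    candidate splits is. Mean Kendall tau near 1 means the greedy choice
    is well supported; near 0 means the ranking is essentially noise."""
    rng = np.random.default_rng(seed)
    base_gains = [split_gain(X, y, s) for s in candidate_splits]
    taus = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))  # resample the node's rows
        boot_gains = [split_gain(X[idx], y[idx], s) for s in candidate_splits]
        taus.append(kendalltau(base_gains, boot_gains)[0])
    return float(np.mean(taus))
```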

SGD/continuous optimization (structure-rich)

  • Leverage: Proxied by local objective curvature and distance to stationarity; early in training, larger step-wise effects imply high leverage
  • Uncertainty: Mini-batch gradient variance vs. mean norm (gradient SNR); high variance implies greater risk of misdirected steps
  • Recoverability: With strong convexity or PL conditions, mistakes are easier to correct (higher recoverability); in ill-conditioned regions, they are harder to correct
  • Cost: Extra backprops, Hessian-vector products, or line-search evaluations required by the refinement

Problem-Specific Information Signals:

    • Curvature (Hessian spectrum/conditioning); well-conditioned curvature enables Newton/Quasi-Newton preconditioning
    • Trajectory history (momentum/Nesterov) that predicts future iterates and supports lookahead
    • Local smoothness/Lipschitz/PL constants that make line-search and trust-region steps informative
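
One cheap way to estimate the Uncertainty term here is a mini-batch gradient signal-to-noise ratio. The sketch below assumes a `grad_fn(batch)` helper that returns a flattened gradient vector for one mini-batch; how you obtain that is framework-specific.

```python
import numpy as np

def gradient_snr(grad_fn, batches):
    """Signal-to-noise of the stochastic gradient: norm of the mean
    gradient divided by the average deviation of per-batch gradients
    around that mean. Low values suggest steps are likely misdirected."""
    grads = np.stack([grad_fn(b) for b in batches])
    mean = grads.mean(axis=0)
    noise = np.linalg.norm(grads - mean, axis=1).mean()
    return float(np.linalg.norm(mean) / (noise + 1e-12))

# Low SNR: average more batches, shrink the step size, or use variance
# reduction before "committing" to large parameter moves.
```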

Given these principles, there are four broad ways to be strategically less greedy.

Lower evaluation costs (get more signal per unit compute; save on space)

  1. Multi-fidelity evaluation (cheap proxies first).
    What? Score candidates at low fidelity (subsampled data, fewer features, shorter horizons, early stopping, histogram split gains), then refine only the top tier.
    Why? Cuts cost while preserving enough ranking signal to shrink uncertainty early.
    Examples: GBDT uses histogram bins to approximate split gains, computing exact gains only for finalists; RL/planning uses short-horizon rollouts to screen actions; clustering screens centers in a 20-D PCA space before full-space refinement. (A minimal screening sketch appears after this list.)
  2. Surrogate modeling / Bayesian optimization (predict downstream value)
    What? Learn a cheap model mapping candidate descriptors to the predicted final objective; pick the next evaluations via an acquisition rule (expected improvement, UCB, Thompson sampling).
    Why? Replaces many expensive evaluations (keeps the cost down) and steers trials toward high expected payoff (leverage per evaluation up) using model uncertainty (uses uncertainty explicitly).
    Examples: Predict a tree’s cross-validated loss from local split stats (gain, depth, counts) and test only top-predicted splits; tune GBM (learning rate, depth) with BO instead of grid.
  3. Decomposition and caching (evaluate locally, reconcile globally)
    What? Work in blocks, with coresets or cached statistics, so each candidate evaluation is cheap.
    Why? Lowers cost and can reduce estimator variance (Uncertainty down) via reuse.
    Examples: Coordinate descent for Lasso; coresets for k-means/agglomerative clustering; histogram or quantile sketches cached per feature for tree splits.
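
A minimal sketch of the multi-fidelity screening in item 1. `cheap_score` and `exact_score` are assumed scoring callables (higher is better); the 10% subsample and top-5 cutoff are arbitrary.

```python
import numpy as np

def multi_fidelity_screen(candidates, cheap_score, exact_score, X, y,
                          frac=0.1, keep=5, seed=0):
    """Score all candidates on a small subsample (low fidelity), then
    compute exact scores only for the top survivors."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=max(1, int(frac * len(y))), replace=False)
    shortlist = sorted(candidates,
                       key=lambda c: cheap_score(X[idx], y[idx], c),
                       reverse=True)[:keep]
    return max(shortlist, key=lambda c: exact_score(X, y, c))
```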

Allocate compute where the payoff is high (respect opportunity cost)

  1. Bandit-style allocation of compute (Successive Halving/Hyperband)
    What? Start many candidates with tiny budgets; progressively promote the best to larger budgets using successive halving or Hyperband schedules.
    Why? Implements opportunity cost directly: allocate compute where the product of the stakes and the risk of being wrong is largest. Heuristically, prioritize candidates with larger estimated $\text{Leverage}_t \times (1-\pi_t)$ per unit cost, where $\pi_t$ is the probability that the candidate's current ranking is correct; halving discards low-payoff options early, concentrating resources where payoff is high under uncertainty. This both lowers Cost and focuses on high-leverage options.
    Examples: For the top 50 CART root splits: 5% data → keep 10 → 20% → keep 3 → 100% → commit. For k-means initializations: run five iterations on 30 starts, continue the best 5 to convergence. For hyperparameter tuning: allocate more epochs/steps only to configurations with strong early validation. (A minimal successive-halving sketch follows this list.)
  2. Curriculum / coarse-to-fine search
    What? Explore broadly and shallowly when early steps have high leverage; progressively narrow and deepen as evidence accumulates.
    Why? Mirrors the value model: high early leverage justifies wider beams and more checks; later uncertainty drops, so narrow and save cost.
    Examples: Hyperparameter optimization: coarse log-grid then BO within the best region; clustering: build on coarse features or low-resolution data, then refine in full detail.
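
A minimal successive-halving sketch for item 1. `evaluate(candidate, budget)` is an assumed callable returning a quality score (higher is better) at a fractional budget; the budget ladder and keep fraction are arbitrary placeholders.

```python
def successive_halving(candidates, evaluate, budgets=(0.05, 0.2, 1.0),
                       keep_frac=0.3):
    """Evaluate every candidate at a tiny budget, promote the top fraction
    to the next budget, and commit to whatever survives the full budget."""
    pool = list(candidates)
    for budget in budgets:
        pool.sort(key=lambda c: evaluate(c, budget), reverse=True)
        pool = pool[:max(1, int(len(pool) * keep_frac))]
    return pool[0]
```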

Increase recoverability / defer irreversible choices

  1. Keep more options open. Instead of committing to the single best choice at each step, preserve multiple promising alternatives. Beam search does exactly this—maintaining the top-k solutions rather than just the top one. Preserving options allows us to delay some decisions till we have more information.
    What? Track the top-k candidates rather than one; adapt k by step importance and rank stability.
    Why? Raises effective Recoverability by postponing hard commits until Uncertainty falls; protects against early high-leverage mistakes.
    Examples: Retain top-k root splits and grow shallow subtrees for each before committing; in clustering, maintain multiple centroid hypotheses for early iterations. (A minimal beam sketch appears after this list.)
  2. Lookahead with pruning before choosing (bounded foresight). Rather than optimizing for gains over the next step, evaluate how choices play out over multiple steps. This is like chess players who consider not just their next move, but several moves deep. Two strategies are often used to make the process more efficient: reusing solutions to overlapping subproblems and pruning unpromising paths early. The first is the foundation of dynamic programming, while the second is central to algorithms like branch and bound, which combine lookahead with smart elimination of bad options. Looking ahead can be especially valuable when evaluating early consequential decisions like initializations. See, for instance, Lookahead K-Means Initialization.
    What? Evaluate multi-step consequences for a few layers using cached or bounded estimates; prune subtrees whose upper bounds are inferior.
    Why? Converts potential high-leverage early choices into informed commits while containing Cost.
    Examples: Branch-and-bound style tree growth with admissible bounds on validation gain; shallow simulation of alternative k-means assignments before updating centers.
  3. Backtracking and local repair after commit
    What? Allow reversible edits—prune or graft subtrees; re-split high-traffic nodes; swap medoids or reassign small point sets.
    Why? Directly increases Recoverability; mitigates regret from early errors when full global search is infeasible.
    Examples: CART with cost-complexity pruning and post-hoc grafting; CLARANS-style medoid swaps; floating forward-backward feature selection.
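
A minimal beam sketch for item 1. `expand(state)` (yielding successor partial solutions) and `score(state)` (an out-of-fold quality estimate) are assumed callables; the beam width of 5 is arbitrary.

```python
def beam_search(roots, expand, score, depth, k=5):
    """Keep the k best partial solutions at each level instead of
    committing to the single best; commit only after they have competed."""
    beam = sorted(roots, key=score, reverse=True)[:k]
    for _ in range(depth):
        successors = [nxt for state in beam for nxt in expand(state)]
        if not successors:
            break
        beam = sorted(successors, key=score, reverse=True)[:k]
    return beam[0]
```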

Reduce estimator uncertainty (make local scores more reliable)

  1. Margin-based commit rules (defer commitment under ambiguity)
    What? Only commit when the top candidate’s margin over the runner-up exceeds a threshold; otherwise, gather more evidence or keep multiple options.
    Why? Converts Uncertainty into explicit action: defer or widen the beam when misranking risk is high; commit when risk is low.
    Examples: In trees, require an out-of-fold gain margin at high-impact nodes (root/level-2); otherwise, evaluate more data or retain top-k splits. (A minimal commit-rule sketch appears after this list.)
  2. Make better local decisions. Even within greedy frameworks, we can improve individual choices. Cross-validation replaces training error with more robust validation estimates. Regularization incorporates prior knowledge to avoid overfitting. Bagging averages across multiple random samples to reduce variance in each decision.
    What? Replace in-sample scores with out-of-fold or sample-split estimates; use nested CV or one-standard-error rules when deltas are tiny.
    Why? Lowers Uncertainty in local rankings, preventing miscommits at high-leverage steps.
    Examples: Honest trees (separate split/estimate samples); AICc or cross-validated risk in stepwise.
  3. Restarts and diversified ensembling (variance hedging). Run the greedy algorithm multiple times with different starting conditions, then ensemble the results. See, for example, work on Bayesian stacking.
    What? Run multiple starts under different seeds or subspaces and either pick the best-of-N or combine.
    Why? Averages over outcome variance (Uncertainty down) and provides soft Recoverability by mixing paths.
    Examples: Many k-means inits, pick best inertia; random-subspace forests; stacking stepwise models.
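
A minimal commit-rule sketch for item 1. `gains` maps each candidate to its estimated out-of-fold gain; the margin and top-k values are placeholders to be tuned per problem.

```python
def commit_or_defer(gains, margin=0.02, top_k=3):
    """Commit to the best candidate only when it beats the runner-up by
    `margin`; otherwise return the top-k so the caller can gather more
    evidence or widen the beam."""
    ranked = sorted(gains, key=gains.get, reverse=True)
    if len(ranked) == 1 or gains[ranked[0]] - gains[ranked[1]] >= margin:
        return ranked[:1]   # clear winner: commit
    return ranked[:top_k]   # ambiguous: defer, keep options open
```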

Other ideas

  1. Targeted evidence acquisition (data-side intervention)
    What? Collect or construct information where it matters (active labeling, targeted augmentation, engineered ratios, or monotone transforms).
    Why? Improves the information structure at high-leverage steps, lowering Uncertainty and tightening leverage estimates; can enable stronger pruning (Cost down).
    Examples: If two root splits trade places across bootstraps, label points in the feature bands that distinguish them; in vision, augment only classes confused near the current boundary.
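
For the tree example above, a minimal sketch of where to spend labels: the points that two competing root splits route differently are exactly where new evidence discriminates between them. The (feature index, threshold) representation of a split is an assumption for illustration.

```python
import numpy as np

def disagreement_pool(X, split_a, split_b):
    """Indices of points that two competing axis-aligned splits route
    differently; labeling or augmenting these first sharpens the comparison."""
    (fa, ta), (fb, tb) = split_a, split_b
    return np.where((X[:, fa] <= ta) != (X[:, fb] <= tb))[0]
```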

Matrix Completion

To make deliberate progress, we need more than isolated tricks—we need a framework. Here, we started by enumerating strategies for being less greedy. Listing the greedy methods, e.g., forward stepwise regression (FSR), k-means, and CART, along the other axis gives a design space. From there, we can build a matrix that pairs methods with strategies. Some combinations have already been tried: Beam search for CART, Beam search for feature selection, Lookahead SGD, Lookahead Decision Forests, etc. Others have been briefly explored, e.g., Beam search for FSR, Bagged FSR, Lookahead K-Means Initialization, Node Level Stability Selection Using CV CART, Beam-GD. The rest remain unexplored.
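
One illustrative fragment of such a matrix, using only the combinations named above (a blank cell means we know of no attempt, not that none exists):

| Method | Beam search | Lookahead | Bagging / restarts | Stability selection |
|---|---|---|---|---|
| CART | tried | | | briefly explored (node-level, CV) |
| Feature selection / FSR | tried / briefly explored | | briefly explored (Bagged FSR) | |
| k-means | | briefly explored (initialization) | | |
| Decision forests | | tried (Lookahead Decision Forests) | | |
| GD / SGD | briefly explored (Beam-GD) | tried (Lookahead SGD) | | |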

The empty cells point to two kinds of opportunity: where we can recombine known parts, and where we might need new ideas entirely. This isn’t just a bookkeeping exercise. It’s a way of contributing to science and engineering—by mapping what’s been done, seeing what hasn’t, and methodically filling in the gaps. Progress, in this view, comes not just from invention but also from recombination and completion.
