Greedy is Good. Less Greedy May be Better.
Forward stepwise regression, agglomerative hierarchical clustering, and CART rely on a simple principle: make the best local choice at each step. Greedy choices can also be optimal when problems possess the greedy choice property—where globally optimal solutions can be reached through locally optimal decisions, as in minimum spanning trees or activity selection. However, in many statistical and machine learning contexts, myopic choices create path dependency: early decisions permanently constrain future options, potentially excluding globally superior solutions.
We often opt for greedy algorithms because they are speedy, and the solutions are good enough. But there's an entire spectrum of trade-offs between speed and solution quality. When we need better solutions, we can be strategically less greedy using four key approaches:
- Keep more options open. Instead of committing to the single best choice at each step, preserve multiple promising alternatives. Beam search does exactly this, maintaining the top-k candidates rather than just the top one. Preserving options lets us delay some decisions until we have more information (see the first sketch after this list).
- Look ahead before choosing. Rather than optimizing for the gain over the next step, evaluate how choices play out over multiple steps, the way chess players consider not just their next move but several moves ahead. Two strategies are often used to make the process more efficient: reusing solutions to overlapping subproblems and pruning unpromising paths early. The first is the foundation of dynamic programming; the second is central to algorithms like branch and bound, which combine lookahead with smart elimination of bad options. Looking ahead can be especially valuable for early, consequential decisions such as initialization. See, for instance, Lookahead K-Means Initialization, and the second sketch after this list.
- Make better local decisions. Even within greedy frameworks, we can improve individual choices. Cross-validation replaces training error with more robust validation estimates. Regularization incorporates prior knowledge to avoid overfitting. Bagging averages across multiple random samples to reduce the variance of each decision. (The third sketch after this list shows cross-validated forward selection.)
- Combine multiple greedy attempts. Run the greedy algorithm multiple times with different starting conditions, then ensemble the results (fourth sketch below).
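To make the first strategy concrete, here is a minimal beam-search skeleton. The `children` expansion rule and `score` function are toy stand-ins of my own, not anything from the post or a specific library; with `beam_width=1` the routine reduces to plain greedy search.

```python
def beam_search(start, children, score, beam_width=3, depth=5):
    """Keep the top `beam_width` partial solutions at each step
    instead of committing to the single greedy best."""
    beam, best = [start], start
    for _ in range(depth):
        # Expand every surviving candidate, then keep only the best few.
        candidates = {c for state in beam for c in children(state)}
        if not candidates:
            break
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
        best = max([best] + beam, key=score)
    return best

# Toy problem: grow a subset of items, rewarding sum but penalizing size.
items = [4, 7, 1, 9, 3]
children = lambda s: [s | {i} for i in items if i not in s]
score = lambda s: sum(s) - 0.5 * len(s) ** 2
print(beam_search(frozenset(), children, score))  # best subset found: {9, 7, 4}
```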
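The same toy interface can illustrate the second strategy. This k-step lookahead is deliberately naive: memoizing `lookahead_value` over repeated states is exactly the dynamic-programming move, and discarding branches whose bound cannot beat the incumbent is branch and bound.

```python
def lookahead_value(state, children, score, k):
    """Best score reachable from `state` within k further expansions."""
    if k == 0:
        return score(state)
    vals = [lookahead_value(c, children, score, k - 1) for c in children(state)]
    return max([score(state)] + vals)

def lookahead_greedy(start, children, score, k=2, steps=3):
    state = start
    for _ in range(steps):
        succ = children(state)
        if not succ:
            break
        # Rank moves by where they can lead, not by their immediate gain.
        state = max(succ, key=lambda c: lookahead_value(c, children, score, k - 1))
    return state

items = [4, 7, 1, 9, 3]
children = lambda s: [s | {i} for i in items if i not in s]
score = lambda s: sum(s) - 0.5 * len(s) ** 2
print(lookahead_greedy(frozenset(), children, score))  # finds {9, 7, 4}
```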
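For the third strategy, a sketch of forward stepwise selection in which each step is scored by cross-validated R-squared rather than training fit. The data are synthetic; scikit-learn's `cross_val_score` does the validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] + 0.5 * X[:, 3] + rng.normal(size=200)  # only features 0 and 3 matter

selected, remaining = [], list(range(X.shape[1]))
for _ in range(3):
    def cv_r2(j):
        # 5-fold cross-validated R^2 for the current set plus candidate j:
        # a more robust local decision than in-sample error.
        return cross_val_score(LinearRegression(), X[:, selected + [j]], y, cv=5).mean()
    best = max(remaining, key=cv_r2)
    selected.append(best)
    remaining.remove(best)
print("chosen features:", selected)  # 0 and 3 should come first; the third pick is noise
```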
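And for the fourth, a sketch of combining greedy runs: k-means from ten random initializations, keeping the run with the lowest inertia. This is what scikit-learn's `n_init` parameter automates; the data are again synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in (0, 5, 10)])

# Each run is one greedy (Lloyd's algorithm) attempt from a random start.
runs = [
    KMeans(n_clusters=3, init="random", n_init=1, random_state=s).fit(X)
    for s in range(10)
]
best = min(runs, key=lambda km: km.inertia_)
print(f"best inertia over 10 restarts: {best.inertia_:.1f}")
```

Keeping the single best run is the simplest way to combine attempts; averaging co-assignments across all ten runs, as in consensus clustering, is the richer ensemble.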
There are two principles that underlie these approaches. First, the value of reducing greed is greater when the opportunity costs are greater. For many problems, early decisions carry the highest opportunity costs because they constrain all subsequent choices. Recognizing where opportunity costs peak enables refinements that improve solution quality even under a fixed computational budget: adaptive beam sizes that start wide and narrow over time (a toy schedule is sketched below), more rigorous validation for early decisions in tree building, and extra computational investment in initialization procedures may all prove useful. (*In gradient descent, a declining learning-rate schedule is already common because of the structure of the problem, but the method may still benefit from investing more in the first few iterations.)
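As a toy illustration of front-loading the budget, here is one possible beam-width schedule (the geometric decay is an arbitrary choice of mine) that could drop into the `beam_search` sketch above:

```python
def adaptive_widths(depth, start=16, end=2):
    """Beam widths that start wide, when decisions are most consequential,
    and narrow geometrically as the search deepens."""
    ratio = (end / start) ** (1 / max(depth - 1, 1))
    return [max(round(start * ratio**t), end) for t in range(depth)]

print(adaptive_widths(6))  # [16, 11, 7, 5, 3, 2]
```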
Second, different optimization methods provide different opportunities for reducing greed. Gradient descent exemplifies this principle. All four less-greedy strategies appear among GD variants: Nesterov momentum looks ahead by evaluating gradients at predicted positions (sketched below), Adam makes better local decisions using adaptive learning rates, multiple random restarts combine different attempts, and population-based methods keep multiple trajectories open. But GD can also leverage problem-specific information unavailable in many other settings: gradients provide directional guidance, curvature information enables second-order methods, historical trajectories allow momentum-based smoothing, and so on. This rich information landscape helps explain why GD refinements are so highly developed compared to discrete algorithms like hierarchical clustering or stepwise regression, which have access only to distance metrics or statistical significance measures.
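A minimal sketch of the lookahead idea inside Nesterov momentum: the gradient is evaluated at the position momentum is about to carry us to, not at the current one. The quadratic objective and step sizes here are illustrative choices of mine.

```python
import numpy as np

def nesterov_gd(grad, x0, lr=0.02, momentum=0.9, steps=200):
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        lookahead = x + momentum * v             # peek at where momentum is taking us
        v = momentum * v - lr * grad(lookahead)  # correct course using that peek
        x = x + v
    return x

# Minimize f(x) = x'Ax/2 for an ill-conditioned A.
A = np.diag([1.0, 50.0])
print(nesterov_gd(lambda x: A @ x, x0=[5.0, 5.0]))  # converges to ~[0, 0]
```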
Matrix Completion
To make deliberate progress, we need more than isolated tricks; we need a framework. Here, we started by laying out strategies for being less greedy. If we follow that by listing the greedy methods (forward stepwise regression, k-means, CART, etc.), we get a design space. From there, we can build a matrix that pairs methods with strategies. Some combinations have already been tried: Beam search for CART, Beam search for feature selection, Lookahead SGD, Lookahead Decision Forests, etc. Others have been briefly explored, e.g., Beam search for FSR, Bagged FSR, Lookahead K-Means Initialization, Node Level Stability Selection Using CV CART, Beam search for Agglomerative Clustering, Beam-GD. The rest remain unexplored. One slice of the matrix is sketched below.
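Here is one possible slice of such a matrix, filled in only with the pairings named above plus standard practice; blank cells are the open territory:

| Method | Keep options open | Look ahead | Better local decisions | Combine attempts |
|---|---|---|---|---|
| Forward stepwise regression | Beam search for FSR | | cross-validated step scoring | Bagged FSR |
| K-means | | Lookahead K-Means Initialization | | multiple random restarts |
| CART | Beam search for CART | Lookahead Decision Forests | Node Level Stability Selection Using CV | bagged trees |
| Agglomerative clustering | Beam search for Agglomerative Clustering | | | |
| Gradient descent | Beam-GD, population-based methods | Lookahead SGD, Nesterov momentum | Adam (adaptive learning rates) | random restarts |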
The empty cells point to two kinds of opportunity: where we can recombine known parts, and where we might need new ideas entirely. This isn’t just a bookkeeping exercise. It’s a way of contributing to science and engineering—by mapping what’s been done, seeing what hasn’t, and methodically filling in the gaps. Progress, in this view, comes not just from invention but also from recombination and completion.