Beam-GD
Gradient descent commits to a single direction at each step based on the local gradient. This myopic approach can be suboptimal when gradients are noisy, local geometry is misleading, or the loss landscape has multiple competing descent directions. The algorithm makes irrevocable decisions based on local information, potentially missing better optimization trajectories.
We apply beam search principles to numerical optimization. Instead of following a single trajectory, we maintain multiple parameter sets and explore different gradient directions simultaneously, retaining only the top performers.
Algorithm
Given loss function $L(\theta)$ and beam width $k$, maintain parameter sets $\{\theta_t^{(1)}, \ldots, \theta_t^{(k)}\}$:
- Expansion: For each $\theta_t^{(i)}$, generate candidate directions:
  - $g_1 = \nabla L(\theta_t^{(i)})$ (the true gradient)
  - $g_j = \nabla L(\theta_t^{(i)}) + \epsilon_j$ for $j = 2, \ldots, m$, where $\epsilon_j \sim \mathcal{N}(0, \sigma^2 I)$
- Update: Create candidates $\theta_{t+1}^{(i,j)} = \theta_t^{(i)} - \alpha g_j$
- Selection: Keep the $k$ candidates with the lowest loss
The noise injection provides exploration while selection pressure drives exploitation of promising directions.
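The loop above can be sketched in a few lines of numpy. This is a minimal illustration, not the repository implementation; the function names and defaults are illustrative:

```python
import numpy as np

def beam_gd(loss, grad, theta0, k=3, m=6, sigma=0.5, alpha=1e-3, iters=500, seed=0):
    """Beam-search gradient descent sketch: keep the k best parameter sets,
    expanding each into m candidate descent directions per iteration."""
    rng = np.random.default_rng(seed)
    beam = [np.asarray(theta0, dtype=float)]  # start with a single beam entry
    for _ in range(iters):
        candidates = []
        for theta in beam:
            g = grad(theta)
            # g_1 is the true gradient; g_2..g_m add Gaussian noise for exploration
            directions = [g] + [g + rng.normal(0.0, sigma, size=g.shape)
                                for _ in range(m - 1)]
            candidates += [theta - alpha * d for d in directions]
        # Selection: retain the k candidates with the lowest loss
        candidates.sort(key=loss)
        beam = candidates[:k]
    return beam[0]
```

Each iteration evaluates the loss on $k \times m$ candidates, which is the source of the extra function-evaluation cost discussed below.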
Experimental Setup and Results
We evaluated six optimization problems: a high-dimensional quadratic (50D), the Rosenbrock function (10D), an ill-conditioned quadratic (20D), a noisy quadratic (30D), neural network regression (221 parameters), and logistic regression (16D). Each was tested with five random starts, beam width $k=3$, noise scale $\sigma=0.5$, learning rate $\alpha=0.001$, and a maximum of 500 iterations.
Beam-GD achieved a lower final loss on 5 of 6 problems:
| Problem | Improvement | Cost Ratio |
|---|---|---|
| High-Dim Quadratic | +5.69% | 18x |
| Ill-Conditioned | +7.75% | 18x |
| Noisy Quadratic | +4.24% | 18x |
| Neural Network | +13.18% | 18x |
| Logistic Regression | +11.37% | 18x |
| Rosenbrock | Failed (NaN) | 18x |
The algorithm showed consistent improvements across problem types but required ~18x more function evaluations. This is likely an overestimate: the beam tends to converge sooner, while the current setup always runs the full 500 iterations because neither method reached the optimum within that budget. (Rosenbrock caused numerical instabilities for both methods.) The algorithm is also embarrassingly parallel: all candidate evaluations within an iteration are independent.
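Because candidate evaluations are independent, the selection step parallelizes trivially. A minimal sketch using a thread pool (the `expensive_loss` and `select_top_k` names are illustrative; a process pool would be preferable for CPU-bound Python objectives):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def expensive_loss(theta):
    # stand-in for an expensive objective function
    return float(np.sum(theta ** 2))

def select_top_k(candidates, k=3, max_workers=4):
    """Evaluate all candidate losses concurrently, then keep the k best."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        losses = list(pool.map(expensive_loss, candidates))
    order = np.argsort(losses)[:k]
    return [candidates[i] for i in order]
```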
Analysis
The 4-13% improvements over vanilla gradient descent are substantial but must be contextualized. Modern optimizers like Adam incorporate momentum and adaptive learning rates, likely reducing the performance gap significantly. Adam's adaptive nature already addresses many issues that beam search targets - gradient noise and poor conditioning. Applying beam search to Adam (Beam Adam) is possible but probably yields diminishing returns.
Better Candidate Generation: One limitation of the method is that Gaussian noise is a primitive exploration strategy that ignores optimization structure. More principled approaches could leverage the geometry and history of the optimization process:
- Momentum-based candidates could explore different velocity trajectories rather than just perturbing the current gradient, potentially discovering paths that benefit from accumulated directional bias.
- Quasi-Newton directions incorporating curvature information could generate candidates that account for the local landscape shape rather than assuming isotropic descent.
- Orthogonal exploration, i.e. generating candidates perpendicular to the gradient, explores contour lines rather than just descent directions and could find ridge-walking paths that purely greedy methods miss.
- Adaptive noise scaling based on gradient reliability or optimization progress could provide principled exploration intensity: reduce noise when gradients are trustworthy, increase it when stuck in poor regions.
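Two of these ideas (momentum-based and orthogonal candidates) can be sketched concretely. This is a hypothetical generator, not part of the current method; all names and defaults are illustrative:

```python
import numpy as np

def candidate_directions(g, velocity, beta=0.9, n_orth=2, sigma=0.5, rng=None):
    """Hypothetical candidate generators beyond isotropic Gaussian noise."""
    if rng is None:
        rng = np.random.default_rng()
    dirs = [g]                        # true gradient
    dirs.append(beta * velocity + g)  # momentum-style candidate
    for _ in range(n_orth):
        r = rng.normal(size=g.shape)
        # project out the gradient component -> perturbation orthogonal to g
        r -= (r @ g) / (g @ g + 1e-12) * g
        dirs.append(g + sigma * r)    # gradient plus orthogonal exploration
    return dirs
```

The orthogonal perturbation moves along a contour of the local linear model rather than down it, which is one way to realize the ridge-walking idea above.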
Code at github.com/finite-sample/beamgd