A Lightweight ALS Solver for Iterative GLS

When generalized least squares spans hundreds of equations, the error covariance becomes the problem. Every loop that updates $\beta$ drags along a $K \times K$ inverse; storing it costs $O(K^2)$ memory and inverting it $O(K^3)$ time. Even when we model the covariance statistically as low-rank plus diagonal, the classic EM routine recreates dense inverses in its M-step, so most of the computational bill comes back.

A gentler way is to keep the low-rank structure all the way through estimation. We assume the covariance is

$$\Sigma = F F^T + \text{diag}(D),$$

with $k$ factors ($k \ll K$). Here $F \in \mathbb{R}^{K \times k}$ is the factor-loading matrix, $D = \text{diag}(d_1, \ldots, d_K)$ collects equation-specific variances, and the parameter count drops from $\sim K^2$ to $Kk + K$. The picture is simple: a few shared directions move many equations together (the columns of $F$), and a per-equation noise level $D$ fills in the rest. The difficulty is algorithmic: fit the model without rebuilding dense matrices every iteration.
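As a sketch of the setup (variable names are my own, not the alsgls API), here is the factor covariance in NumPy, together with the parameter-count comparison:

```python
import numpy as np

rng = np.random.default_rng(0)
K, k = 120, 8  # equations, factors

F = 0.5 * rng.standard_normal((K, k))   # factor loadings
d = rng.uniform(0.5, 2.0, size=K)       # equation-specific variances

# Dense covariance, formed here only for illustration;
# the solver itself never materializes it.
Sigma = F @ F.T + np.diag(d)

dense_params = K * (K + 1) // 2         # free parameters of a full covariance
factor_params = K * k + K               # loadings plus diagonal
print(dense_params, factor_params)      # 7260 vs 1080
```

At $K = 120$, $k = 8$, the factor model carries 1080 parameters against 7260 for an unrestricted symmetric covariance.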

Algorithm Design

The algorithm follows that picture. Rather than invert $\Sigma$ directly, we use the Woodbury identity so all the hard work collapses to a tiny $k \times k$ system; the diagonal part is inverted element-wise. Concretely,

$$\Sigma^{-1} = D^{-1} - D^{-1} F (I + F^T D^{-1} F)^{-1} F^T D^{-1}.$$
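A minimal sketch of that identity in code (my own naming, not the package's): applying $\Sigma^{-1}$ to a vector touches $D^{-1}$ element-wise and factorizes only the $k \times k$ core, for an $O(Kk^2 + k^3)$ cost.

```python
import numpy as np

def sigma_inv_apply(F, d, v):
    """Apply (F F^T + diag(d))^{-1} to v via the Woodbury identity.

    Only a k x k system is ever solved; no K x K matrix is formed.
    """
    Dinv_v = v / d                                 # D^{-1} v, element-wise
    Dinv_F = F / d[:, None]                        # D^{-1} F
    core = np.eye(F.shape[1]) + F.T @ Dinv_F       # I + F^T D^{-1} F  (k x k)
    return Dinv_v - Dinv_F @ np.linalg.solve(core, F.T @ Dinv_v)

# Sanity check against a dense inverse at small K
rng = np.random.default_rng(1)
K, k = 30, 4
F = rng.standard_normal((K, k))
d = rng.uniform(0.5, 2.0, size=K)
v = rng.standard_normal(K)
dense = np.linalg.solve(F @ F.T + np.diag(d), v)
assert np.allclose(sigma_inv_apply(F, d, v), dense)
```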

For the $\beta$ step we never assemble the block normal equations at all. We expose a linear operator that applies $X^T \Sigma^{-1} X$ using Woodbury and solve the resulting system with conjugate gradients and a diagonal preconditioner. No dense $K \times K$ arrays are ever formed. Then, given residuals, we refresh the factors with two ridge least-squares updates and recompute the diagonal from residual variances. A small ridge and a guarded stopping rule keep the iterations tame. In practice we see convergence in a handful of sweeps.
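One way to realize that matrix-free $\beta$ step with SciPy, sketched under my own naming and a single-observation layout ($y \in \mathbb{R}^K$, $X \in \mathbb{R}^{K \times p}$); the actual alsgls code may organize this differently:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def sigma_inv_apply(F, d, v):
    """Apply (F F^T + diag(d))^{-1} via the Woodbury identity."""
    Dinv_v = v / d
    Dinv_F = F / d[:, None]
    core = np.eye(F.shape[1]) + F.T @ Dinv_F
    return Dinv_v - Dinv_F @ np.linalg.solve(core, F.T @ Dinv_v)

def gls_beta(X, y, F, d, ridge=1e-3):
    """Solve (X^T Sigma^{-1} X + ridge I) beta = X^T Sigma^{-1} y
    matrix-free: Sigma^{-1} is only ever applied, never assembled."""
    p = X.shape[1]
    normal_op = LinearOperator(
        (p, p), matvec=lambda b: X.T @ sigma_inv_apply(F, d, X @ b) + ridge * b
    )
    rhs = X.T @ sigma_inv_apply(F, d, y)
    # Diagonal (Jacobi) preconditioner built from X^T D^{-1} X, a cheap proxy
    diag = (X * X / d[:, None]).sum(axis=0) + ridge
    precond = LinearOperator((p, p), matvec=lambda r: r / diag)
    beta, info = cg(normal_op, rhs, M=precond, atol=1e-12)
    assert info == 0, "CG did not converge"
    return beta

# Tiny check against the dense normal equations
rng = np.random.default_rng(2)
K, p, k = 50, 6, 3
X = rng.standard_normal((K, p))
F = 0.7 * rng.standard_normal((K, k))
d = rng.uniform(0.5, 2.0, size=K)
y = rng.standard_normal(K)
Si = np.linalg.inv(F @ F.T + np.diag(d))
beta_dense = np.linalg.solve(X.T @ Si @ X + 1e-3 * np.eye(p), X.T @ Si @ y)
assert np.allclose(gls_beta(X, y, F, d), beta_dense, rtol=1e-3, atol=1e-6)
```

The `LinearOperator` is the whole trick: conjugate gradients only needs matrix-vector products, so the $K \times K$ object never exists anywhere in memory.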

Uncertainty Calibration

Uncertainty needs to be honest, so after fitting we add a light calibration pass. Four scalars handle most of the gap between model and data:

  • $\alpha$ rescales the diagonal $D$
  • $\gamma$ inflates factor strength without changing per-equation variances
  • $\tau$ optionally adds a tiny data-driven low-rank boost
  • $s$ scales the whole covariance

We pick them by conservative cross-validation to get sensible z-coverage and Mahalanobis tails, then apply them to the test set.
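The post doesn't spell out how the scalars enter, but a natural reading is multiplicative, and it keeps the covariance in factor form so the Woodbury machinery above still applies. A sketch under that assumption (function name mine; the $\tau$ boost omitted for brevity):

```python
import numpy as np

def calibrate_covariance(F, d, alpha=1.0, gamma=1.0, s=1.0):
    """Apply calibration scalars to a factor covariance.

    alpha rescales the diagonal, gamma inflates factor strength,
    and s scales the whole covariance:
        Sigma_cal = s * (gamma * F F^T + alpha * diag(d)).
    Returned in factor form so Woodbury-based code keeps working.
    """
    return np.sqrt(s * gamma) * F, s * alpha * d

rng = np.random.default_rng(3)
F = rng.standard_normal((10, 2))
d = rng.uniform(0.5, 2.0, size=10)
F_cal, d_cal = calibrate_covariance(F, d, alpha=1.2, gamma=0.9, s=1.1)
```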

A useful rule of thumb: if your GLS routine keeps alternating between $\beta$ and a fresh $\hat{\Sigma}$, replacing the $\hat{\Sigma}$-update with this factor-ALS step preserves the statistical fit while cutting the memory footprint by an order of magnitude.

Empirical Results

In a synthetic benchmark that isn't toy-sized ($K = 120$; SUR with $p = 3$, and GLS with heterogeneous $p_j \in \{2, 4, 8\}$; train $N_{tr} = 240$, test $N_{te} = 120$; $k = 8$; $\lambda_F = \lambda_B = 10^{-3}$), ALS and EM land on essentially the same fit while using very different amounts of compute.

SUR Results

For SUR, ALS finishes in 0.110 s and uses 0.0086 MB, while EM takes 3.673 s and 1.152 MB. Test MSE is 11.5413 (ALS) vs 11.5410 (EM); calibrated test NLL per sample is −48.966 (ALS) vs −48.635 (EM). That is about 33× faster and 130× lighter with indistinguishable accuracy.

Heterogeneous GLS Results

For heterogeneous-design GLS, ALS finishes in 0.105 s and uses 0.0086 MB, while EM takes 1.794 s and 2.553 MB. Test MSE is 11.8980 for both; calibrated test NLL per sample is −50.190 (ALS) vs −51.300 (EM). Here EM has a small NLL edge, but ALS is about 17× faster and 295× lighter.

Broader Applications

This pattern shows up beyond SUR/GLS:

  • Variance-components and random-effects updates (REML)
  • Feasible GLS with heteroskedastic weights
  • Optimal-weight GMM (moment-covariance updates)
  • Spatial GLS (where low-rank bases can capture broad spatial patterns)

All benefit from avoiding dense $K \times K$ inverses when a few shared directions dominate.

More at: https://github.com/finite-sample/alsgls
