Regression to the Mean Model


Shuffle your XGBoost training data and you can get a different model. Not different hyperparameters or different features, just a different row order. With tree_method='hist' and multiple threads, the parallel quantile sketch that determines bin boundaries depends on how rows are chunked across workers (see stableboost and the associated blog). The fixes there are architectural: use tree_method='exact', disable subsampling, switch to CatBoost's ordered boosting, or ensemble across shuffles.

Neural networks exhibit the same instability but from a different source. On UCI Digits with a pool of 30 MLPs trained on the same data, varying only the initialization produces roughly 1000x more per-example prediction variance than varying only the shuffle order. At convergence on that dataset, minibatch ordering barely matters: the starting point determines which basin the optimizer settles into. Unlike XGBoost, there is no deterministic mode that eliminates the problem, because neural networks cannot be forced to converge to a single basin.
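Measuring seed-induced prediction variance is straightforward. A minimal sketch with scikit-learn (note a caveat: `MLPClassifier`'s `random_state` seeds initialization and shuffling jointly, so this measures combined seed variance rather than the init-versus-shuffle decomposition described above; pool size and architecture here are illustrative, not the experiment's):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# small pool of seed-varied MLPs for illustration
probs = []
for seed in range(5):
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=50,
                        random_state=seed).fit(X_tr, y_tr)
    probs.append(clf.predict_proba(X_te))
probs = np.stack(probs)                      # (k, n_test, n_classes)

# per-example variance across seeds, summed over classes
per_example_var = probs.var(axis=0).sum(axis=-1)
print(per_example_var.mean())
```

Repeating this with initialization fixed and only data order varied (which requires a framework that exposes the two seeds separately, e.g. PyTorch) gives the two variance components being compared.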

The standard responses each carry costs. Serving the full ensemble multiplies inference cost by the ensemble size. Knowledge distillation adds a second training phase with its own hyperparameters. Weight averaging methods (SWA, SWAD, Ensemble of Averages) operate in parameter space and assume models share enough structure for weight interpolation to be meaningful. Bayesian Model Averaging in practice tends to collapse to selecting the model with highest validation likelihood. None of these address the setting where accuracy is effectively tied across seeds and the goal is to select one model with low prediction variance, without retraining or increasing inference cost.

The method proposed in stable_selection is simple. Train k models with different random seeds. On a validation set, compute the ensemble mean predicted probabilities. Select the model whose predictions have the smallest L2 divergence to that mean (the mean per-example sum of squared probability differences; KL, JS, and Hellinger all work comparably) and deploy it. The ensemble is used at selection time only.
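The selection step is a few lines of numpy. A sketch (the function name is mine; it assumes a `(k, n, c)` array of predicted probabilities from the k seed-varied models on the validation set):

```python
import numpy as np

def select_by_ensemble_proximity(val_probs):
    """val_probs: (k, n, c) predicted probabilities on the validation set.
    Returns the index of the model whose predictions have the smallest
    L2 divergence to the ensemble mean: the mean over examples of the
    sum of squared probability differences."""
    ens_mean = val_probs.mean(axis=0)                             # (n, c)
    div = ((val_probs - ens_mean) ** 2).sum(axis=2).mean(axis=1)  # (k,)
    return int(np.argmin(div))

# toy check: model 0 sits exactly at the mean of models 1 and 2,
# so it is closest to the three-model ensemble mean
probs = np.array([
    [[0.5, 0.5]],
    [[0.8, 0.2]],
    [[0.2, 0.8]],
])
print(select_by_ensemble_proximity(probs))  # → 0
```

Swapping in KL, JS, or Hellinger divergence changes only the `div` line; per the results above, the choice barely matters.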

The intuition is straightforward. The ensemble mean is the minimum-variance point in prediction space for the set of trained models. A model that sits close to it has landed, by luck of initialization, in a basin that produces consensus predictions. It is the least idiosyncratic member of the pool. Selecting it cannot improve accuracy, since all k models have approximately the same accuracy, but it can reduce the variance of predictions on new data.
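The claim that the ensemble mean is the minimum-variance point follows from the standard decomposition of squared distance: for any candidate point $q$ in prediction space and models $p_1, \dots, p_k$ with mean $\bar p$,

$$
\frac{1}{k}\sum_{i=1}^{k}\lVert p_i - q\rVert^2
= \frac{1}{k}\sum_{i=1}^{k}\lVert p_i - \bar p\rVert^2 + \lVert \bar p - q\rVert^2,
$$

which is minimized exactly at $q = \bar p$. Selecting the model nearest $\bar p$ thus picks the member closest to the point of least disagreement.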

The evidence comes from a factorial simulation over 22 data-generating-process configurations varying class separation, label noise, number of classes, sample size, and feature redundancy. Each configuration trains pools of K=12 models and repeats the full pipeline M=6 times with independent seeds. Ensemble proximity selection reduces out-of-sample prediction variance by a mean of 7.4% (median 7.6%, 95% bootstrap CI [-0.7%, +14.9%]), with a positive effect in 18 of 22 configurations. Mean accuracy cost is +0.004 percentage points. The gains concentrate where variance is highest: hard problems (class separation 0.3) yield +20.1%, small samples (n=300) yield +19.9%, and the benefit scales with the number of classes. Easy binary problems with well-separated classes show near-zero or slightly negative effects, as expected when models already converge to near-identical predictions and the selection criterion overfits the validation set.

The practical diagnostic is prediction disagreement, defined as the fraction of test examples on which the pool's hard predictions are not unanimous. It correlates with variance reduction at rho=0.29 across the 22 configurations. Validation-to-test transfer of the proximity ranking averages rho=0.50, with 5 of 22 configurations showing significant transfer after FDR correction. Among single-model selection criteria, ensemble proximity (+7.7%) outperforms random selection (-1.9%), minimum ECE (which increases variance), maximum entropy (+4.0%), and median disagreement.
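The disagreement diagnostic is cheap to compute before committing to the method. A sketch (function name is mine; assumes the same `(k, n, c)` probability-array convention, with hard predictions taken as the argmax class):

```python
import numpy as np

def prediction_disagreement(probs):
    """probs: (k, n, c) predicted probabilities from k models.
    Returns the fraction of the n examples on which the models'
    hard (argmax) predictions are not unanimous."""
    hard = probs.argmax(axis=2)                  # (k, n)
    unanimous = (hard == hard[0]).all(axis=0)    # (n,)
    return float(1.0 - unanimous.mean())

# two models agree on the first example, disagree on the second
probs = np.array([
    [[0.9, 0.1], [0.6, 0.4]],
    [[0.8, 0.2], [0.3, 0.7]],
])
print(prediction_disagreement(probs))  # → 0.5
```

If this fraction is near zero, the pool has already converged to near-identical predictions and proximity selection has little variance left to remove.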
