Boosting Stability: Fixing XGBoost Instability Under Row Permutation
Shuffle your training data, and XGBoost might give you a different model, even when you keep the features, hyperparameters, and random_state fixed. This behavior violates what most practitioners reasonably expect: that models should be invariant to row permutation. It can lead to silent drift, flaky tests, and spurious alerts.
This, however, isn't an implementation quirk. It reflects a trade-off between speed and determinism.
Illustrative Impact
We ran XGBoost 15 times on the same simulated data, shuffling the rows before each fit. Even with the same seed, each run produced different predictions: the average out-of-sample prediction RMSE between runs was about 3 percentage points.
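Here is a minimal sketch of that kind of experiment, on a synthetic dataset with illustrative settings (the data, model parameters, and run count are stand-ins, not the exact simulation behind the numbers above):

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; only the row order changes between fits.
X, y = make_regression(n_samples=5000, n_features=20, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.RandomState(0)
preds = []
for _ in range(15):
    perm = rng.permutation(len(y_train))          # shuffle rows, nothing else
    model = xgb.XGBRegressor(tree_method="hist", subsample=0.8,
                             colsample_bytree=0.8, n_estimators=200,
                             random_state=42, n_jobs=4)
    model.fit(X_train[perm], y_train[perm])
    preds.append(model.predict(X_test))

preds = np.array(preds)
# Average pairwise RMSE between the out-of-sample predictions of different runs
pairwise = [np.sqrt(np.mean((preds[i] - preds[j]) ** 2))
            for i in range(len(preds)) for j in range(i + 1, len(preds))]
print("mean pairwise prediction RMSE across shuffled refits:", np.mean(pairwise))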
Root Cause
Histogram binning is the real culprit, specifically the parallel quantile sketch used by tree_method='hist' when XGBoost trains with multiple threads. When a single thread builds the histogram, the bin cut-points are deterministic and stable under row permutation. With multiple threads, each worker sketches a chunk of rows, and when those sketches are merged the final bins depend on how the chunks were formed, and therefore on row order.
Turning on row subsampling (subsample < 1) makes this worse: shuffling rows now changes which examples participate in every boosting round and how their values are binned. Even with random_state fixed, the RNG advances in a different sequence because it traverses the rows in a different order.
Column subsampling (colsample_bytree < 1) adds a smaller, second-order source of noise: the random feature subset can differ across runs, nudging split choices, even with a fixed random_state.
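To see the sampling effect in isolation, the following sketch (synthetic data, illustrative settings) compares an exact-method fit with and without row subsampling on a row-shuffled copy of the same data:

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=2000, n_features=10, random_state=0)
perm = np.random.RandomState(1).permutation(len(y))

def fit_predict(subsample, Xf, yf):
    # Exact trees remove the histogram issue, so any remaining
    # row-order sensitivity comes from the sampling RNG.
    model = xgb.XGBRegressor(tree_method="exact", subsample=subsample,
                             colsample_bytree=1.0, n_estimators=100,
                             random_state=42, n_jobs=1)
    model.fit(Xf, yf)
    return model.predict(X)

for s in (1.0, 0.8):
    gap = np.max(np.abs(fit_predict(s, X, y) - fit_predict(s, X[perm], y[perm])))
    print(f"subsample={s}: max |prediction difference| = {gap}")
    # Expected: (near) zero for subsample=1.0, clearly nonzero for 0.8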
Solutions
- Enforce Determinism. The simplest solution is to remove the root causes: set tree_method='exact' to avoid histogram binning entirely, set subsample=1 and colsample_bytree=1 to turn off sampling, and use a fixed random_state:
params = {
'tree_method': 'exact', # Skip histogram approximation
'subsample': 1.0, # Use all rows
'colsample_bytree': 1.0, # Use all features
'random_state': 42 # Fix random seed
}
With this setup, we confirmed that shuffling the training rows yields identical predictions across retrains.
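A quick self-contained check of that claim (a sketch on synthetic data; the model settings mirror the block above):

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=2000, n_features=10, random_state=0)
perm = np.random.RandomState(0).permutation(len(y))

def fit(Xf, yf):
    return xgb.XGBRegressor(tree_method="exact", subsample=1.0,
                            colsample_bytree=1.0, random_state=42,
                            n_estimators=100).fit(Xf, yf)

gap = np.max(np.abs(fit(X, y).predict(X) - fit(X[perm], y[perm]).predict(X)))
print("max |prediction difference| after shuffling rows:", gap)  # expected: 0.0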
- Embrace Controlled Randomness. If you're willing to tolerate small differences for speed, keep tree_method='hist' but average predictions over a few shuffled fits (simple ensembling; see the sketch after this list). You may also want to track prediction variance across retrains as a quality metric and set clear expectations about the anticipated variance.
- Switch Algorithms. For practitioners seeking stable training by default, consider alternatives:
- CatBoost, which uses ordered boosting and is deterministic out-of-the-box
- LightGBM with deterministic=True and force_row_wise=True (or force_col_wise=True), which pins down how histograms are built instead of letting LightGBM choose a strategy at runtime
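For the controlled-randomness route, here is a minimal sketch of the shuffle-and-average idea (synthetic data, illustrative settings; the number of refits is a tuning knob):

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.RandomState(0)
per_run_preds = []
for _ in range(5):                                  # a handful of shuffled refits
    perm = rng.permutation(len(y_train))
    model = xgb.XGBRegressor(tree_method="hist", subsample=0.8,
                             n_estimators=200, random_state=42, n_jobs=4)
    model.fit(X_train[perm], y_train[perm])
    per_run_preds.append(model.predict(X_test))

per_run_preds = np.array(per_run_preds)
ensemble_pred = per_run_preds.mean(axis=0)          # averaged, more shuffle-stable
run_to_run_sd = per_run_preds.std(axis=0).mean()    # track this as a quality metric
print("mean per-point std across refits:", run_to_run_sd)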
Conclusion
This isn't a bug; it's a deliberate engineering choice. XGBoost trades exact reproducibility for the speed that makes it practical on real-world datasets. That trade-off, though, has real costs for reproducibility, debuggability, and compliance.
More: https://github.com/finite-sample/stableboost
Postscript
XGBoost isn’t alone. Many other supervised algorithms exhibit row-order sensitivity under typical usage:
- Random Forests: bootstrap samples or data sharding may depend on row position
- SGD-based models (e.g., neural nets, logistic regression with mini-batches): different order → different batches → different optimization paths (see the sketch after this list)
- Coordinate descent (e.g., randomized Lasso): ordering can affect convergence and solution
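As one concrete illustration (a sketch on synthetic data), an SGD-trained linear classifier ends up with different coefficients when the same rows arrive in a different order, even with a fixed random_state:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
perm = np.random.RandomState(1).permutation(len(y))

# Same data, same seed; only the row order differs between the two fits.
clf_a = SGDClassifier(random_state=42, max_iter=50, tol=None).fit(X, y)
clf_b = SGDClassifier(random_state=42, max_iter=50, tol=None).fit(X[perm], y[perm])
print("max |coefficient difference|:", np.max(np.abs(clf_a.coef_ - clf_b.coef_)))
# Typically nonzero: the visit order of samples changes the optimization path.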
p.p.s. Most popular unsupervised algorithms are also sensitive to row order: k-means, DBSCAN (when the parameter/data combination leaves border points ambiguous, generally with small eps or large min_samples), Gaussian Mixture Models, t-SNE, autoencoders, and so on.