Boosting Stability: Fixing XGBoost Instability Under Row Permutation
Shuffle your training data, and XGBoost might give you a different model, even when you keep the features, hyperparameters, and random_state fixed. This behavior violates what most practitioners reasonably expect: that models should be invariant to row permutation. It can lead to silent drift, flaky tests, and spurious alerts.
This, however, isn't an implementation quirk. It reflects a trade-off between speed and determinism.
Illustrative Impact
We ran XGBoost 15 times on the same simulated data, shuffling the rows before each fit. Even with the same seed, each run produced different predictions: the average out-of-sample prediction RMSE between runs was about 3 percentage points.
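Here is a minimal sketch of that kind of experiment, on a synthetic dataset with illustrative settings (the data, model parameters, and run count are stand-ins, not the exact simulation behind the numbers above):

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; only the row order changes between fits.
X, y = make_regression(n_samples=5000, n_features=20, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.RandomState(0)
preds = []
for _ in range(15):
    perm = rng.permutation(len(y_train))          # shuffle rows, nothing else
    model = xgb.XGBRegressor(tree_method="hist", subsample=0.8,
                             colsample_bytree=0.8, n_estimators=200,
                             random_state=42, n_jobs=4)
    model.fit(X_train[perm], y_train[perm])
    preds.append(model.predict(X_test))

preds = np.array(preds)
# Average pairwise RMSE between the out-of-sample predictions of different runs
pairwise = [np.sqrt(np.mean((preds[i] - preds[j]) ** 2))
            for i in range(len(preds)) for j in range(i + 1, len(preds))]
print("mean pairwise prediction RMSE across shuffled refits:", np.mean(pairwise))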
Root Cause
Histogram binning is the real culprit, specifically the parallel quantile sketch used by tree_method='hist' when XGBoost trains with multiple threads. When a single thread builds the histogram, the bin cut-points are deterministic and stable under row permutation. With multiple threads, each worker sketches a chunk of rows, and when those sketches are merged the final bins depend on how the chunks were formed, and therefore on row order.
Turning on row subsampling (subsample < 1) makes this worse: shuffling rows now changes which examples participate in every boosting round and how their values are binned. Even with random_state fixed, the RNG advances in a different sequence because it traverses the rows in a different order.
Column subsampling (colsample_bytree < 1) adds a smaller, second-order source of noise: the random feature subset can differ across runs, nudging split choices, even with a fixed random_state.
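To see the sampling effect in isolation, the following sketch (synthetic data, illustrative settings) compares an exact-method fit with and without row subsampling on a row-shuffled copy of the same data:

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=2000, n_features=10, random_state=0)
perm = np.random.RandomState(1).permutation(len(y))

def fit_predict(subsample, Xf, yf):
    # Exact trees remove the histogram issue, so any remaining
    # row-order sensitivity comes from the sampling RNG.
    model = xgb.XGBRegressor(tree_method="exact", subsample=subsample,
                             colsample_bytree=1.0, n_estimators=100,
                             random_state=42, n_jobs=1)
    model.fit(Xf, yf)
    return model.predict(X)

for s in (1.0, 0.8):
    gap = np.max(np.abs(fit_predict(s, X, y) - fit_predict(s, X[perm], y[perm])))
    print(f"subsample={s}: max |prediction difference| = {gap}")
    # Expected: (near) zero for subsample=1.0, clearly nonzero for 0.8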
Solutions
- Enforce Determinism. The simplest solution is to remove the root causes: set tree_method='exact' to avoid histogram binning entirely, set subsample=1 and colsample_bytree=1 to turn off sampling, and use a fixed random_state:
params = {
'tree_method': 'exact', # Skip histogram approximation
'subsample': 1.0, # Use all rows
'colsample_bytree': 1.0, # Use all features
'random_state': 42 # Fix random seed
}
With this setup, we confirmed that shuffling the training rows yields identical predictions across retrains.
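A quick self-contained check of that claim (a sketch on synthetic data; the model settings mirror the block above):

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=2000, n_features=10, random_state=0)
perm = np.random.RandomState(0).permutation(len(y))

def fit(Xf, yf):
    return xgb.XGBRegressor(tree_method="exact", subsample=1.0,
                            colsample_bytree=1.0, random_state=42,
                            n_estimators=100).fit(Xf, yf)

gap = np.max(np.abs(fit(X, y).predict(X) - fit(X[perm], y[perm]).predict(X)))
print("max |prediction difference| after shuffling rows:", gap)  # expected: 0.0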
- Embrace Controlled Randomness. If you're willing to tolerate small differences for speed, keep tree_method='hist' but average predictions over a few shuffled fits (simple ensembling; see the sketch after this list). You may also want to track prediction variance across retrains as a quality metric and set clear expectations about the anticipated variance.
- Switch Algorithms. For practitioners seeking stable training by default, consider alternatives:
- CatBoost, which uses ordered boosting and is deterministic out-of-the-box
- LightGBM with deterministic=True and force_row_wise=True (or force_col_wise=True), which pins down how histograms are built instead of letting LightGBM choose a strategy at runtime
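For the controlled-randomness route, here is a minimal sketch of the shuffle-and-average idea (synthetic data, illustrative settings; the number of refits is a tuning knob):

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.RandomState(0)
per_run_preds = []
for _ in range(5):                                  # a handful of shuffled refits
    perm = rng.permutation(len(y_train))
    model = xgb.XGBRegressor(tree_method="hist", subsample=0.8,
                             n_estimators=200, random_state=42, n_jobs=4)
    model.fit(X_train[perm], y_train[perm])
    per_run_preds.append(model.predict(X_test))

per_run_preds = np.array(per_run_preds)
ensemble_pred = per_run_preds.mean(axis=0)          # averaged, more shuffle-stable
run_to_run_sd = per_run_preds.std(axis=0).mean()    # track this as a quality metric
print("mean per-point std across refits:", run_to_run_sd)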
Conclusion
This isn't a bug; it's a deliberate engineering choice. XGBoost trades exact reproducibility for the speed that makes it practical on real-world datasets. That trade-off, though, has real costs for reproducibility, debuggability, and compliance.
More: https://github.com/finite-sample/stableboost
Postscript
XGBoost isn’t alone. Many other supervised algorithms exhibit row-order sensitivity under typical usage:
- Random Forests: bootstrap samples or data sharding may depend on row position
- SGD-based models (e.g., neural nets, logistic regression with mini-batches): different order → different batches → different optimization paths (see the sketch after this list)
- Coordinate descent (e.g., randomized Lasso): ordering can affect convergence and solution
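As one concrete illustration (a sketch on synthetic data), an SGD-trained linear classifier ends up with different coefficients when the same rows arrive in a different order, even with a fixed random_state:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
perm = np.random.RandomState(1).permutation(len(y))

# Same data, same seed; only the row order differs between the two fits.
clf_a = SGDClassifier(random_state=42, max_iter=50, tol=None).fit(X, y)
clf_b = SGDClassifier(random_state=42, max_iter=50, tol=None).fit(X[perm], y[perm])
print("max |coefficient difference|:", np.max(np.abs(clf_a.coef_ - clf_b.coef_)))
# Typically nonzero: the visit order of samples changes the optimization path.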
p.p.s. Most popular unsupervised algorithms are also sensitive to row order: k-means, DBSCAN (when the parameter/data combination leaves border points ambiguous, generally with small eps or large min_samples), Gaussian Mixture Models, t-SNE, autoencoders, and so on.