(Better) Split Decisions: Toward A More Structurally Stable CART

Classification and Regression Trees (CART) condense data into one transparent decision tree. Their weakness is structural instability: a small change to the training data can replace an early split, such as “age > 65,” with a different one, such as “income > $50k,” propagating a completely different tree below it. This instability makes trees unsuitable for applications where consistency matters, like medical diagnosis protocols or regulatory compliance systems.

To make CART usable in those settings, we need structural stability: the tree should retain the same set of feature-threshold splits when refit on a fresh sample from the same population. (For a longer discussion on how to measure similarity between trees, see here.) The central challenge is to increase structural stability without unduly sacrificing predictive accuracy.

A rich literature tackles problems adjacent to the structural stability of CART.

  1. Ensemble Methods. Random Forests and Gradient Boosting make predictions by averaging hundreds of trees but forfeit single‑tree interpretability.
  2. Bias‑corrected split tests. ctree (Hothorn et al., 2006), QUEST (Loh & Shih, 1997), GUIDE (Loh, 2002), and ALOOF (Painsky & Rosset, 2015) remove CART’s preference for high‑cardinality features but do not tackle instability.
  3. Common Structure in Diverse Trees.
    1. Fused trees. Consensus Decision Trees (Kavšek et al., 2001) and successors such as SIRUS (Benard et al., 2019) first generate many diverse trees and then merge their common structure.
    2. Rule‑extraction methods. inTrees/STEL (Deng 2014) compresses a large tree ensemble into a short, relatively stable rule list. The procedure harvests every root‑to‑leaf path from a Random Forest, prunes redundant conditions, and retains only those paths that (i) occur frequently across the perturbed trees and (ii) maintain high accuracy on hold‑out data. Because support is tallied over hundreds of bootstrap variations, high‑frequency rules capture signal rather than sampling noise, so the final list is substantially more repeatable than a single CART.
  4. Split‑randomisation schemes. Random tie‑breaking, STABLEDT (Hara and Yoshida, 2023), and Extremely Randomised Trees (Geurts et al., 2006) inject randomness into the split chooser so that near‑equivalent splits are sampled rather than chosen deterministically. This smooths the tree‑selection function in expectation and lowers average sensitivity, but each individual realization remains random, and therefore structurally unstable, unless many such trees are aggregated or a post‑hoc consensus step is applied.
  5. Optimal Trees. Optimal Classification Trees (Bertsimas and Dunn, 2017) use mixed-integer optimization to find globally optimal trees rather than greedy solutions. OCT is not, however, stable to data perturbations: a single new observation can shift the objective enough to yield a different optimum, just as in CART.
  6. Centroid Tree. Bertsimas & Digalakis (2023) introduce a linear‑program–based path‑matching distance that quantifies how much two trees differ in split variables, thresholds, and leaf labels. They then treat stability as a post‑processing task: build a large pool of candidate trees (from CV folds, subsamples, hyper‑grid) and select the centroid—the tree with the smallest average path distance to all others, subject to accuracy and size constraints.
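The selection step itself is easy to sketch. A minimal version in Python, assuming a `distance(tree_a, tree_b)` function standing in for the paper’s LP‑based path‑matching metric and an `accuracy(tree)` evaluator (both hypothetical placeholders):

```python
import numpy as np

def select_centroid(trees, distance, accuracy, min_accuracy):
    """Return the candidate with the smallest mean distance to all
    other candidates, restricted to trees meeting an accuracy floor."""
    candidates = [t for t in trees if accuracy(t) >= min_accuracy]
    D = np.array([[distance(a, b) for b in candidates] for a in candidates])
    return candidates[int(np.argmin(D.mean(axis=1)))]
```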

Potentially Promising

  1. Two‑stage CART. A few applied studies (e.g., Goo et al., 2016) first run the Lasso to fix a predictor subset and then grow a CART on that reduced feature space. Two-stage CART potentially stabilizes which variables enter the model, but it leaves split points and split order just as volatile. With correlated predictors, the screen itself can wobble, so any gain in tree stability is case-by-case.
  2. Reducing the search space. Limiting the search grid—quantile or histogram bins in Spark MLlib’s maxBins, LightGBM (Ke et al., 2017), or the MDL discretiser of Fayyad and Irani (1992)—shrinks the decision space and thus the opportunity for near‑tie flips. Related ideas appear in SIRUS. The idea doesn't seem to have been applied to CART.
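As a sketch of how the binning idea could port to CART: pre‑bin each feature to a small quantile grid before fitting, so the greedy search can only place thresholds on that grid and near‑tied cut points collapse together. This uses scikit‑learn’s KBinsDiscretizer; the bin count is an illustrative choice, not a tuned value:

```python
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

def binned_cart(X, y, n_bins=32, **tree_kwargs):
    """Grow a CART whose thresholds are confined to quantile bin edges.
    Coarser grids leave fewer near-tied thresholds to flip between refits."""
    binner = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="quantile")
    Xb = binner.fit_transform(X)
    tree = DecisionTreeClassifier(**tree_kwargs).fit(Xb, y)
    return tree, binner  # keep the binner to transform future data identically
```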

Trivial Solution

Each split can be unstable, so instability scales with the number of decisions. Adding regularisation penalties—such as MDL or depth-based costs—promotes earlier stopping, producing a shallower, more stable tree at the expense of some accuracy.
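In scikit‑learn terms, this amounts to turning up the pruning knobs. A minimal sketch with illustrative penalty values (not tuned):

```python
from sklearn.tree import DecisionTreeClassifier

# Each knob forces earlier stopping, trading a little accuracy for
# fewer (and therefore fewer unstable) split decisions.
shallow_cart = DecisionTreeClassifier(
    max_depth=3,                 # depth-based cost: hard cap on decisions
    min_impurity_decrease=0.01,  # a split must earn a minimum gain
    ccp_alpha=0.005,             # cost-complexity pruning penalty
    random_state=0,
)
```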

Proposals and Evaluation

  1. RobustCART (node‑wise out‑of‑bag validation). At every node, we draw 100 random subsamples (30–70 % of the rows that reach that node), fit a decision stump on each in‑bag partition, record its hold‑out loss, and keep the split with the best average loss.
    Rationale: choose splits that generalize locally rather than those that merely maximize in‑sample impurity reduction (a sketch follows after this list).
  2. Frequency‑vote. Grow B resampled trees (bootstrap or subsample); keep every split that appears in ≥ 50 % of them. Choose the mode when there is no majority winner.
    Rationale: majority voting should filter out idiosyncratic splits while producing a single interpretable tree (see the consensus sketch after this list).
  3. MWU (Multiplicative‑Weights). Same pool of B trees, but each tree’s vote is weighted by exp(−η × its training error), so poorly performing trees count for less; the support threshold τ and weight η are tuned with 3‑fold CV on the fly.
    Rationale: down‑weight poorly performing trees and see whether a soft voting scheme outperforms the simple 50 % rule.
  4. Resampling strategies. Two data‑perturbation schemes feed all variants: the classic bootstrap and 80 % without‑replacement subsampling.
    Rationale: test whether bootstrap or cleaner subsamples lead to higher structural consensus.
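A sketch of the node‑wise selection in Proposal 1, using depth‑1 scikit‑learn stumps as the per‑subsample split finders. The 30–70 % draws follow the description above; rounding thresholds to pool near‑identical cut points is our illustrative choice, and edge cases (e.g., no subsample yielding any split) are ignored:

```python
import numpy as np
from collections import defaultdict
from sklearn.tree import DecisionTreeClassifier

def robust_split(X, y, n_draws=100, seed=None):
    """Node-wise out-of-bag split selection: fit a stump on each in-bag
    subsample, score it on the held-out rows, and return the
    (feature, threshold) pair with the best average hold-out loss."""
    rng = np.random.default_rng(seed)
    losses = defaultdict(list)
    n = len(y)
    for _ in range(n_draws):
        k = int(rng.uniform(0.3, 0.7) * n)            # 30-70% in-bag
        in_bag = rng.choice(n, size=k, replace=False)
        out_bag = np.setdiff1d(np.arange(n), in_bag)
        stump = DecisionTreeClassifier(max_depth=1).fit(X[in_bag], y[in_bag])
        if stump.tree_.node_count < 3:                # root never split
            continue
        feat = int(stump.tree_.feature[0])
        thr = round(float(stump.tree_.threshold[0]), 3)
        err = 1.0 - stump.score(X[out_bag], y[out_bag])
        losses[(feat, thr)].append(err)
    return min(losses, key=lambda s: np.mean(losses[s]))
```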
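And a consensus sketch covering Proposals 2 and 3 at the level of split sets: uniform weights reproduce the ≥ 50 % frequency vote, while weights exp(−η × error) give the MWU variant. Reassembling the kept splits into a single tree is a separate step not shown here:

```python
import numpy as np
from collections import defaultdict

def consensus_splits(trees, errors=None, eta=1.0, tau=0.5):
    """Tally (feature, rounded-threshold) splits across B resampled
    sklearn trees; keep those with weighted support >= tau, falling
    back to the modal split when nothing clears the threshold."""
    B = len(trees)
    w = np.ones(B) if errors is None else np.exp(-eta * np.asarray(errors))
    w = w / w.sum()
    support = defaultdict(float)
    for tree, wb in zip(trees, w):
        t = tree.tree_
        seen = {(int(t.feature[i]), round(float(t.threshold[i]), 3))
                for i in range(t.node_count) if t.children_left[i] != -1}
        for s in seen:                # each tree votes once per split
            support[s] += wb
    kept = [s for s, v in support.items() if v >= tau]
    return kept or [max(support, key=support.get)]
```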

Experimental Setup

Success metric: for trees T₁ and T₂ with split sets S₁ and S₂:

  • Jaccard(S₁, S₂) = |S₁ ∩ S₂| / |S₁ ∪ S₂|
  • Ranges from 0 (no common splits) to 1 (identical trees)
  • Average pairwise Jaccard across all tree pairs
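In code, taking (feature, threshold) pairs as the split identity; rounding thresholds so that near‑identical cut points compare equal is our illustrative choice:

```python
from itertools import combinations

def split_set(tree, ndigits=3):
    """(feature, rounded threshold) pairs of a fitted sklearn tree."""
    t = tree.tree_
    return {(int(t.feature[i]), round(float(t.threshold[i]), ndigits))
            for i in range(t.node_count) if t.children_left[i] != -1}

def jaccard(t1, t2):
    s1, s2 = split_set(t1), split_set(t2)
    return len(s1 & s2) / len(s1 | s2) if (s1 | s2) else 1.0

def mean_pairwise_jaccard(trees):
    pairs = list(combinations(trees, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```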

Data Simulation Setup

We designed five scenarios to stress-test different failure modes:

  • High Noise: 600 samples, 60 features (8 informative), noise σ=0.45
    Tests whether consensus filters out noise-driven splits
  • Duplicate-Correlated: Features duplicated with small perturbations
    Tests handling of multicollinearity
  • Class Imbalance: 80-20 class distribution
    Tests stability under skewed data
  • Sparse High-D: 800 samples, 1000 features (10 informative)
    Tests feature selection in p >> n regime
  • XOR Non-linearity: Non-linear decision boundaries
    Tests whether consensus preserves complex patterns
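For illustration, the first two scenarios could be generated with scikit‑learn’s make_classification; treating σ = 0.45 as additive Gaussian feature noise is our assumption, and the exact generators live in the repo:

```python
import numpy as np
from sklearn.datasets import make_classification

rng = np.random.default_rng(0)

# High Noise: 600 x 60 with 8 informative features, plus feature noise.
X_hn, y_hn = make_classification(n_samples=600, n_features=60,
                                 n_informative=8, n_redundant=0,
                                 random_state=0)
X_hn += rng.normal(0.0, 0.45, size=X_hn.shape)

# Class Imbalance: 80-20 class distribution via the `weights` argument.
X_ci, y_ci = make_classification(n_samples=600, n_features=20,
                                 n_informative=8, weights=[0.8, 0.2],
                                 random_state=0)
```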

Implementation Details

  • Parallelization with joblib cuts wall‑clock time for building the B resampled trees (see the sketch below).
  • Memory and runtime are logged alongside Jaccard stability and test accuracy.
  • Each variant is run 30 outer × 30 inner times (900 trees) and repeated under five random seeds.
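The parallel fan‑out is a one‑liner with joblib; `X` and `y` here stand in for one scenario’s data:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.tree import DecisionTreeClassifier

def fit_one(X, y, seed, frac=0.8):
    """One CART on an 80% without-replacement subsample."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    return DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx])

# Fan the B tree builds out across all available cores.
trees = Parallel(n_jobs=-1)(delayed(fit_one)(X, y, s) for s in range(100))
```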

Results

Across all five stress‑test scenarios, the baseline greedy CART is structurally fragile: with bootstrapping, its mean Jaccard stability never exceeds 0.013, and even an 80 % subsample lifts it only to 0.041.

Adding a consensus step changes the picture. When trees are grown on 80 % subsamples and their splits are merged by a simple frequency vote (≥ 50 % support), stability rises by one to two orders of magnitude—e.g., from 0.005 → 0.490 (class‑imbalance) and 0.002 → 0.435 (XOR). The MWU‑weighted vote is comparable, occasionally superior (high‑noise: 0.428 vs 0.299), but offers no consistent advantage.

Using the same voting schemes on bootstrap trees helps far less (typical Jaccard gains are only 0.05–0.15), confirming that the duplicated observations in bootstrap samples add noise that consensus cannot fully cancel.

Crucially, these stability gains do not come at the expense of accuracy. Out‑of‑sample accuracy climbs alongside stability: the frequency vote on 80 % subsamples improves on greedy CART by 6–12 percentage points in every scenario (e.g., 0.773 → 0.851 in high noise, 0.683 → 0.769 in class imbalance), and MWU voting delivers almost identical accuracies.

More at: https://github.com/finite-sample/robust-cart
