Measuring Tree Stability


Consider training a decision tree to predict customer churn. You split your data in half and build a model on one half. The tree identifies tenure, monthly charges, and contract type as key factors. Satisfied with the results, you rebuild the model on the other random split—but now the tree uses completely different features: payment method, total charges, and number of services. Which tree should you trust? Why did the important factors change?

This is the problem of tree instability.

Tree stability refers to an algorithm's ability to produce similar trees when trained on different samples from the same population. A stable algorithm produces trees that tell the same story—they identify similar patterns, use similar features, and make similar predictions—even when trained on different subsets of data. An unstable algorithm produces trees that vary wildly with small changes in training data.

This is distinct from another notion of stability: robustness to small input perturbations like typos, image distortions, or measurement noise. That type of stability, often addressed through techniques like data augmentation, adversarial training, and Jacobian regularization, focuses on making models smooth and locally consistent. Our focus is different. We are concerned with the expected similarity between two models (or their outputs) trained on independent samples from the same population:

$$\text{Stability}(L) = \mathbb{E}_{t_1,t_2}\bigl[\operatorname{Sim}(T_1,T_2)\bigr]$$

where $t_1$ and $t_2$ are independent training samples drawn from the same population; $L$ denotes the learning algorithm; and $T_1=L(t_1)$ and $T_2=L(t_2)$ are the trees it produces. We deliberately leave similarity undefined for now—different applications may care about different aspects of tree similarity.
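As a concrete (if simplified) sketch, the expectation above can be estimated by Monte Carlo: repeatedly draw two training samples from the same pool, fit a tree on each, and average a chosen similarity. The scikit-learn usage, the synthetic data, and the choice of prediction agreement as Sim are illustrative assumptions, not part of the definition:

```python
# Sketch: Monte Carlo estimate of Stability(L) = E[Sim(T1, T2)].
# Assumes scikit-learn is available; Sim here is prediction agreement
# on a held-out set, but structural or partition similarities plug in too.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X_eval = X[2000:]                                 # held-out evaluation points

def sim(t1, t2, X_eval):
    """Prediction agreement: one possible choice of Sim(T1, T2)."""
    return np.mean(t1.predict(X_eval) == t2.predict(X_eval))

sims = []
for _ in range(20):
    # Two independent training samples drawn from the same population (pool).
    i1 = rng.choice(2000, size=1000, replace=False)
    i2 = rng.choice(2000, size=1000, replace=False)
    t1 = DecisionTreeClassifier(random_state=0).fit(X[i1], y[i1])
    t2 = DecisionTreeClassifier(random_state=0).fit(X[i2], y[i2])
    sims.append(sim(t1, t2, X_eval))

stability = float(np.mean(sims))   # Monte Carlo estimate of E[Sim(T1, T2)]
print(round(stability, 3))
```

Swapping in a different `sim` function is how the structural and partition notions discussed later would be measured under the same framework.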

This kind of stability is important for three reasons:

  1. Trust: Users lose confidence in models that produce different explanations and predictions when trained on data from the same data distribution. If a model claims "contract type drives churn" on Monday but "payment method drives churn" on Tuesday, stakeholders lose faith. Worse, when the same customer gets different risk scores as the model is retrained on new data from the same population, it undermines confidence in the entire system.
  2. Generalization: High instability signals overfitting and a poor bias-variance tradeoff. Algorithmic stability theory (Bousquet & Elisseeff, 2002) shows that stable algorithms have bounded generalization error: if an algorithm's predictions change by at most β when one training example changes, then the gap between training and test error is bounded by a function of β.
  3. Interpretability: Unstable trees undermine the explanatory value of tree models. If the tree's structure is the explanation, an unstable structure is an unstable explanation.

Addressing instability, however, isn't straightforward because different applications care about different types of tree consistency, and instability itself stems from multiple sources. This essay examines stability across three dimensions: how trees are built (structural stability), how they partition data (partition stability), and what they predict (behavioral stability). By understanding what drives instability in each dimension, we can develop targeted strategies to improve the consistency that matters most.

Structural Stability

Trees can differ structurally at three nested levels, progressing from minor threshold variations to completely different variable selections. Understanding these levels helps diagnose why tree models may be unstable and suggests targeted solutions.

  1. Level 1: Threshold variability. Same variables, same structure, different thresholds. Trees use identical features in identical positions but split at different values. This is the mildest and most common form of instability.

    Example: One tree splits at age > 30 while another splits at age > 32, with otherwise identical structure.

    Root cause: Flatness of impurity functions near optimal splits. When multiple threshold values perform similarly, small changes in training data can tip the selection.

    Measurement: Variance of thresholds at corresponding nodes across bootstrap samples.

    Solutions: Require high-quality splits (e.g., a minimum Gini improvement criterion), search over fewer candidate thresholds (e.g., quartiles), or use more robust split criteria.
  2. Level 2: Topology variability. Same variables, different structure. Trees select the same features but arrange them differently—one tree splits age then income, and another splits income then age.

    Example: Both trees use {age, income, education}, but Tree A splits age→income→education while Tree B splits income→age→education.

    Root cause: Multiple variable orderings achieve similar impurity reduction when features have comparable discriminative power.

    Measurement:
      1. Tree edit distance: Normalized by tree size
      2. Parent-child pair similarity: Count the parent-child relationships that differ, |{(parent, child) ∈ T₁} Δ {(parent, child) ∈ T₂}| where Δ is the symmetric difference, and divide by the total number of parent-child pairs in the larger tree.
      3. Path-based similarity: Proportion of root-to-leaf paths that use the same sequence of variables (Miglio & Soffritti, 2004)
      4. Subtree overlap: Proportion of shared subtree structures across trees.

    Solutions: As at Level 1, require high-quality splits (e.g., split-level cross-validation such as splitCV) and search over fewer candidate values (e.g., quartiles).
  3. Level 3: Feature variability. Different variables. Trees select entirely different features. Note that this level inherently includes structural instability, since different features necessitate different tree structures.

    Example: Tree A uses {age, income, education} while Tree B uses {location, occupation, credit_score}.

    Root cause: Many correlated predictors, or a low signal-to-noise ratio.

    Measurement:
        1. Jaccard similarity: |F₁ ∩ F₂| / |F₁ ∪ F₂| (Kalousis et al. 2007)
        2. Kuncheva index: corrects for chance overlap (Kuncheva 2007)
        3. Nogueira et al. (2018) formalization.

    Solutions: A two-stage approach, in which variables are first selected by the Lasso or by stability selection and the tree is then grown on that subset, could help.
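To make two of the measurements above concrete, here is a rough sketch, assuming scikit-learn trees and synthetic data: the variance of the root split threshold across bootstrap refits (Level 1) and the mean pairwise Jaccard similarity of the selected feature sets (Level 3):

```python
# Sketch: two structural-stability diagnostics across bootstrap refits.
# Dataset, depth, and the number of refits are illustrative choices.
from collections import defaultdict
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=1000, n_features=8, random_state=1)

root_splits, feature_sets = [], []
for _ in range(30):
    idx = rng.choice(len(X), size=len(X), replace=True)   # bootstrap sample
    t = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X[idx], y[idx])
    # Node 0 is the root; tree_.feature uses negative codes for leaves.
    root_splits.append((int(t.tree_.feature[0]), float(t.tree_.threshold[0])))
    feature_sets.append(set(int(f) for f in t.tree_.feature if f >= 0))

# Level 1: threshold variance, conditional on the same root variable.
by_var = defaultdict(list)
for var, thr in root_splits:
    by_var[var].append(thr)
threshold_var = {v: float(np.var(thrs)) for v, thrs in by_var.items()}

# Level 3: mean pairwise Jaccard similarity of selected feature sets.
def jaccard(a, b):
    return len(a & b) / len(a | b)

n = len(feature_sets)
mean_jaccard = float(np.mean([jaccard(feature_sets[i], feature_sets[j])
                              for i in range(n) for j in range(i + 1, n)]))
print(threshold_var, round(mean_jaccard, 3))
```

Matching nodes deeper than the root across refits is harder (topologies drift), which is why the threshold diagnostic here is restricted to the root.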

Some people use proxy measures such as variable importance stability to gauge interpretive stability. But the correlation of importance rankings conflates feature and structural stability, since importance depends both on how often a variable is selected and on where it sits in the tree.

Partition Stability: Do Trees Partition The Space Similarly?

Trees may use different features yet create similar partitions. This captures whether trees make similar distinctions between instances regardless of the explanatory mechanism used.

When we say "partition stability," we could mean two different things:

  1. Final partition similarity: Do trees create similar end-state groupings of data points?
  2. Hierarchical separation similarity: Do trees agree on which pairs of data points are "easy to separate" (separate early in the decision process) versus "hard to separate" (remain together deep into the tree)?

The Adjusted Rand Index (ARI) and the Fowlkes-Mallows index can be applied to trees by comparing their final leaf assignments. But two trees can have identical final partitions while disagreeing fundamentally about which point pairs are inherently similar (and should separate late) versus dissimilar (and should separate early). For that, cophenetic correlation may be a better measure.

For a decision tree $T$ and dataset $X = \{x_1, x_2, \ldots, x_n\}$, we define the cophenetic distance between two data points as:

$d_T(x_i, x_j) = \text{depth of the deepest node where } x_i \text{ and } x_j \text{ follow the same path}$

This measures how "late" in the decision process two points separate. Points that separate early (near the root) are very different according to the tree. Points that separate late (near leaves) are similar.

The correlation between the cophenetic matrices of two trees measures their agreement on instance-pair similarity:

$$\rho_{\text{coph}}(T_1, T_2) = \operatorname{Corr}\bigl(\text{upper triangle of } C_{T_1},\ \text{upper triangle of } C_{T_2}\bigr)$$

where $C_T$ is the matrix of pairwise cophenetic distances $d_T(x_i, x_j)$.
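A minimal sketch of this computation, assuming scikit-learn trees: `decision_path` recovers each point's root-to-leaf path, the shared prefix of two paths gives the depth of the deepest common node, and the two cophenetic matrices are then correlated over their upper triangles. Dataset and hyperparameters are illustrative:

```python
# Sketch: cophenetic correlation between two decision trees.
# d_T(x_i, x_j) = depth of the deepest node both points pass through
# = (number of shared nodes on their root-to-leaf paths) - 1, root at depth 0.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=2)
t1 = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X[:200], y[:200])
t2 = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X[200:], y[200:])

def cophenetic_matrix(tree, X):
    # decision_path gives, per point, an indicator over the nodes it visits.
    paths = tree.decision_path(X).toarray().astype(bool)
    n = len(X)
    C = np.zeros((n, n))
    for i in range(n):
        # Shared nodes of two root-to-leaf paths form a prefix from the root;
        # its size minus one is the depth of the deepest common node.
        C[i] = (paths[i] & paths).sum(axis=1) - 1
    return C

C1, C2 = cophenetic_matrix(t1, X), cophenetic_matrix(t2, X)
iu = np.triu_indices(len(X), k=1)                 # upper triangles only
rho = float(np.corrcoef(C1[iu], C2[iu])[0, 1])
print(round(rho, 3))
```

Note the evaluation set here is shared by both trees; the cophenetic matrices are only comparable when computed on the same points.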

Predictive Stability: Do Trees Make Similar Predictions?

Regardless of how trees partition the space or which features they use, do they produce the same outputs for the same inputs?

Measurement:

  • Regression: MSE between predictions: (1/n)Σᵢ(T₁(xᵢ) - T₂(xᵢ))²
  • Classification: Agreement rate: (1/n)Σᵢ𝟙[T₁(xᵢ) = T₂(xᵢ)]
  • Correlation: Pearson correlation between prediction vectors
  • Turney's Agreement Metric:
    $\text{Agreement}(T_1, T_2) = \Pr_{x \sim D}[T_1(x) = T_2(x)]$

    where $D$ is the data distribution and $x$ represents a randomly drawn instance.

    Empirical estimation: For a test dataset $\{x_1, x_2, \ldots, x_m\}$:

    $\hat{\text{Agreement}}(T_1, T_2) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}[T_1(x_i) = T_2(x_i)]$

    where $\mathbf{1}[\cdot]$ is the indicator function.
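A sketch of the empirical estimator, assuming a held-out sample stands in for draws from $D$ (data and models are synthetic placeholders):

```python
# Sketch: empirical estimate of Turney's agreement metric for two trees
# trained on disjoint samples, evaluated on a held-out test set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, n_features=8, random_state=3)
t1 = DecisionTreeClassifier(random_state=0).fit(X[:500], y[:500])
t2 = DecisionTreeClassifier(random_state=0).fit(X[500:1000], y[500:1000])

X_test = X[1000:]                              # plays the role of x ~ D
agreement = float(np.mean(t1.predict(X_test) == t2.predict(X_test)))
print(round(agreement, 3))
```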

The idea of randomly drawn vectors can be reused elsewhere, e.g., for partition stability, to probe whether trees have learned similar spatial organization principles beyond the observed data. (One caveat: random vectors may emphasize irrelevant regions of the input space.)
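One way to sketch this probing idea, under the assumption that uniform sampling over each feature's observed range is an acceptable probe distribution (it may not be, given the caveat about irrelevant regions):

```python
# Sketch: agreement measured on random probe vectors instead of held-out
# data, to ask whether two trees organize the input space similarly
# beyond the sample. The probe distribution is a simplifying assumption.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X, y = make_classification(n_samples=1000, n_features=6, random_state=4)
t1 = DecisionTreeClassifier(random_state=0).fit(X[:500], y[:500])
t2 = DecisionTreeClassifier(random_state=0).fit(X[500:], y[500:])

# Uniform probes over each feature's observed range; note these may
# over-weight regions where no real data lives.
lo, hi = X.min(axis=0), X.max(axis=0)
X_rand = rng.uniform(lo, hi, size=(5000, X.shape[1]))
rand_agreement = float(np.mean(t1.predict(X_rand) == t2.predict(X_rand)))
print(round(rand_agreement, 3))
```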

See also: https://www.gojiberries.io/better-split-decisions-more-stable-cart-by-using-subsampled-node-level-cv/
