Measuring Tree Similarity
What is the expected similarity, E[Sim(T₁, T₂)], where T₁ and T₂ are trees trained on perturbed (or bootstrapped) versions of the same dataset? Tree stability under data perturbation can be decomposed into interpretive stability (how trees explain decisions) and behavioral stability (what trees predict).
Interpretive Stability
Trees can differ at three nested levels, from threshold variations to completely different variables:
- Level 1: Threshold Instability. Same variables, same structure, different thresholds. Trees use identical features in identical positions but split at different values (e.g., age > 30 vs. age > 32); this tends to be very common.
Root cause: Impurity functions are often nearly flat around the optimal split, so small perturbations of the data move the minimizing threshold.
Measurement: Variance of thresholds at corresponding nodes across bootstrap samples.
Solutions: Only make high-quality splits (Loh, 2009; Hothorn et al., 2006) and search over fewer candidate thresholds (e.g., quartiles).
- Level 2: Structural Instability. Same variables, different structure. Trees select the same features but arrange them differently: one tree splits age then income, another splits income then age.
Root Cause: This occurs when multiple hierarchies achieve similar impurity reduction.
Measurement: Tree edit distance (normalized by tree size) between T₁ and T₂, restricted to trees using the same feature set. Alternatively, count the parent-child relationships that differ: |{(parent, child) ∈ T₁} Δ {(parent, child) ∈ T₂}|, where Δ is the symmetric difference. Miglio & Soffritti (2004) proposed graph-theoretic proximity measures for tree comparison; the proportion of shared paths (in terms of variable names) and the proportion of shared subtrees are two examples.
Solutions: Only make high-quality splits (Loh, 2009; Hothorn et al., 2006), search over fewer candidate thresholds (e.g., quartiles), use mixed-integer programming to find a provably optimal tree (Bertsimas & Dunn, 2017), or make better split decisions via splitCV.
- Level 3: Feature Instability. Different variables. Trees select fundamentally different features. This is common in high-dimensional settings with correlated predictors or weak signals.
Measurement:
- Jaccard similarity: |F₁ ∩ F₂| / |F₁ ∪ F₂|, where Fₖ is the set of features used by tree Tₖ (Kalousis et al., 2007)
- Kuncheva index: corrects the overlap for chance agreement (Kuncheva, 2007)
- The Nogueira et al. (2018) formalization, which places feature-selection stability measures in a common framework
Solutions: Two-stage procedures in which a first stage selects variables (e.g., via the Lasso) and the tree is then grown on the selected set, among others.
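As a concrete sketch of the Level 1 and Level 3 measurements above (illustrative only; assumes scikit-learn and NumPy, and all function names are ours), the following fits trees on bootstrap resamples, takes the variance of the root threshold among trees whose root splits on the modal root feature, and averages pairwise Jaccard similarity of the selected feature sets:

```python
import numpy as np
from itertools import combinations
from sklearn.tree import DecisionTreeClassifier

def fit_bootstrap_trees(X, y, n_trees=30, max_depth=3, seed=0):
    """Fit one tree per bootstrap resample of (X, y)."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.choice(len(X), size=len(X), replace=True)
        trees.append(DecisionTreeClassifier(max_depth=max_depth,
                                            random_state=0).fit(X[idx], y[idx]))
    return trees

def root_threshold_variance(trees):
    """Level 1: variance of the root threshold among trees sharing the
    modal root feature (a simple stand-in for 'corresponding nodes')."""
    feats = [t.tree_.feature[0] for t in trees]
    modal = max(set(feats), key=feats.count)
    thr = [t.tree_.threshold[0] for t in trees if t.tree_.feature[0] == modal]
    return float(np.var(thr))

def mean_jaccard(trees):
    """Level 3: average pairwise Jaccard similarity of the feature sets
    used at internal nodes (leaves have feature index -2 in sklearn)."""
    sets = [set(t.tree_.feature[t.tree_.feature >= 0]) for t in trees]
    sims = [len(a & b) / len(a | b) for a, b in combinations(sets, 2)]
    return float(np.mean(sims))
```

Restricting the threshold comparison to root nodes sidesteps the harder problem of matching "corresponding" internal nodes across differently shaped trees.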
A common proxy for interpretive stability is variable-importance stability. But correlating importance rankings conflates feature and structural stability, since importance depends on both how often a variable is selected and where it sits in the tree.
Beyond interpretive stability, we can also measure partition stability and predictive stability.
Partition Stability: Do trees partition the space similarly?
Trees may use different features yet create similar partitions. This captures whether trees make similar distinctions between instances regardless of the explanatory mechanism.
Measurement: Methods from clustering, such as the Adjusted Rand Index (Hubert & Arabie, 1985) or the Fowlkes-Mallows index for hierarchical agreement, are flawed here: they compare flat partitions (or a single cut of the hierarchy) and discard the rest of the tree's hierarchical structure. Cophenetic correlation may be a better measure. Here's the definition:
For a decision tree T and dataset X = {x₁, ..., xₙ}, define the cophenetic distance:
d_T(xᵢ, xⱼ) = depth of the least common ancestor of the leaves containing xᵢ and xⱼ in T
This creates an n×n cophenetic matrix C_T where C_T[i,j] = d_T(xᵢ, xⱼ). For two trees T₁ and T₂ on the same data:
ρ_coph(T₁, T₂) = Corr(vech(C_T₁), vech(C_T₂))
where vech() extracts the upper triangle (excluding diagonal) as a vector. Formally:
ρ_coph = Σᵢ<ⱼ (C_T₁[i,j] - C̄_T₁)(C_T₂[i,j] - C̄_T₂) / √[Σᵢ<ⱼ (C_T₁[i,j] - C̄_T₁)² · Σᵢ<ⱼ (C_T₂[i,j] - C̄_T₂)²]
Properties:
- ρ_coph ∈ [-1, 1], with 1 indicating an identical hierarchical structure
- Invariant to positive affine transformations of depth (Pearson correlation is not invariant to arbitrary monotonic transformations; use a rank-based variant for that)
- Captures "easy vs hard to separate" instance pairs across trees
- Computational complexity: O(n²) for n instances
Normalized variant: To compare trees of different depths, use: d̃_T(xᵢ, xⱼ) = d_T(xᵢ, xⱼ) / depth(T)
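A minimal sketch of C_T and ρ_coph for scikit-learn trees (assumes NumPy and scikit-learn; function names are ours). It uses the fact that two samples' root-to-leaf paths share exactly the nodes down to their least common ancestor, so a dot product of path-indicator vectors counts those shared nodes:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cophenetic_matrix(tree, X):
    """n x n matrix of LCA depths for a fitted sklearn tree.

    decision_path gives a 0/1 indicator of the nodes each sample visits.
    Two root-to-leaf paths intersect exactly in the root-to-LCA segment,
    so the dot product counts depth(LCA) + 1 nodes (root has depth 0).
    """
    paths = tree.decision_path(X).toarray().astype(float)  # (n, n_nodes)
    return paths @ paths.T - 1.0

def cophenetic_correlation(t1, t2, X):
    """rho_coph: Pearson correlation of the off-diagonal upper triangles."""
    C1, C2 = cophenetic_matrix(t1, X), cophenetic_matrix(t2, X)
    iu = np.triu_indices(len(X), k=1)  # vech: upper triangle, no diagonal
    return float(np.corrcoef(C1[iu], C2[iu])[0, 1])
```

Materializing the dense n × n matrix makes the O(n²) cost explicit; for large n one would subsample instance pairs instead.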
Predictive Stability: Do trees make similar predictions?
How much do predictions vary across perturbations?
Measurement:
- Regression: MSE between predictions: (1/n)Σᵢ(T₁(xᵢ) - T₂(xᵢ))²
- Classification: Agreement rate: (1/n)Σᵢ𝟙[T₁(xᵢ) = T₂(xᵢ)]
- Correlation: Pearson correlation between prediction vectors
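The three predictive-stability measures above are a few lines each (illustrative sketch; assumes NumPy and two fitted models t1, t2 exposing a predict method):

```python
import numpy as np

def prediction_mse(t1, t2, X):
    """Regression: mean squared difference between the trees' predictions."""
    p1, p2 = t1.predict(X), t2.predict(X)
    return float(np.mean((p1 - p2) ** 2))

def agreement_rate(t1, t2, X):
    """Classification: fraction of instances on which the trees agree."""
    return float(np.mean(t1.predict(X) == t2.predict(X)))

def prediction_correlation(t1, t2, X):
    """Pearson correlation between the two prediction vectors."""
    return float(np.corrcoef(t1.predict(X), t2.predict(X))[0, 1])
```

Evaluating on a held-out X (rather than either tree's training resample) keeps the comparison from favoring one perturbation.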
References
Archer, K. J., & Kimes, R. V. (2008). Empirical characterization of random forest variable importance measures. Computational Statistics & Data Analysis, 52(4), 2249-2260.
Banerjee, M., Ding, Y., & Noone, A. M. (2012). Identifying representative trees from ensembles. Statistics in Medicine, 31(15), 1601-1616.
Bertsimas, D., & Dunn, J. (2017). Optimal classification trees. Machine Learning, 106(7), 1039-1082.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
Bühlmann, P., & Yu, B. (2002). Analyzing bagging. The Annals of Statistics, 30(4), 927-961.
Calle, M. L., & Urrea, V. (2011). Letter to the editor: Stability of random forest importance measures. Briefings in Bioinformatics, 12(1), 86-89.
Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., & Roth, A. (2015). Preserving statistical validity in adaptive data analysis. STOC, 117-126.
Elisseeff, A., & Pontil, M. (2003). Leave-one-out error and stability of learning algorithms with applications. NATO Science Series III, 190, 111-130.
Fisher, A., Rudin, C., & Dominici, F. (2019). All models are wrong, but many are useful: Learning a variable's importance by studying an entire class of prediction models simultaneously. JMLR, 20(177), 1-81.
He, Z., & Yu, W. (2010). Stable feature selection for biomarker discovery. Computational Biology and Chemistry, 34(4), 215-225.
Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651-674.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193-218.
Kalousis, A., Prados, J., & Hilario, M. (2007). Stability of feature selection algorithms: a study on high-dimensional spaces. Knowledge and Information Systems, 12(1), 95-116.
Kuncheva, L. I. (2007). A stability index for feature selection. Artificial Intelligence and Applications, 421-427.
Loh, W. Y. (2009). Improving the precision of classification trees. The Annals of Applied Statistics, 3(4), 1710-1737.
Loh, W. Y., & Shih, Y. S. (1997). Split selection methods for classification trees. Statistica Sinica, 7, 815-840.
Meinshausen, N., & Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society B, 72(4), 417-473.
Mentch, L., & Hooker, G. (2016). Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. JMLR, 17(1), 841-881.
Miglio, R., & Soffritti, G. (2004). The comparison between classification trees through proximity measures. Computational Statistics & Data Analysis, 45(3), 577-593.
Nogueira, S., Sechidis, K., & Brown, G. (2018). On the stability of feature selection algorithms. JMLR, 18(174), 1-54.
Philipp, M., Zeileis, A., & Strobl, C. (2016). A toolkit for stability assessment of tree-based learners. Proceedings of COMPSTAT, 315-325.
Questier, F., Put, R., Coomans, D., Walczak, B., & Heyden, Y. V. (2005). The use of CART and multivariate regression trees for supervised and unsupervised feature selection. Chemometrics and Intelligent Laboratory Systems, 76(1), 45-54.
Saeys, Y., Abeel, T., & Van de Peer, Y. (2008). Robust feature selection using ensemble feature selection techniques. ECML PKDD, 313-325.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.
Somol, P., & Novovičová, J. (2010). Evaluating stability and comparing output of feature selectors that optimize feature subset cardinality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(11), 1921-1939.
Strobl, C., Boulesteix, A. L., Kneib, T., Augustin, T., & Zeileis, A. (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9(1), 307.
Turney, P. (1995). Technical note: Bias and the quantification of stability. Machine Learning, 20(1), 23-33.
Yu, B. (2013). Stability. Bernoulli, 19(4), 1484-1500.