Separable Weighted Leaf-Collision Proximities
- The paper introduces SWLCP as a novel approach that leverages tree ensemble leaf-collision structures to define supervised similarity measures.
- It employs separable, sample-local weighting and sparse matrix factorization to reduce the computational cost from quadratic to near-linear scaling.
- Empirical results demonstrate that SWLCPs achieve significant improvements in runtime and memory usage on large-scale datasets.
Separable Weighted Leaf-Collision Proximities (SWLCPs) constitute a mathematically rigorous family of supervised similarity measures defined via the leaf co-occurrence structure in tree ensembles such as Random Forests and Gradient Boosted Trees. SWLCPs generalize the notion that tree ensembles induce proximities based on the extent to which sample pairs collide, i.e., are assigned to the same leaf, with collisions modulated through separable, sample-local weighting schemes. The SWLCP structure enables exact, scalable computation by leveraging sparse matrix factorization, circumventing the quadratic time or memory complexities inherent to traditional explicit pairwise proximity formulations (Aumon et al., 6 Jan 2026).
1. Formal Framework and Definition
Let $N$ denote the number of samples and $T$ the number of trees in the ensemble. For a sample $i$ and tree $t$, denote by $l_t(i)$ the index of the unique leaf of tree $t$ containing $i$. The Weighted Leaf-Collision Proximity (WLCP) matrix $P \in \mathbb{R}^{N \times N}$ is defined as

$$P_{ij} = \sum_{t=1}^{T} w_t(i, j)\, \mathbb{1}[l_t(i) = l_t(j)],$$

where $w_t(i, j)$ is a collision weight and $\mathbb{1}[\cdot]$ is the indicator function. A WLCP is termed separable if

$$w_t(i, j) = q_t(i)\, w_t(j)$$

for nonnegative, sample-local vectors $q$ and $w$, where $q_t(i)$ and $w_t(j)$ are specific to individual samples and trees but independent of the paired sample index.
Common proximities, including those underlying classical Random Forests, are instances of the separable form. For example, setting $q_t(i) = 1/T$ and $w_t(j) = 1$ recovers the original Random Forest (RF) proximity.
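As a quick sanity check of the separable form, the following minimal sketch (with hypothetical random leaf assignments) evaluates the RF instance $q_t(i) = 1/T$, $w_t(j) = 1$ directly from the double sum:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 6, 4
# Hypothetical leaf assignments: leaf_ids[i, t] = l_t(i), with 3 leaves per tree.
leaf_ids = rng.integers(0, 3, size=(N, T))

# RF proximity: q_t(i) = 1/T, w_t(j) = 1, summed over leaf collisions.
P = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        for t in range(T):
            if leaf_ids[i, t] == leaf_ids[j, t]:
                P[i, j] += (1.0 / T) * 1.0

assert np.allclose(P, P.T)           # separable RF weights yield a symmetric matrix
assert np.allclose(np.diag(P), 1.0)  # each sample collides with itself in every tree
```

This cubic triple loop is exactly the formulation the sparse factorization below avoids.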
2. Sparse Matrix Factorization
The defining property of SWLCPs is that they admit an exact sparse matrix factorization, which enables scalable computation. Denote by $\mathcal{L}$ the set of all leaf nodes across all trees, with $K = \lvert \mathcal{L} \rvert$. Construct sparse matrices $Q, W \in \mathbb{R}^{N \times K}$ as follows:
- For each leaf $k \in \mathcal{L}$, associate a unique column.
- For sample $i$ and leaf $k = l_t(i)$, set $Q_{ik} = q_t(i)$ and $W_{ik} = w_t(i)$, where $t$ identifies the tree containing leaf $k$.

With these definitions, the SWLCP matrix factorizes as

$$P = Q W^{\top},$$

with each row of $Q$ and $W$ containing at most one nonzero per tree (and thus at most $T$ nonzeros per row). This formulation restricts computation to leaf-level collisions and avoids the cost of explicit pairwise comparisons.
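Exactness of the factorization follows in one line: row $i$ of $Q$ has its tree-$t$ nonzero in the column of leaf $l_t(i)$, so

$$(Q W^{\top})_{ij} = \sum_{k=1}^{K} Q_{ik} W_{jk} = \sum_{t=1}^{T} q_t(i)\, w_t(j)\, \mathbb{1}[l_t(i) = l_t(j)] = P_{ij},$$

because $Q_{ik} W_{jk}$ is nonzero only when $k$ is simultaneously the tree-$t$ leaf of both $i$ and $j$.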
| Proximity variant | $q_t(i)$ | $w_t(j)$ |
|---|---|---|
| RF (original) | $1/T$ | $1$ |
| RF-GAP | $\mathbb{1}[i \in \mathrm{OOB}(t)] / \lvert S_i \rvert$ | $c_t(j) / \lvert M_t(j) \rvert$ |
| GBT | $1$ | $1$ |

Here $S_i$ denotes the set of trees for which sample $i$ is out of bag, $c_t(j)$ the in-bag multiplicity of $j$ in tree $t$, and $\lvert M_t(j) \rvert$ the in-bag size of the leaf containing $j$ in tree $t$, following the standard RF-GAP construction.
3. Computational Workflow and Pseudocode
The scalable computation of SWLCPs proceeds through the following steps:
- Leaf-to-Column Mapping: Enumerate all unique leaves across all trees and map each to a unique column index, yielding $K$ columns.
- Sparse Matrix Construction: Build index and data arrays for $Q$ and $W$, with nonzero entries determined by the sample-leaf assignments and the respective weights $q_t(i)$, $w_t(i)$.
- Sparse Matrix Multiplication: Form $Q$ and $W$ as CSR/CSC matrices and compute $P = Q W^{\top}$ using optimized sparse linear algebra routines.
A high-level pseudocode summary is as follows:
```
1. For each tree t and each leaf in t, map (t, leaf_id) → column index.
2. For each sample i and tree t:
   assign q[i, t] to Q[i, k] and w[i, t] to W[i, k], where k = mapping[(t, leaf_id)].
3. Construct Q and W as sparse matrices.
4. Compute P = Q.dot(W.T).
```
By construction, $P = Q W^{\top}$ equals the SWLCP matrix exactly.
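Steps 1 and 2 can also be carried out without a Python-level dictionary; the sketch below (with hypothetical random leaf assignments) uses `np.unique` with `return_inverse` to build the leaf-to-column mapping one tree at a time:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
N, T = 500, 20
leaf_ids = rng.integers(0, 16, size=(N, T))  # hypothetical (N, T) leaf assignments

cols, offset = [], 0
for t in range(T):
    # return_inverse maps each sample to a 0-based local leaf index for tree t
    leaves, local = np.unique(leaf_ids[:, t], return_inverse=True)
    cols.append(local + offset)  # shift to a global column range for this tree
    offset += len(leaves)

col_idx = np.concatenate(cols)
row_idx = np.tile(np.arange(N), T)  # matches the per-tree ordering of col_idx
M = sp.csr_matrix((np.ones(N * T), (row_idx, col_idx)), shape=(N, offset))

# RF instance: Q = M / T and W = M, so P = (1/T) * M @ M.T
P = (M / T) @ M.T
```

Because every sample occupies exactly one leaf per tree, each row of `M` has `T` ones, and the diagonal of `P` is identically one.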
4. Computational Complexity Analysis
The classical explicit pairwise approach, evaluating all sample pairs per tree, incurs $O(N^2 T)$ computational cost and $O(N^2)$ memory. The sparse factorization method introduced for SWLCPs reduces this to near-linear in $N$:
- Tree traversal and leaf encoding: $O(N T h)$, where $h$ is the average tree height.
- Sparse matrix assembly: $O(N T)$.
- Sparse multiplication: each row of $Q$ and $W$ has at most $T$ nonzeros, and each leaf contains $O(s)$ samples on average (for average leaf size $s$), leading to $O(N T s)$ operations.
- Total time complexity: $O(N T (h + s))$.

Memory usage is dominated by the storage of $Q$, $W$, and the sparse $P$, with $O(N T)$ and $O(N T s)$ nonzeros, respectively, as opposed to $O(N^2)$ for dense approaches.
5. Illustrative Construction: Toy Example
Consider $N = 3$ samples and $T = 2$ trees with leaves $\{A_1, A_2\}$ and $\{B_1, B_2\}$. Assigning samples as $l_1(1) = l_1(2) = A_1$, $l_1(3) = A_2$, and $l_2(1) = B_2$, $l_2(2) = l_2(3) = B_1$, and using the RF proximity ($q_t(i) = 1/T$, $w_t(j) = 1$), the $4$ columns of $Q$ and $W$ encode sample-leaf memberships. The resulting $P$ is:

$$P = \frac{1}{2}\begin{pmatrix} 2 & 1 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 2 \end{pmatrix}.$$
This corresponds to the fraction of trees for which sample pairs share a leaf.
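The toy construction can be verified numerically. The sketch below hard-codes one choice of assignments, in which samples 0 and 1 collide in the first tree and samples 1 and 2 collide in the second:

```python
import numpy as np
import scipy.sparse as sp

# Leaf assignments: rows are samples, columns are trees.
# Tree 0: samples 0 and 1 share leaf 0; tree 1: samples 1 and 2 share leaf 0.
leaf_ids = np.array([[0, 1],
                     [0, 0],
                     [1, 0]])
N, T = leaf_ids.shape

# Global columns: tree 0 occupies columns {0, 1}, tree 1 occupies {2, 3}.
cols = leaf_ids + np.array([0, 2])
rows = np.repeat(np.arange(N), T)
M = sp.csr_matrix((np.ones(N * T), (rows, cols.ravel())), shape=(N, 4))

# RF proximity: Q = M / T, W = M.
P = ((M / T) @ M.T).toarray()

# Off-diagonal entries are the fraction of trees in which the pair shares a leaf.
assert np.allclose(P, [[1.0, 0.5, 0.0],
                       [0.5, 1.0, 0.5],
                       [0.0, 0.5, 1.0]])
```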
6. Practical Implementation Using Python
Efficient construction of SWLCPs leverages NumPy and SciPy's sparse matrix routines. The primary operations consist of assigning sample-leaf-weight triples to sparse index/value arrays, constructing and as CSR matrices, and computing their product. The following fragment demonstrates core code elements:
```python
import numpy as np
import scipy.sparse as sp

N, T = leaf_ids.shape

# Map each (tree, leaf) pair to a unique global column index.
unique_leaves = {}
col = 0
for t in range(T):
    for leaf in np.unique(leaf_ids[:, t]):
        unique_leaves[(t, leaf)] = col
        col += 1
K = col

# Collect (row, column, value) triples for Q and W.
rows_Q, cols_Q, data_Q = [], [], []
rows_W, cols_W, data_W = [], [], []
for i in range(N):
    for t in range(T):
        leaf = leaf_ids[i, t]
        k = unique_leaves[(t, leaf)]
        rows_Q.append(i); cols_Q.append(k); data_Q.append(q_weights[i, t])
        rows_W.append(i); cols_W.append(k); data_W.append(w_weights[i, t])

Q = sp.csr_matrix((data_Q, (rows_Q, cols_Q)), shape=(N, K))
W = sp.csr_matrix((data_W, (rows_W, cols_W)), shape=(N, K))

P_sparse = Q.dot(W.T)
```
This organization ensures that only a linear number of nonzeros ($O(N T)$) is handled in the sparse arrays and products, and dense structures are never explicitly formed unless required.
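The fragment above assumes precomputed `leaf_ids`, `q_weights`, and `w_weights` arrays. With a fitted scikit-learn forest these can be obtained as follows (a sketch for the original RF weighting; `apply` returns the per-tree leaf index of each sample):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

leaf_ids = rf.apply(X)  # shape (N, T): l_t(i) for every sample and tree

N, T = leaf_ids.shape
# Original RF weighting: q_t(i) = 1/T, w_t(j) = 1.
q_weights = np.full((N, T), 1.0 / T)
w_weights = np.ones((N, T))
```

These three arrays feed directly into the construction loop above; other separable variants only change how `q_weights` and `w_weights` are filled.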
7. Empirical Performance and Scalability
Empirical evaluation on the Fashion-MNIST dataset ($70{,}000$ samples) demonstrates the practical impact of the SWLCP factorization strategy (Aumon et al., 6 Jan 2026):
- Runtime: Traditional pairwise computation scales quadratically, exceeding 20 minutes at the largest sample sizes tested. The SWLCP method achieves near-linear empirical scaling, requiring only seconds at the same scale.
- Memory: The explicit approach exhausts available memory well before the full dataset is processed; the SWLCP approach remains below 4 GB throughout, scaling linearly due to its $O(N T)$ sparse storage.
This confirms that restricting to leaf-level collisions and utilizing separable weighting reduces real-world computational cost from intractable to essentially linear in $N$.
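The scaling behavior can be probed with a small synthetic benchmark. The sketch below uses random leaf assignments rather than a fitted forest, so absolute times are not comparable to the Fashion-MNIST figures above; it only illustrates how runtime grows with $N$ under the sparse factorization:

```python
import time
import numpy as np
import scipy.sparse as sp

def swlcp_rf(leaf_ids):
    """Sparse-factorized RF proximity from an (N, T) leaf-index matrix."""
    N, T = leaf_ids.shape
    # Per-tree column offsets so every (tree, leaf) pair gets a global column.
    offsets = np.concatenate(([0], np.cumsum(leaf_ids.max(axis=0) + 1)))[:T]
    cols = (leaf_ids + offsets).ravel()
    rows = np.repeat(np.arange(N), T)
    M = sp.csr_matrix((np.ones(N * T), (rows, cols)), shape=(N, cols.max() + 1))
    return (M / T) @ M.T

rng = np.random.default_rng(0)
for N in (1000, 2000, 4000):
    leaf_ids = rng.integers(0, 64, size=(N, 100))
    t0 = time.perf_counter()
    P = swlcp_rf(leaf_ids)
    print(N, f"{time.perf_counter() - t0:.3f}s", P.nnz)
```

Doubling $N$ here roughly doubles both the runtime and the nonzero count of $P$, consistent with the near-linear complexity analysis of Section 4.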
For detailed derivations, algorithmic improvements, and further results, see "Scalable Tree Ensemble Proximities in Python" (Aumon et al., 6 Jan 2026).