Separable Weighted Leaf-Collision Proximities

Updated 13 January 2026
  • The paper introduces SWLCP as a novel approach that leverages tree ensemble leaf-collision structures to define supervised similarity measures.
  • It employs separable, sample-local weighting and sparse matrix factorization to reduce the computational cost from quadratic to near-linear scaling.
  • Empirical results demonstrate that SWLCPs achieve significant improvements in runtime and memory usage on large-scale datasets.

Separable Weighted Leaf-Collision Proximities (SWLCPs) constitute a mathematically rigorous family of supervised similarity measures defined via the leaf co-occurrence structure in tree ensembles such as Random Forests and Gradient Boosted Trees. SWLCPs generalize the notion that tree ensembles induce proximities based on the extent to which sample pairs collide, i.e., are assigned to the same leaf, with collisions modulated through separable, sample-local weighting schemes. The SWLCP structure enables exact, scalable computation by leveraging sparse matrix factorization, circumventing the quadratic time or memory complexities inherent to traditional explicit pairwise proximity formulations (Aumon et al., 6 Jan 2026).

1. Formal Framework and Definition

Let $N$ denote the number of samples and $\mathcal{T} = \{1, \ldots, T\}$ the set of $T$ trees in the ensemble. For a sample $x_i$ and tree $t$, denote by $\ell_i(t)$ the index of the unique leaf of tree $t$ containing $x_i$. The Weighted Leaf-Collision Proximity (WLCP) matrix $P \in \mathbb{R}^{N \times N}$ is defined as

$$P_{ij} = \sum_{t=1}^{T} w_{ijt} \cdot I\big(\ell_i(t) = \ell_j(t)\big),$$

where $w_{ijt}$ is a collision weight and $I$ is the indicator function. A WLCP is termed separable if

$$w_{ijt} = q_{it} \cdot w_{jt},$$

for nonnegative, sample-local vectors $q_{\cdot t}, w_{\cdot t} \in \mathbb{R}^N$, where $q_{it}$ and $w_{jt}$ are specific to individual samples and trees but independent of the paired sample index.

Common proximities, including those underlying classical Random Forests, are instances of the separable form. For example, setting $q_{it} = 1/T$ and $w_{jt} = 1$ recovers the original Random Forest (RF) proximity.
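As a concrete sketch of the definition, the following NumPy fragment (using small, hypothetical leaf assignments) evaluates $P_{ij} = \sum_t q_{it}\, w_{jt}\, I(\ell_i(t) = \ell_j(t))$ directly with the classical RF weights:

```python
import numpy as np

# Hypothetical leaf assignments: leaf_ids[i, t] is the leaf of sample i in tree t.
leaf_ids = np.array([[0, 0, 1],
                     [0, 1, 1],
                     [1, 0, 0],
                     [1, 1, 0]])
N, T = leaf_ids.shape

# Separable RF weights: q_it = 1/T, w_jt = 1.
q = np.full((N, T), 1.0 / T)
w = np.ones((N, T))

# Direct evaluation of the WLCP definition, one tree at a time.
P = np.zeros((N, N))
for t in range(T):
    same_leaf = leaf_ids[:, t][:, None] == leaf_ids[None, :, t]
    P += np.outer(q[:, t], w[:, t]) * same_leaf
```

With $q_{it} = 1/T$ and $w_{jt} = 1$, each entry $P_{ij}$ is the fraction of trees in which samples $i$ and $j$ share a leaf, and the diagonal is identically $1$.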

2. Sparse Matrix Factorization

The defining property of SWLCPs is that they admit an exact sparse matrix factorization, which enables scalable computation. Denote by $\mathcal{L}$ the set of all leaf nodes across all trees, with $K = |\mathcal{L}|$. Construct sparse matrices $Q, W \in \mathbb{R}^{N \times K}$ as follows:

  • For each leaf $k \in \mathcal{L}$, associate a unique column.
  • For sample $i$ and leaf $k$, set $Q_{ik} = q_{i, t(k)}\, I(\ell_i(t(k)) = k)$ and $W_{jk} = w_{j, t(k)}\, I(\ell_j(t(k)) = k)$, where $t(k)$ identifies the tree containing leaf $k$.

With these definitions, the SWLCP matrix factorizes as

$$P = Q W^\top,$$

with each row of $Q$ and $W$ containing at most one nonzero per tree (and thus at most $T$ nonzeros per row). This formulation restricts computation to leaf-level collisions and avoids the $\mathcal{O}(N^2)$ cost of explicit pairwise comparisons.
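The factorization identity can be sanity-checked on toy data: the sketch below (all data hypothetical; dense arrays stand in for $Q$ and $W$ for clarity) builds the matrices per the rules above and compares $Q W^\top$ against the explicit pairwise sum:

```python
import numpy as np

# Hypothetical inputs: leaf assignments and separable RF weights (q = 1/T, w = 1).
leaf_ids = np.array([[0, 1],
                     [0, 0],
                     [1, 1]])          # leaf_ids[i, t]: leaf of sample i in tree t
N, T = leaf_ids.shape
q = np.full((N, T), 1.0 / T)
w = np.ones((N, T))

# Map each (tree, leaf) pair to a unique column index.
cols = {tl: k for k, tl in enumerate(
    sorted({(t, leaf_ids[i, t]) for i in range(N) for t in range(T)}))}
K = len(cols)

# One nonzero per tree per row, as stated above.
Q = np.zeros((N, K)); W = np.zeros((N, K))
for i in range(N):
    for t in range(T):
        k = cols[(t, leaf_ids[i, t])]
        Q[i, k] = q[i, t]
        W[i, k] = w[i, t]
P_fact = Q @ W.T

# Explicit pairwise sum for comparison.
P_direct = np.zeros((N, N))
for t in range(T):
    same = leaf_ids[:, t][:, None] == leaf_ids[None, :, t]
    P_direct += np.outer(q[:, t], w[:, t]) * same

assert np.allclose(P_fact, P_direct)
```

The agreement holds for any separable weighting, since $ (Q W^\top)_{ij} = \sum_k Q_{ik} W_{jk}$ collapses to a sum over exactly the leaves where $i$ and $j$ collide.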

| Proximity variant | $q_{i,t}$ | $w_{j,t}$ |
| --- | --- | --- |
| RF (original) | $1/T$ | $1$ |
| RF-GAP | $I[t \in \mathrm{OOB}(i)] / \lvert \mathrm{OOB}(i) \rvert$ | $c_j(t) / \lvert M_j(t) \rvert$ |
| GBT | $w_t / \sum_k w_k$ | $1$ |

3. Computational Workflow and Pseudocode

The scalable computation of SWLCPs proceeds through the following steps:

  1. Leaf-to-Column Mapping: Enumerate all unique leaves across all trees and map each to a unique column index, yielding $K$ columns.
  2. Sparse Matrix Construction: Build index and data arrays for $Q$ and $W$, with nonzero entries determined by the sample-leaf assignments and the respective $q_{it}$, $w_{jt}$.
  3. Sparse Matrix Multiplication: Form $Q$ and $W$ as CSR/CSC matrices and compute $P = Q W^\top$ using optimized sparse linear algebra routines.

A high-level pseudocode summary is as follows:

1. For each tree t and each leaf in t, map (t, leaf_id) → column index.
2. For each sample i and tree t:
   - Assign q[i, t] to Q[i, k], and w[i, t] to W[i, k] where k = mapping[(t, leaf_id)].
3. Construct Q and W as sparse matrices.
4. Compute P = Q.dot(W.T).

By construction, $\mathrm{nnz}(Q) = \mathrm{nnz}(W) = N \cdot T$.

4. Computational Complexity Analysis

The classical explicit pairwise approach, evaluating all $N^2$ sample pairs per tree, incurs $\mathcal{O}(T N^2)$ computational cost and $\mathcal{O}(N^2)$ memory. The sparse factorization method introduced for SWLCPs reduces this to near-linear in $N$:

  • Tree traversal and leaf encoding: $\mathcal{O}(T N h)$, where $h$ is the average tree height.
  • Sparse matrix assembly: $\mathcal{O}(N T)$.
  • Sparse multiplication: each row $Q[i,:]$ has $T$ nonzeros; each row $P[i,:]$ contains at most $T \ell$ nonzeros (for average leaf size $\ell$), leading to $\mathcal{O}(N T \ell)$.
  • Total time complexity: $\mathcal{O}(T N h + N T \ell)$.

Memory usage is dominated by the storage of $Q$, $W$, and sparse $P$, with $\mathcal{O}(N T)$ and $\mathcal{O}(N T \ell)$ nonzeros, respectively, as opposed to $\mathcal{O}(N^2)$ for dense approaches.
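For intuition, a back-of-envelope estimator contrasts the sparse storage implied by these nonzero counts with a dense $N \times N$ matrix. The per-entry byte costs and the example parameter values below are illustrative assumptions, not figures from the paper:

```python
def sparse_memory_bytes(N, T, avg_leaf_size, bytes_per_entry=12):
    """Approximate bytes for Q and W (N*T nonzeros each) plus sparse P
    (~N*T*ell nonzeros), at an assumed cost per stored nonzero."""
    nnz_QW = 2 * N * T
    nnz_P = N * T * avg_leaf_size
    return (nnz_QW + nnz_P) * bytes_per_entry

def dense_memory_bytes(N, bytes_per_value=8):
    """Bytes for an explicit dense N x N proximity matrix (float64)."""
    return N * N * bytes_per_value

# Illustrative: N=70_000, T=100, ell=5
# -> sparse ~0.59 GB versus dense ~39.2 GB.
```

The gap widens linearly with $N$: doubling the sample count doubles the sparse footprint but quadruples the dense one.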

5. Illustrative Construction: Toy Example

Consider $N = 3$ samples and $T = 2$ trees with leaves $\{A, B\}$ and $\{C, D\}$. Assigning samples as $(A,C)$, $(B,C)$, and $(A,D)$, and using the RF proximity ($q_{i,t} = 1/2$, $w_{j,t} = 1$), the $4$ columns of $Q$ and $W$ encode sample-leaf memberships. The resulting $P = Q W^\top$ is:

$$P = \begin{pmatrix} 1 & 0.5 & 0.5 \\ 0.5 & 1 & 0 \\ 0.5 & 0 & 1 \end{pmatrix}$$

This corresponds to the fraction of trees for which sample pairs share a leaf.
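The toy construction can be reproduced with SciPy sparse matrices (the column ordering A, B, C, D is an assumption of this sketch):

```python
import numpy as np
import scipy.sparse as sp

# Sample-leaf memberships: s0 -> (A, C), s1 -> (B, C), s2 -> (A, D),
# with columns A=0, B=1, C=2, D=3. Each (sample, tree) pair contributes
# one nonzero: q = 1/2 in Q, w = 1 in W.
rows = [0, 1, 2, 0, 1, 2]
cols = [0, 1, 0, 2, 2, 3]
Q = sp.csr_matrix(([0.5] * 6, (rows, cols)), shape=(3, 4))
W = sp.csr_matrix(([1.0] * 6, (rows, cols)), shape=(3, 4))

P = Q.dot(W.T).toarray()
# P == [[1. , 0.5, 0.5],
#       [0.5, 1. , 0. ],
#       [0.5, 0. , 1. ]]
```

Each off-diagonal entry is $0.5$ times the number of shared leaves, i.e., the fraction of the two trees in which the pair collides.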

6. Practical Implementation Using Python

Efficient construction of SWLCPs leverages NumPy and SciPy's sparse matrix routines. The primary operations consist of assigning sample-leaf-weight triples to sparse index/value arrays, constructing $Q$ and $W$ as CSR matrices, and computing their product. The following fragment demonstrates core code elements:

import numpy as np
import scipy.sparse as sp

# Assumed inputs:
#   leaf_ids:  (N, T) integer array; leaf_ids[i, t] is the leaf of sample i in tree t
#   q_weights: (N, T) array of separable weights q_{it}
#   w_weights: (N, T) array of separable weights w_{jt}
N, T = leaf_ids.shape

# Map each (tree, leaf) pair to a unique column index.
unique_leaves = {}
col = 0
for t in range(T):
    for leaf in np.unique(leaf_ids[:, t]):
        unique_leaves[(t, leaf)] = col
        col += 1
K = col

# Collect one nonzero per (sample, tree) pair for each of Q and W.
rows_Q, cols_Q, data_Q = [], [], []
rows_W, cols_W, data_W = [], [], []
for i in range(N):
    for t in range(T):
        k = unique_leaves[(t, leaf_ids[i, t])]
        rows_Q.append(i); cols_Q.append(k); data_Q.append(q_weights[i, t])
        rows_W.append(i); cols_W.append(k); data_W.append(w_weights[i, t])

Q = sp.csr_matrix((data_Q, (rows_Q, cols_Q)), shape=(N, K))
W = sp.csr_matrix((data_W, (rows_W, cols_W)), shape=(N, K))

# Sparse product; P_sparse stays sparse, with at most T*ell nonzeros per row.
P_sparse = Q.dot(W.T)

This organization ensures that only a linear number of nonzeros ($N T$) are handled in the sparse arrays and products, and dense $N \times N$ structures are never explicitly formed unless required.
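A possible end-to-end usage sketch, assuming scikit-learn is available (its `apply` method returns per-tree leaf indices in exactly the `(N, T)` layout the fragment above expects) and using the classical RF weighting:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data and a small fitted forest.
X, y = make_classification(n_samples=200, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Leaf index of every sample in every tree: shape (N, T).
leaf_ids = rf.apply(X)
N, T = leaf_ids.shape

# Classical RF proximity weights: q_it = 1/T, w_jt = 1.
q_weights = np.full((N, T), 1.0 / T)
w_weights = np.ones((N, T))
# leaf_ids, q_weights, w_weights now feed the sparse construction above.
```

Swapping `q_weights` and `w_weights` for the RF-GAP or GBT entries of the table in Section 2 yields the other proximity variants without changing the construction.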

7. Empirical Performance and Scalability

Empirical evaluation on the Fashion-MNIST dataset ($N \leq 70{,}000$) demonstrates the practical impact of the SWLCP factorization strategy (Aumon et al., 6 Jan 2026):

  • Runtime: Traditional pairwise computation scales quadratically, exceeding 20 minutes for $N = 70{,}000$. The SWLCP method achieves near-linear scaling (empirical exponent $m \approx 1.1$), requiring $\approx 3$ seconds at the same $N$.
  • Memory: The explicit approach exhausts memory at $N \approx 50{,}000$–$60{,}000$; the SWLCP approach remains below 4 GB for $N = 70{,}000$, scaling linearly due to $\mathcal{O}(N T \ell)$ storage.

This confirms that restricting to leaf-level collisions and utilizing separable weighting reduces real-world computational cost from intractable $\mathcal{O}(N^2)$ to essentially linear in $N$.


For detailed derivations, algorithmic improvements, and further results, see "Scalable Tree Ensemble Proximities in Python" (Aumon et al., 6 Jan 2026).

References

1. Aumon et al. "Scalable Tree Ensemble Proximities in Python." 6 January 2026.
