Separable Weighted Leaf-Collision Proximities

Updated 13 January 2026

The paper introduces SWLCP as a novel approach that leverages tree ensemble leaf-collision structures to define supervised similarity measures.
It employs separable, sample-local weighting and sparse matrix factorization to reduce the computational cost from quadratic to near-linear scaling.
Empirical results demonstrate that SWLCPs achieve significant improvements in runtime and memory usage on large-scale datasets.

Separable Weighted Leaf-Collision Proximities (SWLCPs) constitute a mathematically rigorous family of supervised similarity measures defined via the leaf co-occurrence structure in tree ensembles such as Random Forests and Gradient Boosted Trees. SWLCPs generalize the notion that tree ensembles induce proximities based on the extent to which sample pairs collide, i.e., are assigned to the same leaf, with collisions modulated through separable, sample-local weighting schemes. The SWLCP structure enables exact, scalable computation by leveraging sparse matrix factorization, circumventing the quadratic time or memory complexities inherent to traditional explicit pairwise proximity formulations (Aumon et al., 6 Jan 2026).

1. Formal Framework and Definition

Let $N$ denote the number of samples and $\mathcal{T} = \{1, \ldots, T\}$ the set of $T$ trees in the ensemble. For a sample $x_i$ and tree $t$ , denote by $\ell_i(t)$ the index of the unique leaf of tree $t$ containing $x_i$ . The Weighted Leaf-Collision Proximity (WLCP) matrix $P \in \mathbb{R}^{N \times N}$ is defined as

$P_{ij} = \sum_{t=1}^T w_{ijt} \cdot I(\ell_i(t) = \ell_j(t)),$

where $w_{ijt}$ is a collision weight and $I$ is the indicator function. A WLCP is termed separable if

$w_{ijt} = q_{it} \cdot w_{jt},$

for nonnegative, sample-local vectors $q_{\cdot t}, w_{\cdot t} \in \mathbb{R}^N$ , where $q_{it}$ and $w_{jt}$ are specific to individual samples and trees but independent of the paired sample index.

Common proximities, including those underlying classical Random Forests, are instances of the separable form. For example, setting $q_{it} = 1/T$ and $w_{jt} = 1$ recovers the original Random Forest (RF) proximity.

2. Sparse Matrix Factorization

The defining property of SWLCPs is that they admit an exact sparse matrix factorization, which enables scalable computation. Denote $\mathcal{L}$ as the set of all leaf nodes across all trees, with $K = |\mathcal{L}|$ . Construct sparse matrices $Q, W \in \mathbb{R}^{N \times K}$ as follows:

For each leaf $k \in \mathcal{L}$ , associate a unique column.
For sample $i$ and leaf $k$ , set $Q_{ik} = q_{i, t(k)} I(\ell_i(t(k)) = k)$ and $W_{jk} = w_{j, t(k)} I(\ell_j(t(k)) = k)$ , where $t(k)$ identifies the tree containing leaf $k$ .

With these definitions, the SWLCP matrix factorizes as

$P = Q W^\top,$

with each row of $Q$ and $W$ containing at most one nonzero per tree (and thus at most $T$ nonzeros per row). This formulation restricts computation to leaf-level collisions and avoids the $\mathcal{O}(N^2)$ cost of explicit pairwise comparisons.

Proximity variant	$q_{i,t}$	$w_{j,t}$
RF (original)	$1/T$	$1$
RF-GAP	$I[t \in \text{OOB}(i)]/\|\text{OOB}(i)\|$	$c_j(t)/\|M_j(t)\|$
GBT	$w_t/\sum_k w_k$	$1$

3. Computational Workflow and Pseudocode

The scalable computation of SWLCPs proceeds through the following steps:

Leaf-to-Column Mapping: Enumerate all unique leaves across all trees and map each to a unique column index, yielding $K$ columns.
Sparse Matrix Construction: Build index and data arrays for $Q$ and $W$ with nonzero entries determined by the sample-leaf assignments and the respective $q_{it}$ , $w_{jt}$ .
Sparse Matrix Multiplication: Form $Q$ and $W$ as CSR/CSC matrices and compute $P = QW^\top$ using optimized sparse linear algebra routines.

A high-level pseudocode summary is as follows:

1. For each tree t and each leaf in t, map (t, leaf_id) → column index.
2. For each sample i and tree t:
   - Assign q[i, t] to Q[i, k], and w[i, t] to W[i, k] where k = mapping[(t, leaf_id)].
3. Construct Q and W as sparse matrices.
4. Compute P = Q.dot(W.T).

By construction, $nnz(Q) = nnz(W) = N \cdot T$ .

4. Computational Complexity Analysis

The classical explicit pairwise approach evaluating all $N^2$ sample pairs per tree incurs $\mathcal{O}(T N^2)$ computational cost and $\mathcal{O}(N^2)$ memory. The sparse factorization method introduced for SWLCPs reduces this to near-linear in $N$ :

Tree traversal and leaf encoding: $\mathcal{O}(T N h)$ , where $h$ is average tree height.
Sparse matrix assembly: $\mathcal{O}(N T)$ .
Sparse multiplication: Each $Q[i,:]$ has $T$ nonzeros; each $P[i,:]$ contains at most $T \ell$ nonzeros (for average leaf size $\ell$ ), leading to $\mathcal{O}(N T \ell)$ .
Total time complexity: $\mathcal{O}(T N h + N T \ell)$ .

Memory usage is dominated by the storage of $Q$ , $W$ , and sparse $P$ , with $O(N T)$ and $O(N T \ell)$ nonzeros, respectively, as opposed to $O(N^2)$ for dense approaches.

5. Illustrative Construction: Toy Example

Consider $N=3$ samples and $T=2$ trees with leaves $\{A, B\}$ and $\{C, D\}$ . Assigning samples as $(A,C)$ , $(B,C)$ , and $(A,D)$ , and using the RF proximity ( $q_{i,t} = 1/2$ , $w_{j,t} = 1$ ), the $4$ columns of $Q$ and $W$ encode sample-leaf memberships. The resulting $P=Q W^\top$ is:

$P = \begin{pmatrix} 1 & 0.5 & 0.5 \ 0.5 & 1 & 0 \ 0.5 & 0 & 1 \end{pmatrix}$

This corresponds to the fraction of trees for which sample pairs share a leaf.

6. Practical Implementation Using Python

Efficient construction of SWLCPs leverages NumPy and SciPy's sparse matrix routines. The primary operations consist of assigning sample-leaf-weight triples to sparse index/value arrays, constructing $Q$ and $W$ as CSR matrices, and computing their product. The following fragment demonstrates core code elements:

import numpy as np
import scipy.sparse as sp

N, T = leaf_ids.shape
unique_leaves = {}
col = 0
for t in range(T):
    for leaf in np.unique(leaf_ids[:, t]):
        unique_leaves[(t, leaf)] = col
        col += 1
K = col
rows_Q, cols_Q, data_Q = [], [], []
rows_W, cols_W, data_W = [], [], []
for i in range(N):
    for t in range(T):
        leaf = leaf_ids[i, t]
        k = unique_leaves[(t, leaf)]
        rows_Q.append(i); cols_Q.append(k); data_Q.append(q_weights[i, t])
        rows_W.append(i); cols_W.append(k); data_W.append(w_weights[i, t])
Q = sp.csr_matrix((data_Q, (rows_Q, cols_Q)), shape=(N, K))
W = sp.csr_matrix((data_W, (rows_W, cols_W)), shape=(N, K))
P_sparse = Q.dot(W.T)

This organization ensures that only a linear number of nonzeros ( $N T$ ) are handled in sparse arrays and products, and dense $N \times N$ structures are never explicitly formed unless required.

7. Empirical Performance and Scalability

Empirical evaluation on the Fashion-MNIST dataset ( $N \leq 70{,}000$ ) demonstrates the practical impact of the SWLCP factorization strategy (Aumon et al., 6 Jan 2026):

Runtime: Traditional pairwise computation scales quadratically, exceeding 20 minutes for $N=70{,}000$ . The SWLCP method achieves near-linear scaling (empirical exponent $m \approx 1.1$ ), requiring $\approx 3$ seconds for the same $N$ .
Memory: The explicit approach exhausts memory at $N \approx 50{,}000$ – $60{,}000$ ; the SWLCP approach remains below 4 GB for $N=70{,}000$ , scaling linearly due to $O(N T \ell)$ storage.

This confirms that restricting to leaf-level collisions and utilizing separable weighting reduces real-world computational cost from intractable $\mathcal{O}(N^2)$ to essentially linear in $N$ .

For detailed derivations, algorithmic improvements, and further results, see "Scalable Tree Ensemble Proximities in Python" (Aumon et al., 6 Jan 2026).

Markdown Report Issue Upgrade to Chat

References (1)

Scalable Tree Ensemble Proximities in Python (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Separable Weighted Leaf-Collision Proximities.

Separable Weighted Leaf-Collision Proximities

1. Formal Framework and Definition

2. Sparse Matrix Factorization

3. Computational Workflow and Pseudocode

4. Computational Complexity Analysis

5. Illustrative Construction: Toy Example

6. Practical Implementation Using Python

7. Empirical Performance and Scalability

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Don't miss out on important new AI/ML research

Separable Weighted Leaf-Collision Proximities

1. Formal Framework and Definition

2. Sparse Matrix Factorization

3. Computational Workflow and Pseudocode

4. Computational Complexity Analysis

5. Illustrative Construction: Toy Example

6. Practical Implementation Using Python

7. Empirical Performance and Scalability

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Related Topics

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research