Branchless SIMD in B^S-tree

Updated 24 January 2026
  • The paper introduces a branchless SIMD approach in B^S-trees, eliminating conditional branches to streamline node search operations.
  • It presents a novel methodology leveraging SIMD instructions to traverse tree nodes in parallel, improving throughput and accuracy.
  • Empirical results reveal marked reductions in latency and enhanced scalability compared to conventional B-tree implementations.

Parallel Voting Decision Tree (PV-Tree) is a distributed algorithm designed to efficiently construct decision trees in large-scale, data-parallel environments. The core objective is to minimize communication costs, especially in contexts such as gradient boosting decision trees (GBDT) and random forests, where repeated computation of attribute histograms can otherwise overwhelm distributed systems. PV-Tree introduces a two-stage voting mechanism—local and global voting—to identify promising attributes for splitting, enabling near-optimal accuracy while keeping communication independent of the total number of attributes or dataset size. The theoretical foundation establishes probabilistic guarantees for identifying the globally best split, and empirical evaluation demonstrates favorable trade-offs between accuracy and efficiency relative to standard data-parallel and attribute-parallel baselines (Meng et al., 2016).

1. Problem Formulation and Setting

Consider a dataset $D = \{(x_i, y_i)\}_{i=1}^N$ where $x_i \in \mathbb{R}^d$, $y_i \in \mathcal{Y}$, and the goal is to construct a decision tree in parallel across $M$ machines. The dataset is partitioned horizontally into $D = \bigcup_{m=1}^{M} D_m$, with $|D_m| = n = N/M$. At each node $O$ of the current tree, the challenge is to identify the attribute $j^* \in [d]$ and threshold $w^* \in W_{j^*}$ that optimize an informativeness criterion $\Delta_{j}(w; O)$. For classification, this is typically information gain,

$$\text{IG}_j(w; O) = H(Y|O) - [P_L H(Y|X_j < w) + P_R H(Y|X_j \geq w)]$$

and for regression, variance gain. $P_L$ and $P_R$ are split probabilities, and $H(\cdot|\cdot)$ is conditional entropy (Meng et al., 2016).
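To make the information-gain criterion concrete, the following minimal sketch evaluates it for a single attribute and threshold on a toy sample. The function names `entropy` and `information_gain` and the toy data are illustrative assumptions, not taken from Meng et al. (2016); entropy is measured in bits.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys, w):
    """IG for splitting one attribute at threshold w: x < w goes left, x >= w goes right."""
    left = [y for x, y in zip(xs, ys) if x < w]
    right = [y for x, y in zip(xs, ys) if x >= w]
    p_left = len(left) / len(ys)    # P_L
    p_right = len(right) / len(ys)  # P_R
    return entropy(ys) - (p_left * entropy(left) + p_right * entropy(right))

# Hypothetical toy node: one attribute, binary labels.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0, 0, 1, 1]
print(information_gain(xs, ys, 2.5))  # perfect split -> 1.0
print(information_gain(xs, ys, 1.5))  # weaker split, strictly between 0 and 1
```

A threshold that separates the classes perfectly recovers the full node entropy, while any other threshold yields a smaller gain.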

2. Algorithmic Workflow: PV-Tree in Detail

PV-Tree operates by iteratively computing splits at each active tree node using three main phases:

2.1 Local Voting Phase

Each machine constructs histograms $H_{m,j}$ for all attributes $j$ over its local subset $D_m$, optionally pre-binning continuous features into $B$ bins to allow fast evaluation. The best local gain per attribute is computed as $\Delta_{m,j}^* = \max_{w} \Delta_{m,j}(w; D_m)$. The local top-$k$ attributes are selected based on $\Delta_{m,j}^*$, yielding sets $S_m \subset [d]$ with $|S_m| = k$ for each machine.
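The local voting step on one machine might be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the gain used here is the squared-error (variance) reduction computed from per-bin counts and target sums, features are assumed to lie in $[0, 1)$, and the names `build_histogram`, `best_gain`, and `local_top_k` are hypothetical.

```python
def build_histogram(values, targets, num_bins, lo, hi):
    """Per-bin [count, sum of targets] for one attribute on one machine."""
    hist = [[0, 0.0] for _ in range(num_bins)]
    width = (hi - lo) / num_bins
    for v, y in zip(values, targets):
        b = min(int((v - lo) / width), num_bins - 1)
        hist[b][0] += 1
        hist[b][1] += y
    return hist

def best_gain(hist):
    """Largest squared-error reduction over all bin-boundary splits."""
    total_n = sum(c for c, _ in hist)
    total_s = sum(s for _, s in hist)
    best, left_n, left_s = 0.0, 0, 0.0
    for c, s in hist[:-1]:
        left_n += c
        left_s += s
        right_n = total_n - left_n
        if left_n == 0 or right_n == 0:
            continue  # a split with an empty side is not a split
        right_s = total_s - left_s
        gain = left_s**2 / left_n + right_s**2 / right_n - total_s**2 / total_n
        best = max(best, gain)
    return best

def local_top_k(columns, targets, k, num_bins=8, lo=0.0, hi=1.0):
    """Return (S_m, gains): the k attribute indices with the largest local gain."""
    gains = [best_gain(build_histogram(col, targets, num_bins, lo, hi))
             for col in columns]
    order = sorted(range(len(columns)), key=lambda j: gains[j], reverse=True)
    return set(order[:k]), gains

# Hypothetical local data: attribute 0 separates the targets, 1 is noisy, 2 is constant.
targets = [0, 0, 1, 1]
columns = [[0.1, 0.2, 0.8, 0.9], [0.1, 0.9, 0.2, 0.8], [0.5, 0.5, 0.5, 0.5]]
print(local_top_k(columns, targets, k=1)[0])  # {0}
```

Only the index set $S_m$ leaves the machine at this stage; the histograms stay local until the final phase.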

2.2 Global Voting Phase

Each machine broadcasts its local candidate set $S_m$ (e.g., via MPI_Allgather). For each attribute $j$, the vote count $f_j = |\{ m : j \in S_m \}|$ is aggregated. The $2k$ attributes with the highest $f_j$ are selected to form the global candidate set $G$.
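The vote aggregation can be simulated in a single process as below. The function name `global_vote` is hypothetical, the list of sets stands in for the result of the all-gather, and breaking ties by smaller attribute index is an assumption that the source does not specify.

```python
from collections import Counter

def global_vote(local_sets, k):
    """Count votes f_j over all machines and keep the 2k most-voted attributes."""
    votes = Counter()
    for s in local_sets:
        votes.update(s)  # each machine contributes one vote per attribute in S_m
    # Assumption: ties are broken by the smaller attribute index.
    ranked = sorted(votes, key=lambda j: (-votes[j], j))
    return ranked[:2 * k], votes

# Hypothetical candidate sets from M = 5 machines with k = 2.
local_sets = [{0, 3}, {0, 5}, {3, 1}, {0, 7}, {3, 5}]
candidates, votes = global_vote(local_sets, k=2)
print(candidates)  # [0, 3, 5, 1]
```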

2.3 Final Split Selection Phase

For each $j \in G$, all machines send their local histograms $H_{m,j}$ (of $B$ bins) to the master (or combine via all-reduce). The master aggregates these to form $H_j = \sum_{m=1}^{M} H_{m,j}$ for each attribute in $G$. A scan over the binned histograms identifies the globally best split $(j^*, w^*) = \arg\max_{j \in G, w} \Delta_j(w; H_j)$.
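As a rough illustration of this final phase, the sketch below sums per-machine class-count histograms element-wise and scans bin boundaries for the largest information gain. The histogram layout (a list of bins, each holding per-class counts), the names `merge` and `best_split`, and the toy counts are assumptions made for the example.

```python
import math

def entropy(counts):
    """Entropy (bits) of a vector of class counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def merge(histograms):
    """H_j = sum over machines of H_{m,j}, computed bin by bin and class by class."""
    return [[sum(cls) for cls in zip(*bins)] for bins in zip(*histograms)]

def best_split(global_hists):
    """Scan every candidate attribute; return (gain, j, b) where bins [0, b) go left."""
    best = (0.0, None, None)
    for j, hist in global_hists.items():
        total = [sum(cls) for cls in zip(*hist)]
        n = sum(total)
        left = [0] * len(total)
        for b, counts in enumerate(hist[:-1]):
            left = [l + c for l, c in zip(left, counts)]
            right = [t - l for t, l in zip(total, left)]
            gain = entropy(total) - (sum(left) / n * entropy(left)
                                     + sum(right) / n * entropy(right))
            if gain > best[0]:
                best = (gain, j, b + 1)
    return best

# Hypothetical histograms from two machines for G = {4, 9}, B = 3 bins, 2 classes.
m0 = {4: [[3, 0], [1, 1], [0, 2]], 9: [[1, 1], [2, 1], [1, 1]]}
m1 = {4: [[2, 0], [1, 0], [0, 4]], 9: [[1, 2], [1, 1], [1, 1]]}
merged = {j: merge([m0[j], m1[j]]) for j in m0}
gain, j_star, boundary = best_split(merged)
print(j_star, boundary)  # 4 2
```

Because histograms are additive, the merged $H_j$ gives the same gain values as a histogram built on the full dataset, so the scan is exact for every attribute that reaches $G$.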

The key pseudocode is as follows:

Algorithm PV-Tree_FindBestSplit(D_m on m=1…M, k)
 1. Locally on each m:
    – Build H_{m,j} for j=1…d.
    – For each j let Δ_{m,j}^* = max_w Δ_j(w;H_{m,j}).
    – S_m ← top‐k indices by Δ_{m,j}^*.
 2. AllGather({S_m}_{m=1}^M) → {f_j}.
 3. G ← the 2k indices with largest f_j.
 4. Each m sends {H_{m,j}: j∈G} to master.  Master forms H_j = ∑_m H_{m,j}.
 5. Return (j*,w*) = arg max_{j∈G, w} Δ_j(w;H_j).
(Meng et al., 2016)

3. Communication Complexity Analysis

The PV-Tree communication protocol is designed so that its total per-split communication is $O(MkB)$, where $M$ is the number of machines, $k$ is the number of local candidates per machine, and $B$ is the histogram bin count. This is achieved as follows:

  • Local voting: Each of the $M$ machines transmits $k$ attribute indices ($O(Mk)$ words).
  • Histogram gathering: Each machine $m$ sends $2k$ histograms with $B$ bins ($O(M \cdot 2k \cdot B)$ words).

Thus, the total cost is $O(MkB)$ per split, independent of $d$ (the feature dimension) and dataset size $N$. In contrast, the baseline data-parallel approach requires $O(MdB)$, and attribute-parallel approaches may require $O(N)$. Empirical measurements on large datasets (e.g., $N = 1 \times 10^9$, $d = 1200$) show that PV-Tree with $k = 15$ reduces communication to 10 MB per full tree (depth 6), far less than the 750 MB (attribute-parallel) or 424 MB (data-parallel) required by alternatives (Meng et al., 2016).
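The two per-split word counts can be compared with a few lines of arithmetic. In this sketch the bin count $B = 64$ and the one-word-per-bin accounting are hypothetical choices for illustration, so the outputs are not expected to reproduce the measured megabyte figures; $M$, $d$, and $k$ follow values quoted in the text.

```python
def pv_tree_words(M, k, B):
    """Words sent per split by PV-Tree: vote indices plus 2k histograms per machine."""
    vote_words = M * k          # k attribute indices from each machine
    hist_words = M * 2 * k * B  # 2k histograms of B bins from each machine
    return vote_words + hist_words

def data_parallel_words(M, d, B):
    """Words sent per split when every machine ships all d histograms."""
    return M * d * B

M, d, k = 8, 1200, 15
B = 64  # hypothetical bin count, not given in the source
print(pv_tree_words(M, k, B))        # 15480
print(data_parallel_words(M, d, B))  # 614400
```

Under these assumptions the ratio is roughly $d / (2k)$, which is why the saving grows with the feature dimension.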

4. Theoretical Accuracy Guarantees

PV-Tree incorporates probabilistic guarantees of optimal split selection. Let $\mathrm{IG}_{(1)} \geq \cdots \geq \mathrm{IG}_{(d)}$ denote the true global information gains. Define $l_{(j)}(k) = |\mathrm{IG}_{(1)} - \mathrm{IG}_{(j)}|/2$ for $j > k$, and

$$\delta_{(j)}(n,k) = \alpha_{(j)}(n) + 4\exp(-c_{(j)} n l_{(j)}(k)^2)$$

where $\alpha_{(j)}(n) \to 0$ as $n \to \infty$, and the $c_{(j)} > 0$ are constants. The probability that PV-Tree with $M$ machines, local sample size $n$, $k$ local votes, and $2k$ global votes selects the best attribute is at least

$$\sum_{m=\lfloor M/2 \rfloor +1}^{M} \binom{M}{m} \Big[1-\sum_{j=k+1}^{d} \delta_{(j)}\Big]^{m} \Big[\sum_{j=k+1}^{d} \delta_{(j)}\Big]^{M-m}$$

which approaches $1$ as $n \to \infty$. The guarantee follows from uniform convergence of local gain estimates (VC-type analysis) and the fact that majority voting with set size $2k$ preserves the top-scoring attribute with high probability (Meng et al., 2016).
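The lower bound is a binomial tail: it is the probability that more than half of the $M$ machines succeed when each succeeds with probability $1 - \sum_{j>k} \delta_{(j)}$. A small sketch can evaluate it numerically. The function name `majority_bound` is hypothetical, and the values passed for the summed $\delta_{(j)}$ are made-up inputs, since the constants $\alpha_{(j)}$ and $c_{(j)}$ are not given in the text.

```python
from math import comb

def majority_bound(M, delta_sum):
    """Binomial tail: probability that more than M/2 machines succeed,
    where each succeeds with probability 1 - delta_sum."""
    p = 1.0 - delta_sum
    return sum(comb(M, m) * p**m * delta_sum**(M - m)
               for m in range(M // 2 + 1, M + 1))

# Hypothetical values of sum_{j>k} delta_(j); smaller values correspond to larger n.
for delta_sum in (0.5, 0.1, 0.01):
    print(delta_sum, majority_bound(8, delta_sum))
```

The bound rises toward 1 as the summed $\delta_{(j)}$ shrinks, which matches the stated limit as $n \to \infty$.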

5. Empirical Evaluation and Comparative Performance

PV-Tree was benchmarked inside GBDT on real-world learning-to-rank (LTR) and click-through rate (CTR) tasks:

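| Task | #Training | #Test | $d$ | Machines | PV-Tree time | Data-parallel time | Attribute-parallel time | Sequential time |
|------|-----------|-------|-----|----------|--------------|--------------------|-------------------------|-----------------|
| LTR  | 11M       | 1M    | 1200 | 8       | 5,825 s      | 32,260 s           | 14,660 s                | 28,690 s        |
| CTR  | 235M      | 31M   | 800  | 32      | 5,349 s      | 9,209 s            | 26,928 s                | 154,112 s       |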

PV-Tree achieves significant speed-ups (e.g., $4.9\times$ and $28.8\times$ over sequential for LTR and CTR, respectively). Communication costs, measured for a depth-6 tree on $N = 1 \times 10^9$, $d = 1200$, are 10 MB for PV-Tree, compared to 750 MB (attribute-parallel) and 424 MB (data-parallel). In terms of accuracy, PV-Tree matches or exceeds these baselines for practical values of $k$ ($5 \leq k \leq 20$) and sufficiently large $n$ (Meng et al., 2016).
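The quoted speed-ups can be rechecked directly from the reported training times. The dictionary layout and the function name `speedups` below are illustrative; the numbers are the times in seconds from the table.

```python
# Training times in seconds, as reported in the table above.
times = {
    "LTR": {"pv_tree": 5825, "data_parallel": 32260,
            "attr_parallel": 14660, "sequential": 28690},
    "CTR": {"pv_tree": 5349, "data_parallel": 9209,
            "attr_parallel": 26928, "sequential": 154112},
}

def speedups(task):
    """Ratio of each baseline's time to PV-Tree's time, rounded to one decimal."""
    t = times[task]
    return {name: round(sec / t["pv_tree"], 1)
            for name, sec in t.items() if name != "pv_tree"}

print(speedups("LTR"))  # {'data_parallel': 5.5, 'attr_parallel': 2.5, 'sequential': 4.9}
print(speedups("CTR"))  # {'data_parallel': 1.7, 'attr_parallel': 5.0, 'sequential': 28.8}
```

The same ratios show that on LTR the data-parallel baseline was slower than the sequential run (32,260 s versus 28,690 s).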

6. Design Trade-offs and Scalability Considerations

PV-Tree’s efficiency depends critically on the choice of $k$ and $M$. Increasing the number of machines ($M$) accelerates local computation but reduces per-machine sample size ($n$), which can eventually impair local top-$k$ selection accuracy. Selecting $k$ too small can degrade global split quality, while too large a $k$ increases communication costs unnecessarily. Empirical evidence suggests $k = 5$ to $20$ is effective when $n$ is large. For a fixed $N$, there is an optimal $M$ balancing computational and statistical efficiency (Meng et al., 2016).

7. Context and Relation to Other Parallel Decision Tree Methods

PV-Tree contrasts with attribute-parallel (vertical partitioning) approaches, which partition features across machines but must communicate the sample assignments produced by the best attribute's split at every node, incurring a communication cost of $O(N)$. Data-parallel approaches, which aggregate all histograms across attributes, incur $O(MdB)$ communication, scaling linearly with $d$. PV-Tree instead achieves $O(MkB)$ communication, decoupling network cost from both $d$ and $N$. The theoretical and empirical findings position PV-Tree as an effective solution for distributed tree induction in both high-dimensional and large-scale regimes (Meng et al., 2016).

References (1)
