Branchless SIMD in B^S-tree

Updated 24 January 2026

The paper introduces a branchless SIMD approach in B^S-trees, eliminating conditional branches to streamline node search operations.
It presents a novel methodology leveraging SIMD instructions to traverse tree nodes in parallel, improving throughput and accuracy.
Empirical results reveal marked reductions in latency and enhanced scalability compared to conventional B-tree implementations.

Parallel Voting Decision Tree (PV-Tree) is a distributed algorithm designed to efficiently construct decision trees in large-scale, data-parallel environments. The core objective is to minimize communication costs, especially in contexts such as gradient boosting decision trees (GBDT) and random forests, where repeated computation of attribute histograms can otherwise overwhelm distributed systems. PV-Tree introduces a two-stage voting mechanism—local and global voting—to identify promising attributes for splitting, enabling near-optimal accuracy while keeping communication independent of the total number of attributes or dataset size. The theoretical foundation establishes probabilistic guarantees for identifying the globally best split, and empirical evaluation demonstrates favorable trade-offs between accuracy and efficiency relative to standard data-parallel and attribute-parallel baselines (Meng et al., 2016).

1. Problem Formulation and Setting

Consider a dataset $D = \{(x_i, y_i)\}_{i=1}^N$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathcal{Y}$; the goal is to construct a decision tree in parallel across $M$ machines. The dataset is partitioned horizontally into $D = \bigcup_{m=1}^{M} D_m$, with $|D_m| = n = N/M$. At each node $O$ of the current tree, the challenge is to identify the attribute $j^* \in [d]$ and threshold $w^* \in W_{j^*}$ that maximize an informativeness criterion $\Delta_{j}(w; O)$. For classification, this is typically information gain, $\text{IG}_j(w; O) = H(Y|O) - [P_L H(Y|X_j < w) + P_R H(Y|X_j \geq w)]$; for regression, it is variance gain. Here $P_L$ and $P_R$ are the probabilities that a sample at $O$ falls into the left ($X_j < w$) and right ($X_j \geq w$) branch, and $H(\cdot|\cdot)$ is conditional entropy (Meng et al., 2016).
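To make the information-gain criterion concrete, the following is a minimal sketch for a single attribute and threshold, using empirical class frequencies in place of the probabilities. The function names `entropy` and `information_gain` and the toy data are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys, w):
    """IG of splitting on attribute values xs at threshold w (left branch: x < w)."""
    left = [y for x, y in zip(xs, ys) if x < w]
    right = [y for x, y in zip(xs, ys) if x >= w]
    p_left = len(left) / len(ys)    # empirical P_L
    p_right = len(right) / len(ys)  # empirical P_R
    return entropy(ys) - (p_left * entropy(left) + p_right * entropy(right))

# Toy data (hypothetical): the threshold 0.5 separates the two classes perfectly.
xs = [0.1, 0.4, 0.6, 0.9]
ys = [0, 0, 1, 1]
print(information_gain(xs, ys, 0.5))   # 1.0 bit: the split removes all uncertainty
print(information_gain(xs, ys, 0.05))  # 0.0: everything falls on the right
```

A threshold that leaves one branch empty yields zero gain, which is why split searches typically skip such thresholds.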

2. Algorithmic Workflow: PV-Tree in Detail

PV-Tree operates by iteratively computing splits at each active tree node using three main phases:

2.1 Local Voting Phase

Each machine constructs histograms $H_{m,j}$ for all attributes $j$ over its local subset $D_m$, optionally pre-binning continuous features into $B$ bins to allow fast evaluation. The best local gain per attribute is computed as $\Delta_{m,j}^* = \max_{w} \Delta_{m,j}(w; D_m)$. Each machine then selects its local top-$k$ attributes by $\Delta_{m,j}^*$, yielding a set $S_m \subset [d]$ with $|S_m|=k$.
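The local voting step on one machine might look like the sketch below. It assumes binary labels (0/1), attributes already binned to integer bin ids, and information gain as the criterion; the helper names and the toy data are hypothetical.

```python
import math

def entropy2(pos, neg):
    """Binary entropy (bits) from positive/negative counts."""
    n = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c > 0:
            h -= (c / n) * math.log2(c / n)
    return h

def build_histogram(bin_ids, labels, num_bins):
    """Per-bin [negatives, positives] counts for one pre-binned attribute."""
    hist = [[0, 0] for _ in range(num_bins)]
    for b, y in zip(bin_ids, labels):
        hist[b][y] += 1
    return hist

def best_gain(hist):
    """Max information gain over all bin-boundary thresholds of a histogram."""
    total_neg = sum(h[0] for h in hist)
    total_pos = sum(h[1] for h in hist)
    n = total_neg + total_pos
    parent = entropy2(total_pos, total_neg)
    best, neg, pos = 0.0, 0, 0
    for h in hist[:-1]:  # candidate threshold after each bin except the last
        neg += h[0]
        pos += h[1]
        n_left, n_right = neg + pos, n - (neg + pos)
        if n_left == 0 or n_right == 0:
            continue  # skip splits that leave one side empty
        child = (n_left / n) * entropy2(pos, neg) \
            + (n_right / n) * entropy2(total_pos - pos, total_neg - neg)
        best = max(best, parent - child)
    return best

def local_top_k(binned_columns, labels, num_bins, k):
    """Return S_m: indices of the k attributes with the largest local gain."""
    gains = [best_gain(build_histogram(col, labels, num_bins))
             for col in binned_columns]
    return sorted(range(len(gains)), key=lambda j: -gains[j])[:k]

# Toy local data: attribute 0 is perfect, 1 is uninformative, 2 is partial.
labels = [0, 0, 1, 1]
columns = [[0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 0, 1]]
print(local_top_k(columns, labels, num_bins=2, k=2))  # [0, 2]
```

Only the index set $S_m$ leaves the machine at this stage; the histograms stay local until the final phase.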

2.2 Global Voting Phase

Each machine broadcasts its local candidate set $S_m$ (e.g., via MPI_Allgather). For each attribute $j$, the vote count $f_j = |\{ m : j \in S_m \}|$ is aggregated. The $2k$ attributes with the highest $f_j$ form the global candidate set $G$.
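The vote counting itself is a small operation once every machine holds all the sets $S_m$. A minimal sketch follows; breaking ties by the lower attribute index is an assumption made here so the result is deterministic, and the function name `global_vote` is hypothetical.

```python
from collections import Counter

def global_vote(local_sets, k):
    """Return G: the 2k attributes with the most votes.

    local_sets is [S_1, ..., S_M]; ties are broken by lower attribute index
    (an assumption of this sketch).
    """
    votes = Counter(j for s in local_sets for j in s)  # f_j
    ranked = sorted(votes, key=lambda j: (-votes[j], j))
    return ranked[:2 * k]

# Four machines, k = 2: attribute 0 gets 4 votes, 2 gets 2, 3 and 5 get 1 each.
S = [[0, 2], [0, 3], [2, 0], [5, 0]]
print(global_vote(S, k=2))  # [0, 2, 3, 5]
```

If fewer than $2k$ distinct attributes receive any vote, this sketch simply returns all of them.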

2.3 Final Split Selection Phase

For each $j \in G$, all machines send their local histograms $H_{m,j}$ (of $B$ bins) to the master (or combine them via all-reduce). The master aggregates these to form $H_j = \sum_{m=1}^{M} H_{m,j}$ for each attribute in $G$. A scan over the merged histograms identifies the globally best split $(j^*, w^*) = \arg\max_{j \in G, w} \Delta_j(w; H_j)$.
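The aggregation and scan can be sketched as follows. This example assumes a regression setting in which each histogram bin stores `(count, sum_y)` and the gain is the reduction in squared error, $S_L^2/n_L + S_R^2/n_R - S^2/n$; that storage layout, the function names, and the toy histograms are assumptions of the sketch rather than details from the paper.

```python
def merge_histograms(local_hists):
    """Element-wise sum of per-machine histograms; each bin is (count, sum_y)."""
    num_bins = len(local_hists[0])
    return [
        (sum(h[b][0] for h in local_hists), sum(h[b][1] for h in local_hists))
        for b in range(num_bins)
    ]

def best_split(hist):
    """Return (gain, bin_index); the split threshold lies after bin_index."""
    n = sum(c for c, _ in hist)
    s = sum(t for _, t in hist)
    best = (0.0, None)
    n_left, s_left = 0, 0.0
    for b, (c, t) in enumerate(hist[:-1]):
        n_left += c
        s_left += t
        n_right = n - n_left
        if n_left == 0 or n_right == 0:
            continue  # skip splits that leave one side empty
        gain = s_left**2 / n_left + (s - s_left)**2 / n_right - s**2 / n
        if gain > best[0]:
            best = (gain, b)
    return best

def final_split(local_hists_by_attr):
    """local_hists_by_attr maps each j in G to [H_{1,j}, ..., H_{M,j}]."""
    results = {j: best_split(merge_histograms(hs))
               for j, hs in local_hists_by_attr.items()}
    j_star = max(results, key=lambda j: results[j][0])
    gain, bin_index = results[j_star]
    return j_star, bin_index, gain

# Two machines, G = {7, 9}, B = 3 bins (toy numbers).
hists = {
    7: [[(2, 0.0), (1, 1.0), (1, 4.0)], [(1, 0.0), (0, 0.0), (2, 8.0)]],
    9: [[(1, 1.0), (2, 2.0), (1, 2.0)], [(1, 2.0), (1, 3.0), (1, 3.0)]],
}
print(final_split(hists))  # attribute 7, split after bin 1
```

Only the histograms of attributes in $G$ are merged, which is where the communication saving over shipping all $d$ histograms comes from.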

The key pseudocode is as follows:

Algorithm PV-Tree_FindBestSplit(D_m on m=1…M, k)
 1. Locally on each m:
    – Build H_{m,j} for j=1…d.
    – For each j let Δ_{m,j}^* = max_w Δ_j(w;H_{m,j}).
    – S_m ← top‐k indices by Δ_{m,j}^*.
 2. AllGather({S_m}_{m=1}^M); for each j, f_j ← |{m : j ∈ S_m}|.
 3. G ← the 2k indices with largest f_j.
 4. Each m sends {H_{m,j}: j∈G} to master.  Master forms H_j = ∑_m H_{m,j}.
 5. Return (j*,w*) = arg max_{j∈G, w} Δ_j(w;H_j).

(Meng et al., 2016)

3. Communication Complexity Analysis

The PV-Tree communication protocol is designed so that its total per-split communication is $O(MkB)$, where $M$ is the number of machines, $k$ is the number of local candidates per machine, and $B$ is the histogram bin count. The cost breaks down as follows:

Local voting: Each of the $M$ machines transmits $k$ attribute indices ($O(Mk)$ words).
Histogram gathering: Each machine $m$ sends $2k$ histograms with $B$ bins ($O(M \cdot 2k \cdot B)$ words).

Thus, the total cost is $O(MkB)$ per split, independent of both the feature dimension $d$ and the dataset size $N$. In contrast, the baseline data-parallel approach requires $O(MdB)$, and attribute-parallel approaches may require $O(N)$. Empirical measurements on large datasets (e.g., $N=1 \times 10^9$, $d=1200$) show that PV-Tree with $k=15$ reduces communication to $10$ MB per full tree (depth 6), far less than the $750$ MB (attribute-parallel) or $424$ MB (data-parallel) required by the alternatives (Meng et al., 2016).
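The asymptotic comparison can be checked with a quick word count. The sketch below assumes one word per attribute index and one word per histogram bin; $M=8$ and $B=255$ are illustrative values chosen here, not figures from the paper, so the result is not expected to reproduce the measured megabyte totals.

```python
def pv_tree_words(M, k, B):
    """Words per split: k indices plus 2k histograms of B bins, per machine."""
    return M * k + M * 2 * k * B

def data_parallel_words(M, d, B):
    """Words per split when every machine ships all d histograms."""
    return M * d * B

# Illustrative setting: M and B are assumptions; k and d follow the text.
M, k, B, d = 8, 15, 255, 1200
pv = pv_tree_words(M, k, B)
dp = data_parallel_words(M, d, B)
print(pv)               # 61320
print(dp)               # 2448000
print(round(dp / pv))   # about 40x fewer words for PV-Tree
```

Doubling $d$ doubles the data-parallel count but leaves the PV-Tree count unchanged, which is the practical meaning of independence from $d$.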

4. Theoretical Accuracy Guarantees

PV-Tree incorporates probabilistic guarantees of optimal split selection. Let $\mathrm{IG}_{(1)} \geq \cdots \geq \mathrm{IG}_{(d)}$ denote the true global information gains sorted in descending order. Define $l_{(j)}(k) = |\mathrm{IG}_{(1)} - \mathrm{IG}_{(j)}|/2$ for $j > k$, and

$\delta_{(j)}(n,k) = \alpha_{(j)}(n) + 4\exp(-c_{(j)} n l_{(j)}(k)^2)$

where $\alpha_{(j)}(n)\to 0$ as $n\to\infty$ and the $c_{(j)} > 0$ are constants. The probability that PV-Tree with $M$ machines, local sample size $n$, $k$ local votes, and $2k$ global votes selects the best attribute is at least

$\sum_{m=\lfloor M/2 \rfloor +1}^M \binom{M}{m} [1-\sum_{j=k+1}^d \delta_{(j)}(n,k)]^m [\sum_{j=k+1}^d \delta_{(j)}(n,k)]^{M-m}$

which approaches $1$ as $n \to \infty$. The guarantee follows from uniform convergence of local gain estimates (VC-type analysis) and the fact that majority voting with set size $2k$ preserves the top-scoring attribute with high probability (Meng et al., 2016).
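The lower bound is a binomial tail: the probability that more than half of the $M$ machines succeed when each succeeds independently with probability $1 - \sum_{j>k} \delta_{(j)}(n,k)$. The sketch below evaluates it numerically, taking the summed $\delta$ term as a given input called `total_delta`, since the constants $\alpha_{(j)}$ and $c_{(j)}$ are not specified here; the function name is hypothetical.

```python
from math import comb

def pv_tree_lower_bound(M, total_delta):
    """Binomial tail over m = floor(M/2)+1 .. M with success prob 1 - total_delta.

    total_delta stands for the sum over j > k of delta_(j)(n, k); it is
    supplied directly because its constants are not given in the text.
    """
    p = 1.0 - total_delta
    return sum(
        comb(M, m) * p**m * total_delta**(M - m)
        for m in range(M // 2 + 1, M + 1)
    )

print(round(pv_tree_lower_bound(3, 0.1), 3))  # 0.972
# As total_delta shrinks (larger n), the bound moves toward 1.
for total_delta in (0.3, 0.1, 0.01):
    print(total_delta, pv_tree_lower_bound(8, total_delta))
```

With `total_delta = 0` every term except $m = M$ vanishes and the bound equals exactly 1, matching the limit stated above.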

5. Empirical Evaluation and Comparative Performance

PV-Tree was benchmarked inside GBDT on real-world learning-to-rank (LTR) and click-through rate (CTR) tasks:

Task	Training examples	Test examples	Attributes (d)	Machines (M)	PV-Tree time	Data-parallel time	Attribute-parallel time	Sequential time
LTR	11M	1M	1200	8	5,825 s	32,260 s	14,660 s	28,690 s
CTR	235M	31M	800	32	5,349 s	9,209 s	26,928 s	154,112 s

PV-Tree is the fastest method on both tasks, with speed-ups over the sequential baseline of $4.9\times$ on LTR and $28.8\times$ on CTR. Communication costs, measured for a depth-6 tree on $N=1\times10^9$, $d=1200$, are $10$ MB for PV-Tree, compared to $750$ MB (attribute-parallel) and $424$ MB (data-parallel). In terms of accuracy, PV-Tree matches or exceeds these baselines for practical values of $k$ ($5 \leq k \leq 20$) and sufficiently large $n$ (Meng et al., 2016).
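The speed-up figures follow directly from the running times in the table. The short sketch below recomputes them for every baseline; the dictionary keys are labels chosen here for illustration.

```python
# Running times in seconds, copied from the table above.
times = {
    "LTR": {"pv_tree": 5825, "data_parallel": 32260,
            "attr_parallel": 14660, "sequential": 28690},
    "CTR": {"pv_tree": 5349, "data_parallel": 9209,
            "attr_parallel": 26928, "sequential": 154112},
}

def speedups(row):
    """Speed-up of PV-Tree over each baseline, rounded to one decimal."""
    return {name: round(t / row["pv_tree"], 1)
            for name, t in row.items() if name != "pv_tree"}

for task, row in times.items():
    print(task, speedups(row))
# LTR {'data_parallel': 5.5, 'attr_parallel': 2.5, 'sequential': 4.9}
# CTR {'data_parallel': 1.7, 'attr_parallel': 5.0, 'sequential': 28.8}
```

On LTR the data-parallel baseline is slower than the sequential run (32,260 s versus 28,690 s), so its speed-up figure is larger than the sequential one.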

6. Design Trade-offs and Scalability Considerations

PV-Tree’s efficiency depends critically on the choice of $k$ and $M$. Increasing the number of machines ($M$) accelerates local computation but reduces the per-machine sample size ($n = N/M$), which can eventually impair local top-$k$ selection accuracy. Selecting $k$ too small can degrade global split quality, while too large a $k$ increases communication costs unnecessarily. Empirical evidence suggests $k=5$ to $20$ is effective when $n$ is large. For a fixed $N$, there is an optimal $M$ balancing computational and statistical efficiency (Meng et al., 2016).

7. Context and Relation to Other Parallel Decision Tree Methods

PV-Tree contrasts with attribute-parallel (vertical partitioning) approaches, which partition features across machines but must redistribute the sample partition induced by the best attribute's split after every split, incurring a communication cost of $O(N)$. Data-parallel approaches, which aggregate the histograms of all attributes, incur $O(MdB)$ communication, scaling linearly with $d$. Among these three approaches, only PV-Tree achieves $O(MkB)$ communication, decoupling network cost from both $d$ and $N$. The theoretical and empirical findings position PV-Tree as an effective solution for distributed tree induction in both high-dimensional and large-scale regimes (Meng et al., 2016).

References (1)

Meng et al. (2016). A Communication-Efficient Parallel Algorithm for Decision Tree.
