Parallel Voting Decision Tree (PV-Tree) is a distributed algorithm designed to efficiently construct decision trees in large-scale, data-parallel environments. The core objective is to minimize communication costs, especially in contexts such as gradient boosting decision trees (GBDT) and random forests, where repeated computation of attribute histograms can otherwise overwhelm distributed systems. PV-Tree introduces a two-stage voting mechanism—local and global voting—to identify promising attributes for splitting, enabling near-optimal accuracy while keeping communication independent of the total number of attributes or dataset size. The theoretical foundation establishes probabilistic guarantees for identifying the globally best split, and empirical evaluation demonstrates favorable trade-offs between accuracy and efficiency relative to standard data-parallel and attribute-parallel baselines (Meng et al., 2016).
1. Problem Formulation and Setting
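Consider a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ is a feature vector and $y_i$ is its label, and the goal is to construct a decision tree in parallel across $M$ machines. The dataset is partitioned horizontally into $D_1, \dots, D_M$, with machine $m$ holding a local subset $D_m$ of $n$ samples. At each node of the current tree, the challenge is to identify the attribute $j$ and threshold $w$ that optimize an informativeness criterion $\Delta_j(w)$. For classification, this is typically information gain, $\Delta_j(w) = H(Y) - \left[P(X_j \le w)\,H(Y \mid X_j \le w) + P(X_j > w)\,H(Y \mid X_j > w)\right]$, and for regression, variance gain. Here $P(X_j \le w)$ and $P(X_j > w)$ are the split probabilities, and $H(Y \mid \cdot)$ is the conditional entropy (Meng et al., 2016).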
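To make the classification criterion concrete, the following minimal sketch computes the information gain of a single threshold split directly from raw samples. The function names `entropy` and `information_gain` and the toy data are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, w):
    """Gain from splitting one attribute into `value <= w` and `value > w`."""
    left = [y for x, y in zip(values, labels) if x <= w]
    right = [y for x, y in zip(values, labels) if x > w]
    p_left = len(left) / len(labels)      # empirical P(X_j <= w)
    p_right = 1.0 - p_left                # empirical P(X_j > w)
    return entropy(labels) - (p_left * entropy(left) + p_right * entropy(right))

# Hypothetical toy attribute: a threshold of 2.5 separates the classes perfectly.
values = [1.0, 2.0, 3.0, 4.0]
labels = [0, 0, 1, 1]
gain_perfect = information_gain(values, labels, 2.5)
gain_poor = information_gain(values, labels, 1.5)
```

The perfectly separating threshold recovers the full label entropy (1 bit here), while the off-center threshold leaves a mixed right-hand side and a smaller gain.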
2. Algorithmic Workflow: PV-Tree in Detail
PV-Tree operates by iteratively computing splits at each active tree node using three main phases:
2.1 Local Voting Phase
Each machine $m$ constructs histograms $H_{m,j}$ for all attributes $j = 1, \dots, d$ over its local subset $D_m$, optionally pre-binning continuous features into $b$ bins to allow fast evaluation. The best local gain per attribute is computed as $\Delta_{m,j}^* = \max_w \Delta_j(w; H_{m,j})$. The local top-$k$ attributes are selected based on $\Delta_{m,j}^*$, yielding a candidate set $S_m$ for each machine.
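The local phase can be sketched for a regression target as follows. This is an illustration under stated assumptions: every attribute shares the same bin edges, each bin stores a count and a sum of targets, and the gain is the reduction in squared error (a common variance-gain form). The helper names `build_histogram`, `best_gain`, and `local_top_k` are hypothetical.

```python
import bisect

def build_histogram(xs, ys, edges):
    """Per-bin [count, sum_of_targets] for one attribute."""
    hist = [[0, 0.0] for _ in range(len(edges) + 1)]
    for x, y in zip(xs, ys):
        i = bisect.bisect_left(edges, x)   # bin i holds edges[i-1] < x <= edges[i]
        hist[i][0] += 1
        hist[i][1] += y
    return hist

def best_gain(hist):
    """Scan bin boundaries; return (largest gain, index of the last left-side bin)."""
    n = sum(c for c, _ in hist)
    s = sum(t for _, t in hist)
    best, best_idx = 0.0, None
    n_left, s_left = 0, 0.0
    for i, (c, t) in enumerate(hist[:-1]):  # candidate split after bin i
        n_left += c
        s_left += t
        n_right = n - n_left
        if n_left == 0 or n_right == 0:
            continue                        # degenerate split, nothing separated
        gain = s_left**2 / n_left + (s - s_left)**2 / n_right - s**2 / n
        if gain > best:
            best, best_idx = gain, i
    return best, best_idx

def local_top_k(columns, ys, edges, k):
    """Indices of the k attributes with the largest best local gain."""
    gains = [best_gain(build_histogram(col, ys, edges))[0] for col in columns]
    return sorted(range(len(columns)), key=lambda j: -gains[j])[:k]

# Hypothetical local subset: 6 samples, 3 attributes, 4 bins.
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
columns = [
    [0.1, 0.2, 0.3, 0.7, 0.8, 0.9],   # informative
    [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],   # constant, no usable split
    [0.1, 0.8, 0.2, 0.3, 0.9, 0.7],   # weakly informative
]
edges = [0.25, 0.5, 0.75]
S_m = local_top_k(columns, ys, edges, k=2)
```

Only the selected indices in `S_m` need to leave the machine at this stage; the histograms themselves stay local until the final phase.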
2.2 Global Voting Phase
Each machine broadcasts its local candidate set $S_m$ (e.g., via MPI_AllGather). For each attribute $j$, the vote count $f_j = |\{m : j \in S_m\}|$ is aggregated. The $2k$ attributes with the highest $f_j$ are selected to form the global candidate set $G$.
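Once every machine holds all local candidate sets, the vote count is a simple tally. The sketch below assumes the gathered sets are already available as Python lists; `global_vote` is a hypothetical helper, and the tie-breaking rule (smaller index first) is an arbitrary choice made here for determinism.

```python
from collections import Counter

def global_vote(local_sets, k):
    """Return the 2k attributes that appear in the most local top-k sets."""
    votes = Counter(j for s in local_sets for j in s)
    # Ties are broken by smaller attribute index (an assumption of this sketch).
    ranked = sorted(votes, key=lambda j: (-votes[j], j))
    return ranked[:2 * k], votes

# Hypothetical gathered sets S_m for M = 5 machines with k = 2.
local_sets = [[3, 7], [3, 1], [7, 3], [5, 3], [7, 9]]
G, votes = global_vote(local_sets, k=2)
```

Here attributes 3 and 7 dominate the vote, and the remaining two slots are filled from the tied single-vote attributes.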
2.3 Final Split Selection Phase
For each $j \in G$, all machines send their local histograms $H_{m,j}$ (of $b$ bins) to the master (or combine them via all-reduce). The master aggregates these to form $H_j = \sum_{m} H_{m,j}$ for each attribute in $G$. A scan over the binned histograms identifies the globally best split $(j^*, w^*)$.
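The final phase relies on histograms being additive: provided all machines use the same bin edges (an assumption of this sketch), summing the local histograms bin by bin gives exactly the histogram of the pooled data. The helpers `count_histogram` and `merge_histograms` are hypothetical names, and plain per-bin counts stand in for whatever statistics the gain criterion needs.

```python
import bisect

def count_histogram(xs, edges):
    """Number of samples falling in each bin for one attribute."""
    hist = [0] * (len(edges) + 1)
    for x in xs:
        hist[bisect.bisect_left(edges, x)] += 1
    return hist

def merge_histograms(per_machine):
    """Element-wise sum of per-machine histograms, keyed by attribute index."""
    merged = {}
    for local in per_machine:              # one dict per machine: {j: bin counts}
        for j, hist in local.items():
            if j not in merged:
                merged[j] = [0] * len(hist)
            for i, count in enumerate(hist):
                merged[j][i] += count
    return merged

edges = [0.25, 0.5, 0.75]
machine_a = {4: count_histogram([0.1, 0.6, 0.9], edges)}   # attribute 4 is in G
machine_b = {4: count_histogram([0.3, 0.7], edges)}
merged = merge_histograms([machine_a, machine_b])
pooled = count_histogram([0.1, 0.6, 0.9, 0.3, 0.7], edges)
```

Because `merged[4]` equals `pooled`, the scan over the aggregated histograms evaluates the candidate attributes on the full data, not on an approximation of it.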
The key pseudocode is as follows:
Algorithm PV-Tree_FindBestSplit(D_m on m=1…M, k)
1. Locally on each m:
– Build H_{m,j} for j=1…d.
– For each j let Δ_{m,j}^* = max_w Δ_j(w;H_{m,j}).
– S_m ← top‐k indices by Δ_{m,j}^*.
2. AllGather({S_m}_{m=1}^M) → {f_j}.
3. G ← the 2k indices with largest f_j.
4. Each m sends {H_{m,j}: j∈G} to master. Master forms H_j = ∑_m H_{m,j}.
5. Return (j*,w*) = arg max_{j∈G, w} Δ_j(w;H_j).
3. Communication Complexity Analysis
The PV-Tree communication protocol is designed so that its total per-split communication is $O(Mkb)$, where $M$ is the number of machines, $k$ is the number of local candidates per machine, and $b$ is the histogram bin count. This is achieved as follows:
- Local voting: Each of the $M$ machines transmits $k$ attribute indices ($O(Mk)$ words in total).
- Histogram gathering: Each machine sends $2k$ histograms with $b$ bins ($O(Mkb)$ words in total).
Thus, the total cost is $O(Mkb)$ per split, independent of the feature dimension $d$ and the dataset size $N$. In contrast, the baseline data-parallel approach requires $O(Mdb)$, and attribute-parallel approaches may require $O(N)$. Empirical measurements on a large dataset show that PV-Tree reduces communication to $10$ MB per full tree (depth 6), far less than the $750$ MB (attribute-parallel) or $424$ MB (data-parallel) required by the alternatives (Meng et al., 2016).
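A back-of-the-envelope count shows how the two costs diverge as the feature dimension grows. The sketch counts words per split with all constants set to one: $k$ indices and then $2k$ histograms of $b$ bins per machine for PV-Tree, against $d$ histograms of $b$ bins per machine for the data-parallel baseline. The values of `b` and `k` below are illustrative assumptions, and the result is not meant to reproduce the megabyte figures reported in the paper.

```python
def pv_tree_words(M, k, b):
    """Words sent per split: k indices per machine, then 2k histograms of b bins."""
    return M * k + M * 2 * k * b

def data_parallel_words(M, d, b):
    """Words sent per split when every machine ships all d histograms."""
    return M * d * b

# M and d follow the LTR row of the benchmark table; b and k are assumed values.
M, d, b, k = 8, 1200, 64, 15
pv = pv_tree_words(M, k, b)
dp = data_parallel_words(M, d, b)
ratio = dp / pv
```

With these numbers the data-parallel scheme moves about 40 times more words per split, and the gap widens linearly in $d$ while the PV-Tree count stays fixed.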
4. Theoretical Accuracy Guarantees
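PV-Tree comes with a probabilistic guarantee on split selection. The analysis ranks the attributes by their true global information gains, defines a gap quantity for the lower-ranked attributes, and introduces a per-attribute error term that includes a component vanishing as the local sample size $n$ grows, together with fixed constants; the explicit expressions are given in Meng et al. (2016). The probability that PV-Tree with $M$ machines, local sample size $n$, $k$ local votes, and $2k$ global votes selects the best attribute is lower-bounded by a quantity that approaches $1$ as $n \to \infty$. The guarantee follows from uniform convergence of local gain estimates (VC-type analysis) and the fact that majority voting with set size $2k$ preserves the top-scoring attribute with high probability (Meng et al., 2016).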
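The role of the $2k$ set size can be seen from a counting argument: the machines cast $Mk$ votes in total, so fewer than $2k$ attributes can each receive more than $M/2$ votes, and any attribute with a strict majority is therefore guaranteed a place among the top $2k$. The sketch below turns this into a number under a simplifying assumption that is ours, not the paper's: each machine independently places the best attribute in its local top-$k$ with the same probability `p`. It illustrates the voting intuition and is not the bound from Meng et al. (2016).

```python
from math import comb

def majority_probability(M, p):
    """P(more than M/2 of M independent machines vote for the best attribute)."""
    return sum(
        comb(M, m) * p**m * (1 - p) ** (M - m)
        for m in range(M // 2 + 1, M + 1)
    )

# Hypothetical per-machine success probability p = 0.9.
p3 = majority_probability(3, 0.9)
p9 = majority_probability(9, 0.9)
```

For `p` above one half, the majority probability rises with the number of machines, which is why a moderately reliable local ranking is enough for the global vote to retain the best attribute.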
5. Empirical Evaluation and Comparative Performance
PV-Tree was benchmarked inside GBDT on real-world learning-to-rank (LTR) and click-through rate (CTR) tasks:
| Task | #Training | #Test | d | Machines | PV-Tree Time | Data-Parallel | Attr-Parallel | Sequential |
|---|---|---|---|---|---|---|---|---|
| LTR | 11M | 1M | 1200 | 8 | 5,825 s | 32,260 s | 14,660 s | 28,690 s |
| CTR | 235M | 31M | 800 | 32 | 5,349 s | 9,209 s | 26,928 s | 154,112 s |
PV-Tree achieves significant speed-ups over the sequential baseline (roughly $4.9\times$ for LTR and $28.8\times$ for CTR, from the run times in the table). Communication costs for a depth-6 tree are $10$ MB for PV-Tree, compared to $750$ MB (attribute-parallel) and $424$ MB (data-parallel). In terms of accuracy, PV-Tree matches or exceeds these baselines for practical values of $k$ and a sufficiently large per-machine sample size $n$ (Meng et al., 2016).
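The speed-up factors follow directly from the run times in the table. The short calculation below divides each baseline's time by the PV-Tree time for the same task; the dictionary layout and key names are illustrative choices.

```python
# Run times in seconds, copied from the benchmark table.
times = {
    "LTR": {"pv_tree": 5825, "data_parallel": 32260,
            "attr_parallel": 14660, "sequential": 28690},
    "CTR": {"pv_tree": 5349, "data_parallel": 9209,
            "attr_parallel": 26928, "sequential": 154112},
}

# Speed-up of PV-Tree over each baseline: baseline time / PV-Tree time.
speedups = {
    task: {name: t[name] / t["pv_tree"] for name in t if name != "pv_tree"}
    for task, t in times.items()
}

ltr_vs_sequential = round(speedups["LTR"]["sequential"], 1)   # 4.9
ctr_vs_sequential = round(speedups["CTR"]["sequential"], 1)   # 28.8
```

The same dictionary shows that the ranking of the two parallel baselines flips between tasks: attribute-parallel is the faster baseline on LTR, while data-parallel is faster on CTR.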
6. Design Trade-offs and Scalability Considerations
PV-Tree’s efficiency depends critically on the choice of $k$ and $M$. Increasing the number of machines ($M$) accelerates local computation but reduces the per-machine sample size ($n$), which can eventually impair local top-$k$ selection accuracy. Selecting $k$ too small can degrade global split quality, while too large a $k$ increases communication costs unnecessarily. Empirical evidence suggests that values of $k$ up to $20$ are effective on the large-scale datasets evaluated. For a fixed dataset size $N$, there is an optimal $M$ balancing computational and statistical efficiency (Meng et al., 2016).
7. Context and Relation to Other Parallel Decision Tree Methods
PV-Tree contrasts with attribute-parallel (vertical partitioning) approaches, which partition features across machines but must redistribute the sample assignment produced by the best attribute’s split at every split, incurring a communication cost proportional to $N$. Data-parallel approaches, which aggregate all histograms across attributes, incur $O(Mdb)$ communication, scaling linearly with $d$. PV-Tree instead achieves $O(Mkb)$ communication, decoupling network cost from $d$ and $N$. The theoretical and empirical findings position PV-Tree as an effective solution for distributed tree induction in both high-dimensional and large-scale regimes (Meng et al., 2016).