Parallel Voting Decision Tree (PV-Tree)
- PV-Tree is a distributed algorithm that uses a two-tier voting protocol to efficiently select split attributes with reduced communication overhead.
- It operates in three phases—local voting, global voting, and final split selection—to build decision trees and ensemble variants on horizontally partitioned data.
- Empirical studies demonstrate significant speedups and lower transfer volumes compared to traditional data- and attribute-parallel methods, with strong theoretical guarantees.
Parallel Voting Decision Tree (PV-Tree) is a communication-efficient distributed algorithm for learning decision trees and their ensemble variants, such as Gradient Boosted Decision Trees (GBDT) and Random Forest, in large-scale settings. The core innovation of PV-Tree lies in its two-tiered voting protocol for attribute selection, which enables substantial reduction in communication overhead while preserving the statistical fidelity of split selection. PV-Tree is designed for horizontally partitioned data distributed across multiple compute nodes, with formal theoretical guarantees and empirical validation on industry-scale datasets (Meng et al., 2016).
1. Problem Setting and Objective
PV-Tree operates on a training dataset D = {(x_i, y_i)}_{i=1}^{n} with d attributes, partitioned horizontally across M machines, such that machine m holds a local subset D_m with ∪_{m=1}^{M} D_m = D. For each split at a tree node, the goal is to identify the attribute a and split point maximizing a node-informativeness criterion, such as information gain (for classification) or variance reduction (for regression):
- Information Gain (IG): IG(a) = H(Y) − H(Y | a)
- Variance Gain (VG): VG(a) = Var(Y) − Var(Y | a)
with H(Y | a) and Var(Y | a) denoting the conditional entropy and conditional variance of the target Y after splitting on attribute a.
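As a concrete illustration of the classification criterion, a minimal Python computation of information gain for a binary split (the helper names `entropy` and `information_gain` are ours, not from the paper):

```python
import math

def entropy(labels):
    """Shannon entropy H(Y) of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(labels, left, right):
    """IG = H(Y) minus the size-weighted entropy of the two children."""
    n = len(labels)
    return (
        entropy(labels)
        - (len(left) / n) * entropy(left)
        - (len(right) / n) * entropy(right)
    )
```

A perfectly separating split of `[0, 0, 1, 1]` into `[0, 0]` and `[1, 1]` yields the maximum gain of 1 bit.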
2. Algorithmic Framework
The PV-Tree algorithm finds the best split at each node via three main phases: local voting, global voting, and final split selection:
- Local Voting: Each machine m computes the local gain for every attribute on its own data and selects its top-k locally best attributes, using histograms with a fixed number of bins b for continuous features.
- Global Voting: All machines exchange their local top-k attribute sets (e.g., using MPI_AllGather). A global vote count is computed for each attribute, and the 2k most-voted attributes are selected as the global candidate set.
- Final Split Selection: Each machine sends its full-grained histograms for the 2k candidate attributes to a master (or merges them via an all-reduce operation). The master aggregates the global histograms and scans them for the optimal attribute and split point.
These three phases are repeated at every tree node until the stopping criterion for tree growth is met.
3. Communication Complexity and Scalability
The communication efficiency of PV-Tree is achieved by restricting the exchange of full-grained histograms to a subset of attributes:
- Local Voting: Each machine sends k attribute indices (O(k) words).
- Global Voting and Aggregation: Each machine sends 2k histograms of b bins each (O(k·b) words).
- Total: O(k·b) words per split iteration, independent of the total number of attributes d and the number of training instances n.
Comparative communication costs for one split iteration (per node):
| Method | Communication per split | Dependent on d? | Dependent on n? |
|---|---|---|---|
| PV-Tree | O(k·b) | No | No |
| Data-parallel | O(d·b) | Yes | No |
| Attribute-parallel | O(n) | No | Yes |
This configuration supports scalable and efficient distributed training, with empirical evidence of significant reductions in communication volume (e.g., roughly 10 MB for PV-Tree versus 424 MB for data-parallel training in one reported large-scale configuration).
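The cost model above can be compared numerically with a small Python function; the parameter values below (d, n, k, b) are illustrative assumptions of ours, not the paper's exact configuration:

```python
def comm_words(method, d, n, k, b):
    """Approximate words exchanged per machine per split,
    following the cost model in the text (illustrative only)."""
    costs = {
        "pv_tree": k + 2 * k * b,   # k vote indices + 2k candidate histograms of b bins
        "data_parallel": d * b,     # full histograms for all d attributes
        "attribute_parallel": n,    # example-indexed repartition results
    }
    return costs[method]
```

For an assumed scale of d = 1,200 attributes, n = 11M instances, k = 20, and b = 255 bins, PV-Tree exchanges about 10K words per split versus about 306K for data-parallel and 11M for attribute-parallel, matching the qualitative ordering in the table.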
4. Theoretical Analysis and Accuracy Guarantees
PV-Tree provides formal probabilistic guarantees that its two-stage voting protocol will, with high probability, select the statistically optimal split attribute:
Suppose each machine, trained on its local sample of size n/M, includes the truly best attribute (under the true information-gain ranking) in its local top-k with probability p = p(n/M). By a binomial majority-vote argument, the probability that PV-Tree selects the best attribute satisfies a tail bound of the form

P(success) ≥ Σ_{j=⌈M/2⌉}^{M} C(M, j) · p^j · (1 − p)^{M−j},

where p(n/M) → 1 as the per-machine sample size grows. Hence, as n → ∞, this probability approaches 1 for fixed M, k, and d.
This result arises from a combination of concentration bounds comparing empirical and true information gain, and a combinatorial Majoritarian (binomial) argument for the global voting phase (Meng et al., 2016).
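The binomial tail can be made concrete with a short Python function; this is a simplified majority-vote stand-in of our own, not the paper's exact bound, with p the probability that a single machine's local top-k contains the truly best attribute:

```python
import math

def vote_success_lower_bound(M, p):
    """P(a strict majority of M machines vote for the best attribute):
    a simplified binomial-tail stand-in for the paper's guarantee."""
    return sum(
        math.comb(M, j) * p**j * (1 - p) ** (M - j)
        for j in range(M // 2 + 1, M + 1)
    )
```

As p approaches 1 (i.e., with larger per-machine samples), the bound approaches 1; for example, it already exceeds 0.999 for M = 32 machines at p = 0.99.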
5. Empirical Performance and Trade-offs
Extensive experiments demonstrate the superior performance of PV-Tree in terms of wall-clock convergence and communication efficiency across industrial-scale tasks, specifically for GBDT learning:
Example summary (training on Gradient Boosted Decision Trees):
| Task | # Training | # Test | # Attributes | Machines | Sequential | Data-parallel | Attr.-parallel | PV-Tree |
|---|---|---|---|---|---|---|---|---|
| LTR | 11M | 1M | 1200 | 8 | 28,690 s | 32,260 s | 14,660 s | 5,825 s |
| CTR | 235M | 31M | 800 | 32 | 154,112 s | 9,209 s | 26,928 s | 5,349 s |
PV-Tree achieves roughly a 4.9× speedup over sequential training on the LTR data using eight machines and roughly a 28.8× speedup on CTR using 32 machines. Communication cost analyses indicate an order-of-magnitude lower transfer volume relative to alternatives at equivalent accuracy.
A recognized trade-off is that increasing the number of machines M reduces per-machine data while increasing parallelism, so the optimal M depends on the total dataset size n. An excessively small voting size k may degrade accuracy, but a modest k suffices when n is large. The design ensures that statistical correctness is not compromised as communication is reduced.
6. Comparison to Other Parallel Decision Tree Frameworks
PV-Tree contrasts with both data-parallel and attribute-parallel (vertical) approaches:
- Data-parallel approaches exchange histogram data for all attributes, incurring O(d·b) communication cost per split.
- Attribute-parallel methods require global reshuffling of example-indexed values for selected attributes, entailing O(n) communication per split.
- Vertical Hoeffding Tree (VHT) (Kourtellis et al., 2016) achieves parallelization by distributing features across workers with a Model Aggregator architecture but is tailored for streaming scenarios and uses a Hoeffding bound-based split protocol.
The two-stage voting of PV-Tree leverages horizontal partitioning and candidate restriction, thus scaling efficiently to scenarios with high attribute dimensionality d, large sample size n, and many compute nodes.
7. Application Domains and Limitations
PV-Tree is directly applicable to GBDT and Random Forest learning on large-scale tabular data. By sharply lowering communication costs, it enables distributed model induction in environments with limited network bandwidth or where high feature dimensionality would otherwise bottleneck attribute selection. A plausible implication is enhanced applicability on real-world tasks in advertising (CTR), ranking (LTR), and other domains with massive datapoints and features.
Known limitations include reduced per-machine sample size as the number of machines M increases, which may affect statistical power if not offset by a larger dataset n or voting size k. Careful tuning of k is required to balance accuracy and communication efficiency. These characteristics position PV-Tree as a method with broad practical scalability, strong theoretical underpinnings, and empirical validation as a state-of-the-art approach to scalable parallel decision tree induction (Meng et al., 2016).