
Parallel Voting Decision Tree (PV-Tree)

Updated 24 January 2026
  • PV-Tree is a distributed algorithm that uses a two-tier voting protocol to efficiently select split attributes with reduced communication overhead.
  • It operates in three phases—local voting, global voting, and final split selection—to build decision trees and ensemble variants on horizontally partitioned data.
  • Empirical studies demonstrate significant speedups and lower transfer volumes compared to traditional data- and attribute-parallel methods, with strong theoretical guarantees.

Parallel Voting Decision Tree (PV-Tree) is a communication-efficient distributed algorithm for learning decision trees and their ensemble variants, such as Gradient Boosted Decision Trees (GBDT) and Random Forest, in large-scale settings. The core innovation of PV-Tree lies in its two-tiered voting protocol for attribute selection, which enables substantial reduction in communication overhead while preserving the statistical fidelity of split selection. PV-Tree is designed for horizontally partitioned data distributed across multiple compute nodes, with formal theoretical guarantees and empirical validation on industry-scale datasets (Meng et al., 2016).

1. Problem Setting and Objective

PV-Tree operates on a training dataset $D = \{(x_i, y_i)\}_{i=1}^N$, partitioned horizontally across $M$ machines, such that each machine $m$ holds a local shard $D_m$ with $|D_m| = n = N/M$. For each split at a tree node $O$, the goal is to identify the attribute $j^*$ and split point $w^*$ maximizing a node-informativeness criterion, such as information gain (for classification) or variance reduction (for regression):

  • Information Gain (IG):

$IG_j(w; O) = H(Y \mid O) - [\, P_L\, H(Y \mid X_j < w) + P_R\, H(Y \mid X_j \geq w) \,]$

  • Variance Gain (VG):

$VG_j(w; O) = \operatorname{Var}(Y \mid O) - [\, P_L \operatorname{Var}(Y \mid X_j < w) + P_R \operatorname{Var}(Y \mid X_j \geq w) \,]$

with $H(\cdot \mid \cdot)$ and $\operatorname{Var}(\cdot \mid \cdot)$ denoting conditional entropy and conditional variance, respectively, and $P_L$, $P_R$ the fractions of instances falling to the left and right of the split.
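As a concrete reference point, both criteria can be evaluated directly for a single attribute and threshold. This is a plain-Python sketch; the function names are illustrative, not from the paper:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(x, y, w):
    """IG_j(w; O): entropy reduction from splitting one attribute's
    values x at threshold w (classification)."""
    left = [yi for xi, yi in zip(x, y) if xi < w]
    right = [yi for xi, yi in zip(x, y) if xi >= w]
    p_l, p_r = len(left) / len(y), len(right) / len(y)
    return entropy(y) - (p_l * entropy(left) + p_r * entropy(right))

def variance_gain(x, y, w):
    """VG_j(w; O): variance reduction for regression targets."""
    def var(v):
        if not v:
            return 0.0
        m = sum(v) / len(v)
        return sum((vi - m) ** 2 for vi in v) / len(v)
    left = [yi for xi, yi in zip(x, y) if xi < w]
    right = [yi for xi, yi in zip(x, y) if xi >= w]
    p_l, p_r = len(left) / len(y), len(right) / len(y)
    return var(y) - (p_l * var(left) + p_r * var(right))
```

For a perfectly separating threshold, the gain equals the parent node's impurity, since both children become pure.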

2. Algorithmic Framework

The PV-Tree algorithm finds the best split at each node via three main phases: local voting, global voting, and final split selection:

  1. Local Voting: Each machine $m$ computes the local gain of every attribute $j$ on its shard $D_m$, selecting the top-$k$ locally best attributes, using histogram binning (with a fixed number of bins $b$) for continuous features.
  2. Global Voting: All machines exchange their local top-$k$ attribute sets (e.g., using MPI_AllGather). A global vote count is computed for each attribute, and the top-$2k$ globally most-voted attributes are selected as the candidate set.
  3. Final Split Selection: Each machine sends its full-binned histograms for the $2k$ candidate attributes to a master (or via an all-reduce operation). The master aggregates the global histograms and scans them for the best split $(j^*, w^*)$.

The three phases are formalized as pseudocode in Meng et al. (2016), and the procedure is repeated at every tree node.
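The three phases can also be sketched as a single-process simulation in which "machines" are entries of a list of horizontal shards. This is a hedged reconstruction, not the paper's pseudocode: gain is variance reduction computed from binned (count, sum, sum-of-squares) histograms, and all names and defaults (`k`, `n_bins`) are illustrative:

```python
import numpy as np

def pv_tree_split(machine_data, k=2, n_bins=16):
    """Single-process sketch of PV-Tree's split search for one node.
    machine_data: list of (X_m, y_m) horizontal shards.
    Returns (attribute index, threshold)."""
    d = machine_data[0][0].shape[1]
    all_X = np.vstack([X for X, _ in machine_data])
    # Shared per-attribute bin edges (agreed on up front in practice).
    edges = [np.quantile(all_X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
             for j in range(d)]

    def histogram(X, y, j):
        """Per-bin (count, sum y, sum y^2) for attribute j."""
        bins = np.searchsorted(edges[j], X[:, j], side="right")
        h = np.zeros((n_bins, 3))
        for b, t in zip(bins, y):
            h[b] += (1.0, t, t * t)
        return h

    def scan(h):
        """Best variance gain and bin boundary from an aggregated histogram."""
        def sse(c, s, q):  # sum of squared errors from (count, sum, sumsq)
            return q - s * s / c if c > 0 else 0.0
        total = h.sum(axis=0)
        best, best_b = -np.inf, None
        left = np.zeros(3)
        for b in range(n_bins - 1):
            left += h[b]
            right = total - left
            if left[0] == 0 or right[0] == 0:
                continue
            g = sse(*total) - sse(*left) - sse(*right)
            if g > best:
                best, best_b = g, b
        return best, best_b

    # Phase 1: local voting -- each machine nominates its top-k attributes.
    votes = np.zeros(d, dtype=int)
    for X, y in machine_data:
        local = [scan(histogram(X, y, j))[0] for j in range(d)]
        votes[np.argsort(local)[-k:]] += 1

    # Phase 2: global voting -- keep only the 2k most-voted attributes.
    candidates = np.argsort(votes)[-2 * k:]

    # Phase 3: aggregate full histograms for candidates only; pick best split.
    best = (-np.inf, None, None)
    for j in candidates:
        h = sum(histogram(X, y, j) for X, y in machine_data)
        g, b = scan(h)
        if g > best[0]:
            best = (g, j, edges[j][b])
    return best[1], best[2]
```

Note that only phase 3's histogram aggregation touches per-bin statistics across machines; phases 1 and 2 exchange nothing but attribute indices, which is where the communication savings come from.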

3. Communication Complexity and Scalability

The communication efficiency of PV-Tree is achieved by restricting the exchange of full-grained histograms to a subset of attributes:

  • Local Voting: Each machine sends its $k$ locally voted attribute indices ($O(k)$ words).
  • Global Voting and Aggregation: Each machine sends $2k$ histograms of $b$ bins each ($O(2kb)$ words).
  • Total: $O(k + 2kb)$ words per split iteration, independent of the total number of attributes $d$ and the number of training instances $N$.

Comparative communication costs for one split iteration (per node):

| Method | Communication per split | Depends on $d$? | Depends on $N$? |
|---|---|---|---|
| PV-Tree | $O(k + 2kb)$ | No | No |
| Data-parallel | $O(d \cdot b)$ | Yes | No |
| Attribute-parallel | $O(N)$ | No | Yes |

(Meng et al., 2016)

This configuration supports scalable and efficient distributed training, with empirical evidence of significant reductions in communication volume (e.g., roughly 10 MB for PV-Tree vs. 424 MB for data-parallel in one reported large-scale configuration).
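These asymptotics can be made concrete with a toy cost model. The constants `k` and `bins` below are illustrative placeholders, not the paper's exact accounting; the point is that PV-Tree's per-split cost is flat as $d$ and $N$ grow:

```python
def comm_words(method, d, n_total, k=20, bins=255):
    """Rough per-split, per-machine communication cost in words,
    following the asymptotic costs above. Constants are illustrative."""
    if method == "pv-tree":
        return k + 2 * k * bins   # k voted indices + 2k full histograms
    if method == "data-parallel":
        return d * bins           # one histogram for every attribute
    if method == "attribute-parallel":
        return n_total            # instance-aligned partition results
    raise ValueError(f"unknown method: {method}")
```

PV-Tree's cost is identical whether $d$ is a thousand or a million, which is the source of its scalability advantage over the data-parallel scheme.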

4. Theoretical Analysis and Accuracy Guarantees

PV-Tree provides formal probabilistic guarantees that its two-stage voting protocol will, with high probability, select the statistically optimal split attribute:

Given the true information-gain ranking of the attributes, the probability that the globally voted top-$2k$ candidate set contains the best attribute is bounded below by an expression depending on the per-machine sample size $n$, the number of machines $M$, the voting size $k$, and the gaps between the true attribute gains. As $n \to \infty$, this probability approaches 1 for fixed $k$, $M$, and $d$.

This result arises from a combination of concentration bounds comparing empirical and true information gain, and a combinatorial majority-voting (binomial) argument for the global voting phase (Meng et al., 2016).
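The convergence behavior can be illustrated with a small Monte-Carlo simulation. Here each machine observes the true gains perturbed by estimation noise of scale $1/\sqrt{n}$ — an illustrative noise model standing in for the paper's concentration bound, with all parameters chosen for demonstration — and we measure how often the best attribute survives both voting stages:

```python
import numpy as np

def voting_success_rate(n, d=100, M=8, k=5, trials=200, seed=0):
    """Monte-Carlo sketch of the two-stage voting guarantee.

    Each machine sees the true gains plus noise of scale 1/sqrt(n);
    we measure how often the top-2k global vote still contains the
    best attribute (index 0)."""
    rng = np.random.default_rng(seed)
    true_gain = 1.0 - np.arange(d) / d   # attribute 0 is best
    hits = 0
    for _ in range(trials):
        votes = np.zeros(d, dtype=int)
        for _ in range(M):
            noisy = true_gain + rng.normal(0.0, 1.0 / np.sqrt(n), d)
            votes[np.argsort(noisy)[-k:]] += 1   # local top-k vote
        candidates = np.argsort(votes)[-2 * k:]  # global top-2k
        hits += 0 in candidates
    return hits / trials
```

With large per-machine samples the success rate approaches 1, consistent with the asymptotic guarantee; with tiny samples the local votes scatter and the best attribute is frequently missed.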

5. Empirical Performance and Trade-offs

Extensive experiments demonstrate the superior performance of PV-Tree in terms of wall-clock convergence and communication efficiency across industrial-scale tasks, specifically for GBDT learning:

Example summary (GBDT training):

| Task | # Training | # Test | # Features | Machines | Sequential | Data-parallel | Attr.-parallel | PV-Tree |
|---|---|---|---|---|---|---|---|---|
| LTR | 11M | 1M | 1,200 | 8 | 28,690 s | 32,260 s | 14,660 s | 5,825 s |
| CTR | 235M | 31M | 800 | 32 | 154,112 s | 9,209 s | 26,928 s | 5,349 s |

(Meng et al., 2016)

PV-Tree achieves roughly a 5× speedup over sequential training on the LTR data using eight machines and roughly a 29× speedup on CTR using 32 machines. Communication cost analyses indicate an order-of-magnitude lower transfer volume relative to alternatives at equivalent accuracy.

A recognized trade-off is that increasing the number of machines $M$ reduces per-node data while increasing parallelism; the optimal $M$ depends on the fixed total data size $N$. An excessively small voting parameter $k$ may degrade accuracy, but a modest $k$ suffices when the per-machine sample size $n$ is large. The design ensures statistical correctness is not compromised as communication is reduced.

6. Comparison to Other Parallel Decision Tree Frameworks

PV-Tree contrasts with both data-parallel and attribute-parallel (vertical) approaches:

  • Data-parallel approaches exchange histogram data for all attributes, incurring $O(d \cdot b)$ cost per split.
  • Attribute-parallel methods require globally redistributing example-indexed partition results for the selected attribute, entailing $O(N)$ communication per split.
  • Vertical Hoeffding Tree (VHT) (Kourtellis et al., 2016) achieves parallelization by distributing features across workers with a Model Aggregator architecture but is tailored for streaming scenarios and uses a Hoeffding bound-based split protocol.

The two-stage voting of PV-Tree leverages horizontal partitioning and candidate restriction, thus scaling efficiently to scenarios with high attribute dimensionality $d$, large instance counts $N$, and many compute nodes.

7. Application Domains and Limitations

PV-Tree is directly applicable to GBDT and Random Forest learning on large-scale tabular data. By sharply lowering communication costs, it enables distributed model induction in environments with limited network bandwidth or where high feature dimensionality would otherwise bottleneck attribute selection. This makes it well suited to real-world tasks in advertising (CTR prediction), ranking (LTR), and other domains with massive numbers of data points and features.

Known limitations include reduced per-node sample size $n$ as the number of machines $M$ increases, which may affect statistical power if not offset by a larger total dataset or voting parameter $k$. Careful tuning of $k$ is required to balance accuracy and communication efficiency. These characteristics position PV-Tree as a method with broad practical scalability, strong theoretical underpinnings, and empirical validation as a state-of-the-art approach to scalable parallel decision tree induction (Meng et al., 2016).

References

  • Meng, Q., Ke, G., Wang, T., Chen, W., Ye, Q., Ma, Z.-M., and Liu, T.-Y. (2016). A Communication-Efficient Parallel Algorithm for Decision Tree. Advances in Neural Information Processing Systems (NIPS) 29.
  • Kourtellis, N., De Francisci Morales, G., Bifet, A., and Murdopo, A. (2016). VHT: Vertical Hoeffding Tree. IEEE International Conference on Big Data.
