- The paper introduces the Parallel Voting Decision Tree (PV-Tree) algorithm, designed to efficiently train decision trees in parallel on large datasets by drastically reducing communication costs.
- PV-Tree employs a two-stage local and global voting mechanism to select attributes, significantly reducing communication overhead compared to traditional parallel methods.
- Experiments on real data show PV-Tree achieves significant speedup and maintains high accuracy, demonstrating its effectiveness and communication efficiency for large-scale distributed learning.
A Communication-Efficient Parallel Algorithm for Decision Tree
The paper presents a novel approach to parallelizing decision tree algorithms, addressing the challenges posed by big data scenarios. Decision trees and their ensemble extensions, such as Gradient Boosting Decision Trees (GBDT) and Random Forest (RF), are widely used for their effectiveness and interpretability. However, existing parallelization strategies often incur high communication costs, which limits their scalability. This work introduces the Parallel Voting Decision Tree (PV-Tree) algorithm, designed to reduce these communication costs while maintaining high accuracy.
Overview of Methodology
PV-Tree fundamentally differs from traditional data-parallel strategies: rather than aggregating statistics for every attribute, it employs a two-stage voting mechanism, first local and then global, to significantly reduce communication. The workflow, sketched in code after this list, involves:
- Local Voting: Each machine ranks attributes by a local informativeness score (information gain for classification, variance gain for regression) and selects its local top-$k$ attributes.
- Global Voting: The local votes are tallied across machines, and the $2k$ attributes receiving the most votes become the global candidates. Global communication therefore concerns only a small, fixed number of attributes, drastically lowering the data-exchange overhead.
- Best Attribute Identification: Full-grained histograms of the globally top-$2k$ attributes are aggregated across machines to pinpoint the best attribute and its split point. Notably, this step's communication cost is independent of the total number of attributes.
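The following is a minimal, single-split sketch of the three stages, assuming a histogram-based learner; the names (`local_vote`, `global_vote`, `all_gather`, `gain_from_histogram`) and the NumPy framing are illustrative choices, not the paper's implementation:

```python
import numpy as np

def local_vote(local_gains: np.ndarray, k: int) -> np.ndarray:
    """Stage 1 (local voting): rank attributes by this machine's
    informativeness scores (e.g., information gain) and vote for
    the local top-k."""
    return np.argsort(local_gains)[::-1][:k]

def global_vote(all_votes: list, num_attributes: int, k: int) -> np.ndarray:
    """Stage 2 (global voting): tally the local votes from all machines
    and keep the 2k attributes with the most votes. Only k attribute
    indices per machine have been communicated up to this point."""
    counts = np.zeros(num_attributes, dtype=int)
    for votes in all_votes:  # one vote array per machine
        counts[votes] += 1
    return np.argsort(counts)[::-1][: 2 * k]

def best_split(candidates, local_histograms, all_gather, gain_from_histogram):
    """Stage 3 (best attribute identification): aggregate full-grained
    histograms for only the 2k candidates, then pick the attribute and
    split point with the highest merged gain. `all_gather` stands in for
    the communication primitive; its volume is proportional to
    2k * (bins per histogram), independent of the total attribute count.
    `gain_from_histogram` is assumed to return a (gain, threshold) pair."""
    merged = {a: sum(h[a] for h in all_gather(local_histograms))
              for a in candidates}
    scored = {a: gain_from_histogram(merged[a]) for a in candidates}
    best = max(scored, key=lambda a: scored[a][0])
    return best, scored[best][1]
```

Each leaf split repeats these three stages, so the only payloads on the network are $k$ attribute indices per machine plus histograms for $2k$ attributes.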
The theoretical analysis shows that PV-Tree approaches the behavior of the sequential algorithm: as the training data grows, the probability that the voting procedure selects the most informative attribute approaches one. The interplay of local and global voting thus strikes a balance between data parallelism and communication efficiency.
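In hedged form (the notation below is ours, not the paper's exact theorem statement): letting $a^{*}$ denote the attribute with the largest true informativeness score and $n$ the per-machine training size, the analysis argues that

$$\Pr\left[a^{*} \in \mathrm{GlobalTop}_{2k}\right] \longrightarrow 1 \quad \text{as } n \to \infty,$$

with faster convergence when the score of $a^{*}$ is well separated from those of the competing attributes.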
Experimental Validation and Results
Experiments on real-world datasets, specifically learning-to-rank (LTR) and ad click prediction tasks, show that PV-Tree surpasses baseline parallel algorithms in both speed and accuracy. For instance, on the LTR task with eight machines, PV-Tree achieved a fivefold speedup over the sequential algorithm at equivalent accuracy. Its communication advantage is clearest against attribute-parallel and conventional data-parallel methods, whose per-split costs grow with the data size and with the number of attributes, respectively; PV-Tree avoids both dependencies, as the rough accounting below illustrates.
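As an approximate per-split, per-machine accounting (our simplification, assuming $n$ training instances, $d$ attributes, and $b$ histogram bins per attribute):

$$C_{\text{attribute-parallel}} = O(n), \qquad C_{\text{data-parallel}} = O(d \cdot b), \qquad C_{\text{PV-Tree}} = O(k + 2k \cdot b),$$

so PV-Tree's communication depends only on the small constants $k$ and $b$, not on $n$ or $d$.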
Implications and Future Directions
The introduction of PV-Tree has notable implications for large-scale data processing in distributed environments. It circumvents bottlenecks faced by existing methods, offering a scalable solution adaptable to the growing demands of big data workflows. The reduced communication cost also suggests potential applicability in resource-constrained environments.
The paper hints at extending PV-Tree's principles to other machine learning algorithms. Open-sourcing the algorithm could foster broader adoption and adaptation in diverse research and industrial contexts. Future research could also explore adaptive mechanisms for setting the parameter $k$ dynamically, depending on data characteristics, to further optimize the trade-off between communication and accuracy.
In conclusion, PV-Tree represents a substantial step forward in efficient decision tree training, striking a critical balance between scalability and communication overhead in distributed machine learning frameworks. It exemplifies the ongoing effort to design algorithms that meet the ubiquitous challenge of processing vast datasets efficiently.