A Parallel Random Forest Algorithm for Big Data in a Spark Cloud Computing Environment (1810.07748v1)

Published 17 Oct 2018 in cs.DC

Abstract: With the emergence of the big data age, the issue of how to obtain valuable knowledge from a dataset efficiently and accurately has attracted increasing attention from both academia and industry. This paper presents a Parallel Random Forest (PRF) algorithm for big data on the Apache Spark platform. The PRF algorithm is optimized based on a hybrid approach combining data-parallel and task-parallel optimization. From the perspective of data-parallel optimization, a vertical data-partitioning method is performed to reduce the data communication cost effectively, and a data-multiplexing method is performed to allow the training dataset to be reused and diminish the volume of data. From the perspective of task-parallel optimization, a dual parallel approach is carried out in the training process of RF, and a task Directed Acyclic Graph (DAG) is created according to the parallel training process of PRF and the dependence of the Resilient Distributed Datasets (RDD) objects. Then, different task schedulers are invoked for the tasks in the DAG. Moreover, to improve the algorithm's accuracy for large, high-dimensional, and noisy data, we perform a dimension-reduction approach in the training process and a weighted voting approach in the prediction process prior to parallelization. Extensive experimental results indicate the superiority and notable advantages of the PRF algorithm over the relevant algorithms implemented by Spark MLlib and other studies in terms of classification accuracy, performance, and scalability.

Citations (352)

Summary

  • The paper introduces a novel PRF algorithm that integrates vertical data partitioning and dual-parallel training for efficient big data processing in Spark.
  • It achieves up to a 10.6% increase in classification accuracy and improved execution time across large-scale datasets compared to traditional RF methods.
  • The approach incorporates dimension reduction and weighted voting to enhance accuracy and scalability in distributed cloud computing environments.

A Parallel Random Forest Algorithm for Big Data in a Spark Cloud Computing Environment

The paper presents a Parallel Random Forest (PRF) algorithm tailored for big data processing within Apache Spark, addressing challenges associated with efficiently and accurately extracting valuable knowledge from voluminous datasets. The PRF algorithm is engineered using hybrid data-parallel and task-parallel optimization techniques, taking advantage of Spark's Resilient Distributed Dataset (RDD) and Directed Acyclic Graph (DAG) models to enhance performance and scalability.

Summary of Contributions

The primary contributions of the paper include:

  1. Data-Parallel Optimization: The authors propose a vertical data-partitioning method and a data-multiplexing approach to minimize data communication costs and increase data reuse. Partitioning the data by feature rather than by row keeps each feature's split computations local to a single node, reducing cross-node transfer in distributed environments, while the data-multiplexing method lets every tree reuse the same training dataset, so the stored data volume stays constant irrespective of the RF model's scale (a sketch of the partitioning idea follows this list).
  2. Task-Parallel Optimization: A dual-parallel training approach is introduced that constructs decision trees concurrently while the sub-nodes of each tree are processed in parallel as separate tasks. A task DAG is built from this training process and the RDD dependencies, and task schedulers then exploit Spark's node-local and cluster-global execution features to minimize inter-node data transfer and maximize computational resource utilization (see the second sketch below).
  3. Dimension Reduction and Weighted Voting: The PRF algorithm incorporates a dimension-reduction strategy during training to handle high-dimensional datasets, improving model efficiency without sacrificing accuracy. Furthermore, a weighted voting mechanism, in which each tree's vote is weighted by its classification accuracy, replaces the direct majority voting of traditional RF and improves prediction accuracy (see the third sketch below).
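
To make the vertical-partitioning idea concrete, here is a minimal Spark sketch in Scala. It is an illustration of the concept under assumptions, not the paper's implementation: the Sample type, the toy dataset, and the choice of three hash partitions are invented for the example, and the per-feature gain computation is left as a placeholder.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Minimal sketch of vertical (feature-wise) partitioning on a dense toy
// dataset; illustrative only, not the paper's code.
object VerticalPartitionSketch {
  case class Sample(label: Double, features: Array[Double])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("vertical-partition-sketch")
      .master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val data: RDD[Sample] = sc.parallelize(Seq(
      Sample(0.0, Array(1.0, 5.0, 0.2)),
      Sample(1.0, Array(2.0, 4.0, 0.9)),
      Sample(0.0, Array(1.5, 6.0, 0.1)),
      Sample(1.0, Array(2.5, 3.0, 0.8))
    ))

    // Broadcast the (small) label vector so every feature partition can
    // evaluate splits locally without shuffling whole rows.
    val labels = sc.broadcast(data.map(_.label).collect())

    // Re-key the dataset as (featureIndex, column): all values of one
    // feature land in one partition, so split search for that feature
    // never crosses node boundaries.
    val featureColumns: RDD[(Int, Array[(Long, Double)])] =
      data.zipWithIndex()
        .flatMap { case (s, row) =>
          s.features.zipWithIndex.map { case (v, f) => (f, (row, v)) }
        }
        .groupByKey(new HashPartitioner(3))
        .mapValues(_.toArray.sortBy(_._1))

    featureColumns.foreach { case (f, col) =>
      // Placeholder: a real implementation would compute split gain here,
      // using col and labels.value.
      println(s"feature $f holds ${col.length} values; " +
        s"labels: ${labels.value.mkString(",")}")
    }
    spark.stop()
  }
}
```

Because each tree reuses the same cached feature columns, adding trees to the forest grows the task count but not the stored data, which is the point of the data-multiplexing step.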
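For the dual-parallel training, one plausible realization (an assumption, not the paper's exact scheduler design) is to submit one Spark job per tree from concurrent futures under Spark's FAIR scheduler, so tree-level and partition-level parallelism compose. The per-tree body below is a stand-in for bootstrap sampling and split search.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

// Sketch of two-level task parallelism: concurrent Spark jobs (one per
// tree) on top of Spark's own stage/partition parallelism.
object DualParallelSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("dual-parallel-sketch")
      .master("local[*]")
      .config("spark.scheduler.mode", "FAIR") // concurrent jobs share executors
      .getOrCreate()
    val sc = spark.sparkContext

    val data = sc.parallelize(1 to 100000).cache() // reused by every tree job

    val treeJobs = (1 to 5).map { treeId =>
      Future {
        // Outer parallelism: each Future submits an independent Spark job.
        // Inner parallelism: the job's stages run across all partitions.
        data.sample(withReplacement = true, fraction = 1.0, seed = treeId)
          .map(_.toLong)
          .reduce(_ + _) // placeholder for real per-tree training work
      }
    }

    val results = Await.result(Future.sequence(treeJobs), Duration.Inf)
    results.zipWithIndex.foreach { case (sum, i) => println(s"tree ${i + 1}: $sum") }
    spark.stop()
  }
}
```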
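The weighted-voting rule itself is simple enough to sketch in plain Scala. The Tree type, its predict function, and the accuracy field are illustrative assumptions; the paper derives tree weights from classification accuracy, and any held-out estimate (e.g., out-of-bag accuracy) could play that role.

```scala
// Plain-Scala sketch of accuracy-weighted voting; the data structures are
// illustrative, not the paper's.
object WeightedVotingSketch {
  case class Tree(predict: Array[Double] => Int, accuracy: Double)

  def weightedVote(trees: Seq[Tree], x: Array[Double]): Int =
    trees.map(t => (t.predict(x), t.accuracy)) // each vote carries the tree's weight
      .groupBy(_._1)                           // bucket votes by predicted class
      .map { case (cls, votes) => cls -> votes.map(_._2).sum } // weight per class
      .maxBy(_._2)._1                          // pick the heaviest class

  def main(args: Array[String]): Unit = {
    val trees = Seq(
      Tree(_ => 0, accuracy = 0.95), // one strong tree voting for class 0
      Tree(_ => 1, accuracy = 0.45), // two weak trees voting for class 1
      Tree(_ => 1, accuracy = 0.40)
    )
    // Plain majority voting would return 1 (two votes to one); weighting by
    // accuracy returns 0, since 0.95 > 0.45 + 0.40.
    println(weightedVote(trees, Array.empty[Double]))
  }
}
```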

Numerical Results and Claims

The paper substantiates its claims with extensive experimental results, highlighting the PRF algorithm's superiority over existing RF implementations, including Spark MLlib's RF and DRF, in classification accuracy, execution time, and communication cost. The empirical performance is summarized as follows:

  • Improved classification accuracy, with up to a 10.6% increase over traditional RF, 7.3% over DRF, and 5.8% over Spark MLRF.
  • Enhanced execution time efficiency, especially pronounced with increasing dataset size and complexity, demonstrating PRF's adaptability and resource efficiency.
  • Significant reduction in data communication costs within distributed Spark environments due to a novel data allocation strategy paired with effective task scheduling.

Implications and Speculations

The practical implications of this work are significant, particularly in domains that depend on rapid and accurate insights from vast, high-dimensional datasets, such as genomics, climate modeling, and Internet of Things (IoT) analytics. Theoretically, the enhanced parallel-processing model offers an efficient way to optimize ensemble learning tasks within cloud-based platforms.

Future directions may explore extending PRF to handle incremental data streams in real time, enhancing its relevance for dynamic, data-intensive applications. Moreover, further refinement of the resource-allocation and task-scheduling mechanisms could push the boundaries of real-time processing and adaptive learning in distributed cloud environments.

This paper's PRF algorithm represents a significant advancement in cloud-based big data analytics, setting a robust foundation for future exploration and optimization within parallel and distributed learning frameworks.