Performance comparison of Dask and Apache Spark on HPC systems for Neuroimaging (2406.01409v1)

Published 3 Jun 2024 in cs.DC

Abstract: The general increase in data size and data sharing motivates the adoption of Big Data strategies in several scientific disciplines. However, while several options are available, no particular guidelines exist for selecting a Big Data engine. In this paper, we compare the runtime performance of two popular Big Data engines with Python APIs, Apache Spark, and Dask, in processing neuroimaging pipelines. Our experiments use three synthetic neuroimaging applications to process the 606 GiB BigBrain image and an actual pipeline to process data from thousands of anatomical images. We benchmark these applications on a dedicated HPC cluster running the Lustre file system while using varying combinations of the number of nodes, file size, and task duration. Our results show that although there are slight differences between Dask and Spark, the performance of the engines is comparable for data-intensive applications. However, Spark requires more memory than Dask, which can lead to slower runtime depending on configuration and infrastructure. In general, the limiting factor was the data transfer time. While both engines are suitable for neuroimaging, more efforts need to be put to reduce the data transfer time and the memory footprint of applications.


Summary

  • The paper empirically compares Dask and Spark on HPC neuroimaging tasks, demonstrating that memory allocation is critical for avoiding performance degradation during intensive computations.
  • It benchmarks synthetic and real neuroimaging workflows, revealing that I/O bottlenecks and scheduling overhead can limit scalability on Lustre-based HPC systems.
  • Findings recommend selecting Dask for Python-centric workflows and optimizing Spark configurations, offering practical guidance for system improvements in computational neuroscience.

Comparative Performance Analysis of Dask and Apache Spark for Neuroimaging on HPC Systems

The paper presents an empirical performance analysis of two prominent Big Data frameworks, Dask and Apache Spark, in the context of neuroimaging workloads on High-Performance Computing (HPC) systems. The work is motivated by the rapid growth of data sizes in scientific disciplines, which calls for efficient data processing engines; despite the variety of engines available, there are few concrete guidelines for selecting one for such workloads.

Experimental Setup

The paper benchmarks Dask and Spark with three synthetic applications and one real neuroimaging pipeline:

  • Increment: A synthetic application that processes image blocks independently, with no data exchange between workers.
  • Histogram: A synthetic application that computes the frequency distribution of voxel intensities in the BigBrain image (a minimal Dask sketch follows this list).
  • Kmeans: A synthetic application that clusters the voxel intensities of the BigBrain image, involving significant inter-worker data shuffling.
  • BIDS App Example: A real neuroimaging pipeline that computes brain volumes from thousands of anatomical MRI images.
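
To give a concrete sense of how such a block-wise application can be expressed with Dask's Python API, here is a minimal sketch of the Histogram computation. The file layout, block shape, dtype, and bin settings are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal Dask sketch of the Histogram application: each image block is
# read independently and the voxel-intensity histogram is reduced in
# parallel. File names, block shape, and dtype are assumptions.
import glob

import dask
import dask.array as da
import nibabel as nib
import numpy as np

BLOCK_SHAPE = (512, 512, 512)  # hypothetical block dimensions

@dask.delayed
def load_block(path):
    # Read one image block from the shared (e.g. Lustre) file system.
    return np.asanyarray(nib.load(path).dataobj)

block_files = sorted(glob.glob("bigbrain_blocks/*.nii"))  # hypothetical layout
blocks = [
    da.from_delayed(load_block(f), shape=BLOCK_SHAPE, dtype=np.uint8)
    for f in block_files
]
voxels = da.concatenate([b.ravel() for b in blocks])

# Frequency distribution of voxel intensities, computed across workers.
counts, edges = da.histogram(voxels, bins=256, range=(0, 256))
print(counts.compute())
```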

These applications were executed on a dedicated HPC cluster running the Lustre file system, with Dask and Spark configured to use equivalent resource allocations; a sketch of such a matched configuration is shown below.
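
The following is a hedged sketch of what "equivalent resource allocations" can look like in practice for the two engines. The worker counts, core counts, memory limits, and scheduler address are placeholders, not the paper's actual settings.

```python
# Sketch of matching resources between the two engines (placeholder values).
# Spark side: a fixed number of executors, cores, and memory per executor.
from pyspark import SparkConf, SparkContext

spark_conf = (
    SparkConf()
    .setAppName("neuroimaging-benchmark")
    .set("spark.executor.instances", "8")
    .set("spark.executor.cores", "8")
    .set("spark.executor.memory", "96g")
)
sc = SparkContext(conf=spark_conf)

# Dask side: workers launched with matching process/thread/memory limits, e.g.
#   dask-worker tcp://scheduler:8786 --nworkers 8 --nthreads 8 --memory-limit 96GB
# and a client connected to the same scheduler (address is hypothetical).
from dask.distributed import Client

client = Client("tcp://scheduler:8786")
```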

Key Findings

  1. Performance Parity with Memory Considerations:
    • The performance of Dask and Spark is comparable for most data-intensive applications, albeit with variations arising from specific configurations and workload characteristics. However, Spark exhibited significant memory pressure during Kmeans processing, leading to worker restarts and increased task recomputation. Increasing the memory allocation notably improved Spark's performance, suggesting that Spark can perform comparably to Dask if memory constraints are adequately managed.
  2. I/O Bottleneck:
    • Both engines experienced I/O bottlenecks under concurrent file access on Lustre, indicating that parallel I/O capacity often limited application performance. This aligns with prior observations that data transfer to and from the file system, rather than network communication, tends to be the limiting factor, particularly for data-intensive applications such as Increment and Histogram.
  3. Scheduling Overhead:
    • Dask showed slightly higher scheduling overhead than Spark, owing to its Python-based scheduler, which might limit its scalability at higher node counts. In practice, however, this overhead was offset by efficiencies in data transfer, resulting in similar total execution times.
  4. Impact of Language and Ecosystem Integration:
    • Dask's integration with the Python scientific ecosystem, its compatibility with widely used libraries (e.g., NumPy, Pandas), and its simpler configuration make it an excellent candidate for Python-centric neuroimaging applications.
  5. Dask's Superiority in Task Scheduling:
    • For applications entailing high inter-worker communication, Dask often scheduled tasks more efficiently by initiating computation before all data reads had completed, potentially mitigating I/O bottlenecks (sketched below).
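
The overlap of reads and computation described in the last finding can be sketched with Dask's futures interface, where a downstream task runs as soon as its inputs resolve. The reader and processing functions, file paths, and scheduler address below are illustrative stand-ins, not the paper's code.

```python
# Hedged sketch of read/compute overlap with Dask futures: process_block
# runs for each block as soon as that block's read completes, without
# waiting for all reads to finish.
from dask.distributed import Client, as_completed

def read_block(path):
    with open(path, "rb") as f:   # stand-in for an image-block reader
        return f.read()

def process_block(raw):
    return len(raw)               # stand-in for the real per-block computation

client = Client("tcp://scheduler:8786")                               # hypothetical
block_paths = [f"bigbrain_blocks/block{i}.bin" for i in range(125)]   # hypothetical

reads = client.map(read_block, block_paths)   # all reads submitted immediately
results = client.map(process_block, reads)    # chained on the read futures

for future in as_completed(results):
    print(future.result())
```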

Recommendations and Future Directions

The findings suggest practical recommendations for researchers:

  • Select Dask for Python-based workflows, especially those embedded in the scientific Python ecosystem; its strong integration with libraries such as NumPy and Pandas and its more informative dashboard are beneficial.
  • Optimize Spark configurations to avoid memory overruns, for instance by tuning the memory allocated to each worker or node (see the sketch after this list).
  • Improve infrastructure to reduce bottlenecks, for example through high-performance networking and more scalable file systems, easing the tradeoff between compute and data throughput and further improving Big Data application performance.
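
For the Spark recommendation above, the kind of per-executor memory tuning involved might look like the following; the specific values are placeholders, not configurations reported in the paper.

```python
# Hedged example of per-executor memory tuning in Spark (placeholder values).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("kmeans-bigbrain")
    .config("spark.executor.memory", "96g")          # JVM heap per executor
    .config("spark.executor.memoryOverhead", "16g")  # off-heap / Python headroom
    .config("spark.memory.fraction", "0.8")          # share for execution + storage
    .getOrCreate()
)
```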

Conclusion

In summary, the paper contributes a nuanced performance assessment of Dask and Apache Spark, offering insights into their respective strengths and going beyond surface-level benchmarking. While both frameworks demonstrate potential for neuroimaging data processing, the choice would largely depend on language preference, system configurations, and specific application needs in the computational neuroscience domain. Future research should focus on developing strategies to further mitigate I/O limitations and memory usage while improving the frameworks' adaptability to varying HPC configurations.