Hierarchical storage management in user space for neuroimaging applications (2404.11556v1)
Abstract: Neuroimaging open-data initiatives have led to increased availability of large scientific datasets. While these datasets are shifting the processing bottleneck from compute-intensive to data-intensive, current standardized analysis tools have yet to adopt strategies that mitigate the costs associated with large data transfers. A major challenge in adapting neuroimaging applications for data-intensive processing is that they must be entirely rewritten. To facilitate data management for standardized neuroimaging tools, we developed Sea, a library that intercepts and redirects application read and write calls to minimize data transfer time. In this paper, we investigate the performance of Sea on three preprocessing pipelines implemented using standard toolboxes (FSL, SPM and AFNI), using three neuroimaging datasets of different sizes (OpenNeuro's ds001545, PREVENT-AD and the HCP dataset) on two high-performance computing clusters. Our results demonstrate that Sea provides large speedups (up to 32x) when the performance of the shared file system (e.g., Lustre) is degraded. When the shared file system is not overburdened by other users, performance is unaffected by Sea, suggesting that Sea's overhead is minimal even in cases where its benefits are limited. Overall, Sea is beneficial even when the performance gain is minimal, because it can be used to limit the number of files created on parallel file systems.
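The redirection idea in the abstract can be illustrated with a small sketch. This is not Sea's implementation or API: Sea intercepts an application's read and write calls transparently, whereas the hypothetical `TieredStore` class below wraps `open` explicitly. The placement policy (write to the fastest tier, read from the fastest tier holding the file, flush everything to the shared tier at the end) is an assumption chosen for illustration only.

```python
import os
import shutil
import tempfile


class TieredStore:
    """Toy hierarchical store (hypothetical, not Sea's API).

    `tiers` is a list of directories ordered from fastest (e.g. local
    scratch) to slowest (e.g. a shared parallel file system).
    """

    def __init__(self, tiers):
        self.tiers = list(tiers)

    def resolve(self, relpath, mode="r"):
        # Plain reads: use the copy on the fastest tier that holds the file.
        if mode.startswith("r") and "+" not in mode:
            for tier in self.tiers:
                candidate = os.path.join(tier, relpath)
                if os.path.exists(candidate):
                    return candidate
            raise FileNotFoundError(relpath)
        # Everything else is treated as a write and redirected to the
        # fastest tier (simplified, assumed policy).
        target = os.path.join(self.tiers[0], relpath)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        return target

    def open(self, relpath, mode="r"):
        return open(self.resolve(relpath, mode), mode)

    def flush(self):
        # Move files from the faster tiers to the slowest (shared) tier.
        shared = self.tiers[-1]
        for tier in self.tiers[:-1]:
            for root, _dirs, files in os.walk(tier):
                for name in files:
                    src = os.path.join(root, name)
                    dst = os.path.join(shared, os.path.relpath(src, tier))
                    os.makedirs(os.path.dirname(dst), exist_ok=True)
                    shutil.move(src, dst)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as fast, \
            tempfile.TemporaryDirectory() as shared:
        store = TieredStore([fast, shared])
        with store.open("sub-01/intermediate.txt", "w") as f:
            f.write("intermediate result")
        # The write landed on the fast tier, not the shared one.
        print(os.path.exists(os.path.join(fast, "sub-01", "intermediate.txt")))
        store.flush()
        # After flushing, reads resolve to the shared tier.
        print(store.resolve("sub-01/intermediate.txt").startswith(shared))
```

In this sketch, intermediate files that are deleted before `flush` is called never reach the shared tier, which mirrors the abstract's point about limiting the number of files created on parallel file systems.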