Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems (1907.12931v1)

Published 27 Jul 2019 in cs.DC, cs.CE, cs.PF, and q-bio.GN

Abstract: Innovations in Next-Generation Sequencing are enabling generation of DNA sequence data at ever faster rates and at very low cost. Large sequencing centers typically employ hundreds of such systems. Such high-throughput and low-cost generation of data underscores the need for commensurate acceleration in downstream computational analysis of the sequencing data. A fundamental step in downstream analysis is mapping of the reads to a long reference DNA sequence, such as a reference human genome. Sequence mapping is a compute-intensive step that accounts for more than 30% of the overall time of the GATK workflow. BWA-MEM is one of the most widely used tools for sequence mapping and has tens of thousands of users. In this work, we focus on accelerating BWA-MEM through an efficient architecture aware implementation, while maintaining identical output. The volume of data requires distributed computing environment, usually deploying multicore processors. Since the application can be easily parallelized for distributed memory systems, we focus on performance improvements on a single socket multicore processor. BWA-MEM run time is dominated by three kernels, collectively responsible for more than 85% of the overall compute time. We improved the performance of these kernels by 1) improving cache reuse, 2) simplifying the algorithms, 3) replacing small fragmented memory allocations with a few large contiguous ones, 4) software prefetching, and 5) SIMD utilization wherever applicable - and massive reorganization of the source code enabling these improvements. As a result, we achieved nearly 2x, 183x, and 8x speedups on the three kernels, respectively, resulting in up to 3.5x and 2.4x speedups on end-to-end compute time over the original BWA-MEM on single thread and single socket of Intel Xeon Skylake processor. To the best of our knowledge, this is the highest reported speedup over BWA-MEM.

Citations (1,302)

Summary

The paper introduces architecture-aware optimizations for the three dominant kernels of BWA-MEM while keeping the output identical to the original.
Better cache reuse, software prefetching, contiguous memory allocation, and SIMD vectorization yield kernel speedups of nearly 2×, 183×, and 8×.
End-to-end compute time improves by up to 3.5× on a single thread and 2.4× on a single socket, so the optimized version can replace BWA-MEM in existing pipelines without downstream changes.

Overview

The paper "Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems" describes how to optimize BWA-MEM, a widely used sequence mapping tool, for multicore processors. Authored by Vasimuddin Md, Sanchit Misra, Heng Li, and Srinivas Aluru, the work improves performance without altering the software's output, which makes the result a drop-in replacement for existing users. The authors target three kernels of BWA-MEM: Super-Maximal Exact Match (SMEM) search, Suffix Array Lookup (SAL), and the Banded Smith-Waterman (BSW) algorithm. Together these kernels account for more than 85% of the overall compute time of sequence mapping.

Key Contributions

The paper introduces several architecture-aware optimizations to enhance the performance of these kernels and consequently the entire BWA-MEM application:

SMEM Kernel Optimization:
- Cache Reuse and Memory Access: The authors employ techniques such as minimizing small memory allocations and utilizing larger contiguous memory blocks to improve cache reuse and hardware prefetching.
- Simplification and Prefetching: The kernel's algorithm is simplified, and software prefetching is used to hide memory latency. Together these changes give a nearly $2\times$ speedup in the SMEM kernel.
SAL Kernel Optimization:
- Optimization for Suffix Arrays: The original BWA-MEM stores a compressed suffix array, so most lookups must reconstruct the requested entry. The authors replace it with an uncompressed suffix array, which turns each lookup into a direct array read at the cost of a larger memory footprint. This change yields a $183\times$ speedup in the SAL kernel.
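
To illustrate the trade-off behind the SAL change, here is a minimal Python sketch on a toy string; it is not BWA-MEM's actual data layout. It contrasts a sampled suffix array, where a missing entry is recovered by walking the LF-mapping until a stored row is reached, with a full suffix array, where a lookup is a single read. The function names, the toy text, and the sampling interval of 4 are assumptions made for this example.

```python
def build_suffix_array(text):
    """Naive suffix array construction; adequate only for a toy string."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lf(text, sa):
    """LF-mapping: lf[row] is the row of the suffix starting one position earlier.

    A real FM-index computes each step from occurrence counts; this toy
    version precomputes the whole table for clarity.
    """
    # The BWT character of a row is the character preceding that suffix (wrapping).
    bwt = [text[i - 1] for i in sa]
    counts, rank = {}, []
    for c in bwt:
        rank.append(counts.get(c, 0))      # occurrences of c before this row
        counts[c] = counts.get(c, 0) + 1
    first, total = {}, 0
    for c in sorted(counts):               # first row whose suffix starts with c
        first[c] = total
        total += counts[c]
    return [first[c] + r for c, r in zip(bwt, rank)]


def sampled_lookup(row, lf, samples, interval, n):
    """Recover SA[row] from every interval-th stored row; returns (position, steps)."""
    steps = 0
    while row % interval != 0:             # not stored: move one text position back
        row = lf[row]
        steps += 1
    return (samples[row // interval] + steps) % n, steps


def full_lookup(row, sa):
    """Uncompressed suffix array: one array read."""
    return sa[row]


TEXT = "GATTACAGATTACA$"                   # hypothetical reference; "$" is the sentinel
SA = build_suffix_array(TEXT)
LF = build_lf(TEXT, SA)
INTERVAL = 4                               # hypothetical sampling interval
SAMPLES = SA[::INTERVAL]

for row in range(len(TEXT)):
    pos, steps = sampled_lookup(row, LF, SAMPLES, INTERVAL, len(TEXT))
    assert pos == full_lookup(row, SA)     # both methods agree on every row
    print(row, pos, steps)
```

In a real index, each step of the walk is a dependent memory access, so replacing the walk with a direct read trades a larger memory footprint for much cheaper lookups.
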
BSW Kernel Optimization:
- SIMD Utilization: The BSW algorithm is vectorized with Single Instruction Multiple Data (SIMD) instructions. Because matrix sizes and the amount of computation vary from one alignment to the next, the authors use inter-task vectorization, in which each SIMD lane processes a different sequence pair. This yields an $8\times$ speedup in the BSW kernel.
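
The idea of inter-task vectorization can be sketched in plain Python by treating a list index as a SIMD lane. The sketch below is a deliberate simplification and not the paper's kernel: it computes an unbanded local alignment score with a linear gap penalty, and the scoring constants, function names, and padding characters are all assumptions. Each lane holds a different sequence pair, and one inner step updates the same matrix cell in every lane.

```python
MATCH, MISMATCH, GAP = 1, -4, -6   # hypothetical scoring parameters


def sw_score(query, ref):
    """Scalar baseline: best local alignment score for one pair."""
    prev = [0] * (len(ref) + 1)
    best = 0
    for i in range(1, len(query) + 1):
        cur = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            s = MATCH if query[i - 1] == ref[j - 1] else MISMATCH
            cur[j] = max(0, prev[j - 1] + s, prev[j] + GAP, cur[j - 1] + GAP)
            best = max(best, cur[j])
        prev = cur
    return best


def sw_scores_lockstep(pairs):
    """Score several pairs at once; lane k of every cell belongs to pairs[k]."""
    lanes = len(pairs)
    qlen = max(len(q) for q, _ in pairs)
    rlen = max(len(r) for _, r in pairs)
    # Pad to a common shape with filler characters that never match anything,
    # so padded cells can only lose score and never change a lane's maximum.
    qs = [q.ljust(qlen, "?") for q, _ in pairs]
    rs = [r.ljust(rlen, "#") for _, r in pairs]
    prev = [[0] * lanes for _ in range(rlen + 1)]
    best = [0] * lanes
    for i in range(1, qlen + 1):
        cur = [[0] * lanes for _ in range(rlen + 1)]
        for j in range(1, rlen + 1):
            # Stand-in for one vector operation: cell (i, j) in all lanes.
            cur[j] = [
                max(
                    0,
                    prev[j - 1][k] + (MATCH if qs[k][i - 1] == rs[k][j - 1] else MISMATCH),
                    prev[j][k] + GAP,
                    cur[j - 1][k] + GAP,
                )
                for k in range(lanes)
            ]
            best = [max(b, c) for b, c in zip(best, cur[j])]
        prev = cur
    return best


pairs = [("ACGT", "ACGT"), ("GATTACA", "TTAC"), ("AAAA", "CCCCCC")]
assert sw_scores_lockstep(pairs) == [sw_score(q, r) for q, r in pairs]
```

Padding is what makes the lanes uniform, and it is also where work is wasted when the pairs in a batch differ widely in length.
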
Code Reorganization and Memory Management:
- Workflow Reorganization: BWA-MEM's workflow is reorganized so that a batch of reads passes through each computation stage together before the next batch starts, instead of each read being processed end to end before the next read begins.
- Memory Management: Small, fragmented memory allocations are consolidated into a few large contiguous blocks, which improves hardware prefetch efficiency and cache reuse.
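
The workflow reorganization amounts to swapping the loop order between reads and stages. The following sketch shows the two orders producing the same result; the three stage functions are trivial placeholders rather than the real SMEM, SAL, and BSW kernels, and every name and the batch size are assumptions for illustration.

```python
REFERENCE = "GATTACAGATTACA"   # hypothetical toy reference


def seed(read):
    """Placeholder for seeding: split a read into 3-mers."""
    return [read[i:i + 3] for i in range(len(read) - 2)]


def locate(seeds):
    """Placeholder for lookup: first position of each seed, or -1."""
    return [REFERENCE.find(s) for s in seeds]


def extend(hits):
    """Placeholder for extension: reduce the hits to one value."""
    return max(hits)


STAGES = [seed, locate, extend]


def run_read_major(reads, stages):
    """Original order: finish all stages for one read, then move on."""
    results = []
    for read in reads:
        value = read
        for stage in stages:
            value = stage(value)
        results.append(value)
    return results


def run_stage_major(reads, stages, batch_size):
    """Reorganized order: run one stage over a whole batch before the next stage."""
    results = []
    for start in range(0, len(reads), batch_size):
        batch = list(reads[start:start + batch_size])
        for stage in stages:
            batch = [stage(item) for item in batch]
        results.extend(batch)
    return results


reads = ["GATT", "TACA", "ACAG", "TTAC", "CAGA"]
assert run_read_major(reads, STAGES) == run_stage_major(reads, STAGES, batch_size=2)
```

Running one stage over many reads keeps that stage's code and data in cache and hands each kernel many independent tasks at once, which is what inter-task vectorization needs.
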

Results and Implications

The authors report substantial improvements for each kernel and for the application as a whole. Compared with the original BWA-MEM, end-to-end compute time improves by up to $3.5\times$ on a single thread and up to $2.4\times$ on a single socket of an Intel Xeon Skylake processor. To the best of the authors' knowledge, this was the highest reported speedup over BWA-MEM at the time of publication.
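
An Amdahl-style estimate helps explain why a $183\times$ kernel speedup does not translate into a comparable end-to-end gain. The sketch below is only a back-of-the-envelope model: the per-kernel shares of run time are hypothetical (the summary reports only that the three kernels together exceed 85%), and the model assumes the rest of the program is unchanged.

```python
def end_to_end_speedup(fractions, speedups):
    """Amdahl-style estimate.

    fractions: share of the original run time spent in each accelerated part.
    speedups:  speedup achieved on each of those parts.
    """
    if len(fractions) != len(speedups):
        raise ValueError("fractions and speedups must have the same length")
    untouched = 1.0 - sum(fractions)
    new_time = untouched + sum(f / s for f, s in zip(fractions, speedups))
    return 1.0 / new_time


# Kernel speedups as reported: nearly 2x (SMEM), 183x (SAL), 8x (BSW).
KERNEL_SPEEDUPS = [2.0, 183.0, 8.0]
# Hypothetical split of run time across the three kernels, assumed to total 85%.
# These shares are NOT figures from the paper.
HYPOTHETICAL_SHARES = [0.40, 0.15, 0.30]

print(round(end_to_end_speedup(HYPOTHETICAL_SHARES, KERNEL_SPEEDUPS), 2))
```

Whatever the actual split, the kernel with the smallest speedup and the unaccelerated remainder dominate the new run time, so the end-to-end figure stays far below the largest kernel speedup.
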

Insights and Future Directions

The work has implications for both practical deployment and further optimization research in bioinformatics:

Practical Deployment:
- The optimized BWA-MEM can be deployed in high-throughput sequencing environments to reduce compute time and the resources needed for read mapping.
- With identical output maintained, the improved BWA-MEM can seamlessly replace the existing version without requiring adaptations in downstream workflows.
Theoretical Advances:
- These optimizations underline the potential of architecture-aware enhancements, especially for irregular and data-intensive applications like sequence mapping.
- Future work could further explore memory latency reduction strategies and instruction count optimization, particularly for the SMEM and BSW kernels.

Conclusion

The paper addresses the computational bottlenecks in BWA-MEM through architecture-aware optimizations and substantial refactoring of its kernel routines. The resulting performance gains make sequence mapping faster and more efficient, and they highlight the value of tailoring implementations to the hardware in bioinformatics applications. As sequencing throughput continues to grow, such work will be important for keeping analysis in step with data generation.

The implementation is available as open source, which supports wider adoption and further optimization.
