- The paper introduces FlashFFTConv, which uses a Monarch decomposition to recast the FFT as matrix multiplications that run on tensor cores.
- It adds two sparse convolution algorithms, partial convolutions and frequency-sparse convolutions, to save memory and compute.
- Empirically, FFT convolutions run up to 7.93× faster, and long-sequence models see notable gains in language model perplexity and GLUE scores.
FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores
The paper introduces FlashFFTConv, an optimized system for computing Fast Fourier Transform (FFT) convolutions in sequence models. Convolutional models with long filters have shown strong reasoning capabilities, but they often lag behind Transformers in wall-clock speed because current FFT implementations use modern hardware poorly. The paper addresses this gap with a suite of optimizations targeting modern accelerators.
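For context, an FFT convolution computes y = iFFT(FFT(u) · FFT(k)) in O(N log N) time rather than the O(N²) of direct long convolution. Here is a minimal PyTorch sketch of that baseline (the function name is ours, not the paper's API):

```python
import torch

def fft_conv(u: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Convolve input u (batch, L) with filter k (L,) via the FFT.

    Zero-padding to 2L turns the FFT's circular convolution into the
    linear convolution that sequence models need.
    """
    L = u.shape[-1]
    n = 2 * L
    u_f = torch.fft.rfft(u, n=n)           # input spectrum, (batch, n//2 + 1)
    k_f = torch.fft.rfft(k, n=n)           # filter spectrum
    y = torch.fft.irfft(u_f * k_f, n=n)    # pointwise product = convolution
    return y[..., :L]                      # keep the causal part

# Usage: y = fft_conv(torch.randn(4, 1024), torch.randn(1024))
```

The catch is that FFT butterfly stages map poorly onto tensor cores, which are built for dense matrix multiplication; that mismatch is what FlashFFTConv attacks.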
Key Contributions
FlashFFTConv is designed to remove the main bottleneck of FFT convolutions: poor hardware support that leaves them slower than comparable Transformer layers. A core contribution is the adoption of a Monarch decomposition of the FFT, which lets matrix multiplication units compute the transform. The decomposition is explicitly designed to maximize tensor core utilization, sidestepping the hardware-utilization problems of traditional FFT implementations.
Significant innovations include:
- Monarch Decomposition: The methodological centerpiece of FlashFFTConv, the Monarch decomposition rewrites the FFT as a sequence of matrix-matrix multiplications that tensor cores can accelerate. It also enables kernel fusion, cutting the heavy input/output (I/O) traffic typical of FFT convolutions. A minimal sketch of the matmul view follows this list.
- Sparse Convolution Algorithms: FlashFFTConv integrates two novel sparse convolution techniques, partial convolutions and frequency-sparse convolutions, which save memory and compute. Partial convolutions truncate the filter, and frequency-sparse convolutions zero out regions of the filter in the frequency domain, analogous to sparse attention; both are sketched in the second code example below.
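To make the matmul view concrete, here is a two-factor decomposition of a length-N FFT (N = N1·N2) into two dense matrix multiplications plus twiddle factors, the classic Cooley-Tukey structure that a Monarch factorization generalizes. This is a minimal NumPy illustration of the idea, not the paper's kernel:

```python
import numpy as np

def dft_matrix(n: int) -> np.ndarray:
    """Dense n x n DFT matrix F[j, k] = exp(-2*pi*i*j*k / n)."""
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-2j * np.pi * j * k / n)

def fft_as_matmuls(x: np.ndarray, n1: int, n2: int) -> np.ndarray:
    """Length-(n1*n2) FFT computed as two dense matmuls.

    Both matmuls are exactly the shape of work tensor cores are built for;
    FlashFFTConv additionally fuses these steps with the filter multiply.
    """
    n = n1 * n2
    assert x.shape[-1] == n
    # Lay x[a + n1*b] out as a matrix X[a, b] with a < n1, b < n2.
    x_mat = x.reshape(n2, n1).T
    # Inner DFT along the n2 axis, as a matmul.
    y = x_mat @ dft_matrix(n2)
    # Twiddle factors couple the two stages.
    a = np.arange(n1)[:, None]
    k2 = np.arange(n2)[None, :]
    y = y * np.exp(-2j * np.pi * a * k2 / n)
    # Outer DFT along the n1 axis, as a second matmul, then flatten.
    return (dft_matrix(n1) @ y).reshape(n)

# Sanity check against the library FFT:
x = np.random.randn(16) + 1j * np.random.randn(16)
assert np.allclose(fft_as_matmuls(x, 4, 4), np.fft.fft(x))
```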
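The two sparsity schemes can likewise be sketched on top of a plain FFT convolution. The function names and the 0/1 frequency mask below are illustrative assumptions; the paper's kernels skip the zeroed matmul tiles outright rather than multiplying by zero:

```python
import torch

def partial_fft_conv(u: torch.Tensor, k: torch.Tensor, keep: int) -> torch.Tensor:
    """Partial convolution: keep only the first `keep` filter taps."""
    L = u.shape[-1]
    n = 2 * L
    k_f = torch.fft.rfft(k[:keep], n=n)    # truncated filter, zero-padded
    return torch.fft.irfft(torch.fft.rfft(u, n=n) * k_f, n=n)[..., :L]

def freq_sparse_fft_conv(u: torch.Tensor, k: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Frequency-sparse convolution: zero filter frequencies where mask == 0."""
    L = u.shape[-1]
    n = 2 * L
    k_f = torch.fft.rfft(k, n=n) * mask    # mask: (n//2 + 1,) of zeros/ones
    return torch.fft.irfft(torch.fft.rfft(u, n=n) * k_f, n=n)[..., :L]

# Usage (hypothetical sizes): u = torch.randn(4, 1024); k = torch.randn(1024)
# y1 = partial_fft_conv(u, k, keep=256)
# y2 = freq_sparse_fft_conv(u, k, (torch.rand(1025) > 0.5).float())
```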
Numerical Results
The paper provides compelling experimental evidence of the enhancements offered by FlashFFTConv:
- Speed: FFT convolution operations realize speedups of up to 7.93× over the PyTorch baseline, while end-to-end sequence modeling tasks achieve speedups up to 4.4×.
- Quality: Under a fixed compute budget, FlashFFTConv improves language model perplexity by 2.3 points and lifts GLUE scores by 3.3 points over models of comparable size.
- Memory Efficiency: Partial convolutions let models scale to far longer sequences, including the longest human genes (2.3M base pairs), a notable gain in scalability.
Implications and Future Directions
FlashFFTConv represents a significant optimization over existing FFT convolution methods, particularly for long-sequence tasks. By improving computational efficiency and reducing memory demands, it makes better use of existing hardware and broadens access to high-capacity sequence models. The work can drive future research on hardware-efficient models, particularly where the trade-off between sequence length and model quality matters, as in genomic sequencing or time-series analysis.
Furthermore, the introduction of sparse convolution techniques within the FFT framework broadens the scope for future algorithmic exploration, especially in domains demanding high efficiency and scalability. The methodological advances in FlashFFTConv promise to underpin further innovations in AI model design, potentially fostering new attention paradigms and convolutional techniques tailored for state-of-the-art hardware platforms.
In conclusion, FlashFFTConv constitutes a strategic advance in the optimization of convolutions for sequence modeling, paving the way for longer, faster, and more efficient models in machine learning and its application domains.