An Expert Overview of "Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning"
The paper "Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning" presents a novel approach to enhancing the efficiency of Vision Transformers (ViTs) by integrating wavelet transforms. This work addresses a fundamental computational challenge with Transformers in visual representation learning—the quadratic scaling of self-attention computation with respect to the number of input patches. The proposed solution, Wavelet Vision Transformer (Wave-ViT), leverages wavelet theory to introduce invertible down-sampling within the Transformer framework, ensuring information is preserved while reducing computational costs.
Theoretical and Architectural Contributions
Wave-ViT introduces a new architectural component, termed the Wavelets block, that integrates the Discrete Wavelet Transform (DWT) with Transformer self-attention. The design rests on the observation that conventional down-sampling operations, such as average pooling, discard information, in particular the high-frequency components that capture textural detail. The paper argues that wavelet transforms instead enable lossless down-sampling: essential image detail is preserved while the number of tokens over which self-attention is computed is reduced.
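To illustrate why wavelet down-sampling can be lossless, the sketch below implements a single-level 2D Haar DWT and its inverse in PyTorch and checks perfect reconstruction. This is a generic illustration of the underlying transform, not the authors' implementation; the function names and the choice of the Haar basis are assumptions made here for clarity.

```python
import torch

def haar_dwt2d(x):
    """Single-level 2D Haar DWT: split (B, C, H, W) into four (B, C, H/2, W/2) subbands."""
    a = x[:, :, 0::2, 0::2]   # top-left pixel of each 2x2 block
    b = x[:, :, 0::2, 1::2]   # top-right
    c = x[:, :, 1::2, 0::2]   # bottom-left
    d = x[:, :, 1::2, 1::2]   # bottom-right
    ll = (a + b + c + d) / 2  # low-frequency approximation
    lh = (a - b + c - d) / 2  # three high-frequency detail subbands
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

def haar_idwt2d(ll, lh, hl, hh):
    """Inverse Haar DWT: exactly reconstruct the original (B, C, H, W) tensor."""
    a = (ll + lh + hl + hh) / 2
    b = (ll - lh + hl - hh) / 2
    c = (ll + lh - hl - hh) / 2
    d = (ll - lh - hl + hh) / 2
    B, C, h, w = ll.shape
    x = torch.empty(B, C, 2 * h, 2 * w, dtype=ll.dtype, device=ll.device)
    x[:, :, 0::2, 0::2] = a
    x[:, :, 0::2, 1::2] = b
    x[:, :, 1::2, 0::2] = c
    x[:, :, 1::2, 1::2] = d
    return x

x = torch.randn(1, 3, 8, 8)
print(torch.allclose(haar_idwt2d(*haar_dwt2d(x)), x, atol=1e-6))  # True: nothing is lost
```

Average pooling keeps only something akin to the LL subband, whereas the DWT retains the three detail subbands as well, which is exactly why the transform can be inverted without loss.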
The Wavelets block applies DWT to decompose the features from which keys and values are derived into four half-resolution subbands, which are then processed with convolutions to impose spatial locality. After the self-attention computation, an inverse DWT reconstructs a high-resolution feature map that is fused with the self-attention output. This design preserves image detail and, through the enlarged receptive field, yields richer feature contextualization.
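A hedged sketch of how such a block might be wired is given below: queries stay at full resolution, keys and values are computed from the DWT-down-sampled subbands, and a convolved high-frequency branch is inverted back to full resolution and fused with the attention output. It reuses haar_dwt2d/haar_idwt2d from the previous sketch; the layer names, dimensions, and the use of PyTorch's built-in scaled_dot_product_attention are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class WaveletAttention(nn.Module):
    """Illustrative wavelet-down-sampled self-attention (assumes H, W even and dim divisible by num_heads)."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.q = nn.Linear(dim, dim)
        # Keys/values come from the DWT-down-sampled map: the four subbands are
        # stacked channel-wise (4*dim) and reduced back to dim with a 1x1 conv.
        self.reduce = nn.Conv2d(4 * dim, dim, kernel_size=1)
        self.kv = nn.Linear(dim, 2 * dim)
        # Per-subband 3x3 convolution imposing spatial locality on the value branch.
        self.local = nn.Conv2d(4 * dim, 4 * dim, kernel_size=3, padding=1, groups=4)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.q(x.flatten(2).transpose(1, 2))            # queries at full resolution: (B, H*W, C)

        subbands = haar_dwt2d(x)                             # four (B, C, H/2, W/2) maps
        ds = torch.cat(subbands, dim=1)                      # (B, 4C, H/2, W/2)
        kv = self.kv(self.reduce(ds).flatten(2).transpose(1, 2))  # (B, H*W/4, 2C)
        k, v = kv.chunk(2, dim=-1)

        def split_heads(t):
            return t.reshape(B, -1, self.num_heads, C // self.num_heads).transpose(1, 2)

        # Multi-head attention with full-resolution queries and down-sampled keys/values.
        attn = nn.functional.scaled_dot_product_attention(split_heads(q), split_heads(k), split_heads(v))
        attn = attn.transpose(1, 2).reshape(B, H * W, C)

        # High-frequency branch: convolve the subbands, invert the DWT back to
        # full resolution, and fuse the result with the attention output.
        hi = haar_idwt2d(*self.local(ds).chunk(4, dim=1)).flatten(2).transpose(1, 2)
        return self.proj(attn + hi)
```

Applied to a (2, 64, 16, 16) feature map with dim=64, the block attends over 64 down-sampled key/value positions instead of 256, while the inverse-DWT branch keeps the output at full resolution.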
Empirical Results
The empirical evaluation demonstrates clear gains over state-of-the-art ViT backbones across diverse vision tasks, including image recognition, object detection, and instance segmentation. Wave-ViT reaches 85.5% top-1 accuracy on ImageNet, an absolute improvement of 1.7% over the Pyramid Vision Transformer (PVT). For object detection and instance segmentation on COCO, it reports 1.3% and 0.5% absolute mAP gains, respectively, while using 25.9% fewer parameters than PVT.
The paper makes strong claims about improving the trade-off between computational efficiency and model accuracy, attributing the improvement to the wavelet-based multi-scale architecture, and substantiates these claims with extensive experiments across several model sizes and benchmark tasks.
Implications and Future Directions
The integration of wavelet transforms into Transformer architectures opens new avenues for research on efficient visual representation learning. The immediate practical appeal of Wave-ViT is that it can strengthen the representational capacity of vision models while conserving computational resources, which makes it particularly attractive for high-resolution image processing tasks.
Future developments could explore the application of wavelet-augmented Transformers in other modalities, such as audio and time-series data, where multi-scale and frequency-preserving analyses are crucial. Moreover, extending this framework to support even finer granularity of multi-scale feature extraction and attention could drive further improvements in visual understanding tasks.
In conclusion, the "Wave-ViT" paper enriches the field of visual representation learning by unifying wavelet transforms with Transformer architectures, presenting a compelling case for this hybrid approach both theoretically and empirically. Its contributions not only demonstrate strong numerical results but also lay the groundwork for future innovations in AI model design and efficiency.