MMViT: Multiscale Multiview Vision Transformers (2305.00104v1)
Abstract: We present Multiscale Multiview Vision Transformers (MMViT), which introduce multiscale feature maps and multiview encodings into transformer models. Our model encodes different views of the input signal and builds several channel-resolution feature stages, processing the multiple views of the input at different resolutions in parallel. At each scale stage, we use a cross-attention block to fuse information across the different views. This enables the MMViT model to acquire complex, high-dimensional representations of the input at different resolutions. The proposed model can serve as a backbone in multiple domains. We demonstrate the effectiveness of MMViT on audio and image classification tasks, achieving state-of-the-art results.
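To make the abstract's core mechanism concrete, here is a minimal PyTorch sketch of a single scale stage: two views of the same input are encoded in parallel, a cross-attention block fuses information across the views, and tokens are then pooled while channels expand to form the next channel-resolution stage. All module names, dimensions, the pooling scheme, and the bidirectional fusion here are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch only; dimensions, pooling, and module layout are
# assumptions and do not reproduce the paper's exact MMViT design.
import torch
import torch.nn as nn


class CrossViewFusion(nn.Module):
    """Cross-attention fusion: queries from one view, keys/values from the other."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, view_a: torch.Tensor, view_b: torch.Tensor) -> torch.Tensor:
        # view_a, view_b: (batch, tokens, dim)
        q = self.norm_q(view_a)
        kv = self.norm_kv(view_b)
        fused, _ = self.attn(q, kv, kv)
        return view_a + fused  # residual connection


class ScaleStage(nn.Module):
    """One channel-resolution stage: per-view encoding, cross-view fusion,
    then token pooling plus channel expansion toward the next (coarser) scale."""

    def __init__(self, dim: int, out_dim: int, num_heads: int = 4):
        super().__init__()
        self.enc_a = nn.TransformerEncoderLayer(
            dim, num_heads, dim_feedforward=dim * 4, batch_first=True
        )
        self.enc_b = nn.TransformerEncoderLayer(
            dim, num_heads, dim_feedforward=dim * 4, batch_first=True
        )
        self.fuse_ab = CrossViewFusion(dim, num_heads)
        self.fuse_ba = CrossViewFusion(dim, num_heads)
        # Halve the token count and expand channels (a common multiscale pattern,
        # assumed here rather than taken from the paper).
        self.pool = nn.MaxPool1d(kernel_size=2)
        self.proj = nn.Linear(dim, out_dim)

    def _downsample(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)  # pool over tokens
        return self.proj(x)  # expand channels

    def forward(self, a: torch.Tensor, b: torch.Tensor):
        a, b = self.enc_a(a), self.enc_b(b)
        # Both fusion calls read the pre-fusion views (tuple RHS evaluates first).
        a, b = self.fuse_ab(a, b), self.fuse_ba(b, a)
        return self._downsample(a), self._downsample(b)


if __name__ == "__main__":
    # Two hypothetical views of the same signal, e.g. patch embeddings produced
    # with two different patch sizes, both projected to 64 channels.
    a = torch.randn(2, 64, 64)  # (batch, tokens, dim)
    b = torch.randn(2, 64, 64)
    stage1 = ScaleStage(dim=64, out_dim=128)
    stage2 = ScaleStage(dim=128, out_dim=256)
    a, b = stage1(a, b)
    a, b = stage2(a, b)
    print(a.shape, b.shape)  # torch.Size([2, 16, 256]) each
```

The fusion is applied in both directions so each view attends to the other; whether MMViT fuses symmetrically like this or designates one view as the query stream is a design detail this sketch does not settle.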
- Yuchen Liu
- Natasha Ong
- Kaiyan Peng
- Bo Xiong
- Qifan Wang
- Rui Hou
- Madian Khabsa
- Kaiyue Yang
- David Liu
- Donald S. Williamson
- Hanchao Yu