Heracles: A Hybrid SSM-Transformer Model for High-Resolution Image and Time-Series Analysis (2403.18063v2)
Abstract: Transformers have revolutionized image modeling tasks with adaptations like DeiT, Swin, SVT, BiFormer, STVit, and FDViT. However, these models often face challenges with inductive bias and high quadratic complexity, making them less efficient for high-resolution images. State space models (SSMs) such as Mamba, V-Mamba, ViM, and SiMBA offer an alternative for handling high-resolution images in computer vision tasks. These SSMs encounter two major issues. First, they become unstable when scaled to large network sizes. Second, although they efficiently capture global information in images, they inherently struggle with handling local information. To address these challenges, we introduce Heracles, a novel SSM that integrates a local SSM, a global SSM, and an attention-based token interaction module. Heracles leverages a Hartley kernel-based state space model for global image information, a localized convolutional network for local details, and attention mechanisms in deeper layers for token interactions. Our extensive experiments demonstrate that Heracles-C-Small achieves state-of-the-art performance on the ImageNet dataset with 84.5\% top-1 accuracy. Heracles-C-Large and Heracles-C-Huge further improve accuracy to 85.9\% and 86.4\%, respectively. Additionally, Heracles excels in transfer learning tasks on datasets such as CIFAR-10, CIFAR-100, Oxford Flowers, and Stanford Cars, and in instance segmentation on the MSCOCO dataset. Heracles also proves its versatility by achieving state-of-the-art results on seven time-series datasets, showcasing its ability to generalize across domains with spectral data, capturing both local and global information. The project page is available at \url{https://github.com/badripatro/heracles}.
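The global branch described above mixes tokens in the Hartley spectral domain. As a rough illustration of the idea (not the paper's implementation), the sketch below computes the discrete Hartley transform via the FFT identity \(H(x) = \mathrm{Re}(F(x)) - \mathrm{Im}(F(x))\), applies an elementwise spectral filter standing in for the learned state-space kernel, and maps back using the fact that the DHT is its own inverse up to a factor of \(1/N\). The function names and the `spectral_filter` weights are hypothetical.

```python
import numpy as np

def dht(x, axis=0):
    """Discrete Hartley transform along `axis`, via H = Re(FFT) - Im(FFT)."""
    f = np.fft.fft(x, axis=axis)
    return f.real - f.imag

def hartley_token_mixer(tokens, spectral_filter):
    """Hypothetical sketch of global token mixing in the Hartley domain.

    tokens:          (num_tokens, dim) real array
    spectral_filter: (num_tokens, dim) real weights, a stand-in for the
                     learned global kernel described in the paper.
    """
    n = tokens.shape[0]
    spec = dht(tokens, axis=0)        # real-valued spectrum over the token axis
    mixed = spec * spectral_filter    # elementwise spectral filtering
    return dht(mixed, axis=0) / n     # DHT is an involution up to 1/N
```

Unlike an FFT-based mixer, the Hartley transform of a real input is real, so the filtered features never leave the real domain and no complex-valued parameters are needed.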
- Beit: Bert pre-training of image transformers. In International Conference on Learning Representations, 2022.
- Regionvit: Regional-to-local attention for vision transformers. In International Conference on Learning Representations, 2022a.
- Crossvit: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 357–366, 2021.
- Cyclemlp: A mlp-like architecture for dense prediction. In International Conference on Learning Representations, 2022b.
- Fast fourier convolution. Advances in Neural Information Processing Systems, 33:4479–4488, 2020.
- Twins: Revisiting the design of spatial attention in vision transformers. Advances in Neural Information Processing Systems, 34:9355–9366, 2021.
- Conditional positional encodings for vision transformers. In The Eleventh International Conference on Learning Representations, 2022.
- Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pp. 7480–7512. PMLR, 2023.
- Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. IEEE, 2009.
- Davit: Dual attention vision transformers. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIV, pp. 74–92. Springer, 2022.
- Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12124–12134, 2022.
- An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
- Convit: Improving vision transformers with soft convolutional inductive biases. In International Conference on Machine Learning, pp. 2286–2296. PMLR, 2021.
- Spectral neural operators. arXiv preprint arXiv:2205.10573, 2022.
- Efficient token mixing for transformers via adaptive fourier neural operators. In International Conference on Learning Representations, 2022.
- Cmt: Convolutional neural networks meet vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12175–12185, 2022a.
- Hire-mlp: Vision mlp via hierarchical rearrangement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 826–836, June 2022b.
- Transformer in transformer. Advances in Neural Information Processing Systems, 34:15908–15919, 2021.
- Hartley, R. A more symmetrical fourier analysis applied to transmission problems. Proceedings of the IRE, 30(3):144–150, 1942.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969, 2017.
- Vision permutator: A permutable mlp-like architecture for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141, 2018.
- Wavemix: Resource-efficient token mixing for images. arXiv preprint arXiv:2203.03689, 2022.
- All tokens matter: Token labeling for training better vision transformers. Advances in Neural Information Processing Systems, 34:18590–18602, 2021.
- 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pp. 554–561, 2013.
- Krizhevsky, A. et al. Learning multiple layers of features from tiny images. 2009.
- Fnet: Mixing tokens with fourier transforms. arXiv preprint arXiv:2105.03824, 2021.
- Local-to-global self-attention in vision transformers. arXiv preprint arXiv:2107.04735, 2021a.
- Uniformer: Unifying convolution and self-attention for visual recognition. arXiv preprint arXiv:2201.09450, 2022a.
- Localvit: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707, 2021b.
- Mvitv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4804–4814, 2022b.
- Contextual transformer networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022c.
- Fourier neural operator for parametric partial differential equations. In International Conference on Learning Representations, 2020.
- As-mlp: An axial shifted mlp architecture for vision. In International Conference on Learning Representations, 2022.
- Pay attention to mlps. Advances in Neural Information Processing Systems, 34:9204–9215, 2021a.
- Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022, 2021b.
- Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12009–12019, 2022.
- Fourierformer: Transformer meets generalized fourier integral theorem. In Advances in Neural Information Processing Systems, 2022.
- Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. IEEE, 2008.
- Fast vision transformers with hilo attention. In Advances in Neural Information Processing Systems, 2022a.
- Less is more: Pay less attention in vision transformers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 2035–2043, 2022b.
- Scattering vision transformer: Spectral mixing matters. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Spectformer: Frequency and attention is what you need in a vision transformer. arXiv preprint arXiv:2304.06446, 2023.
- Global filter networks for image classification. Advances in Neural Information Processing Systems, 34:980–993, 2021.
- Hornet: Efficient high-order spatial interactions with recursive gated convolutions. Advances in Neural Information Processing Systems, 35:10353–10366, 2022.
- Convolutional neural operators. arXiv preprint arXiv:2302.01178, 2023.
- Inception transformer. In Advances in Neural Information Processing Systems, 2022.
- Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16519–16529, 2021.
- Beyond AI exposure: Which tasks are cost-effective to automate with computer vision? 2024.
- An image patch is a wave: Phase-aware vision mlp. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10935–10944, 2022.
- Mlp-mixer: An all-mlp architecture for vision. Advances in neural information processing systems, 34:24261–24272, 2021.
- Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pp. 10347–10357. PMLR, 2021a.
- Going deeper with image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 32–42, 2021b.
- Resmlp: Feedforward networks for image classification with data-efficient training. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- Maxvit: Multi-axis vision transformer. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIV, pp. 459–479. Springer, 2022.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Scaled relu matters for training vision transformers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 2495–2503, 2022a.
- Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 568–578, 2021.
- Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 8(3):415–424, 2022b.
- Dynamixer: a vision mlp architecture with dynamic mixing. In International Conference on Machine Learning, pp. 22691–22701. PMLR, 2022c.
- Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22–31, 2021.
- Vision transformer with deformable attention. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4794–4803, 2022.
- Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500, 2017.
- Co-scale conv-attentional image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9981–9990, 2021.
- Focal self-attention for local-global interactions in vision transformers. arXiv preprint arXiv:2107.00641, 2021.
- Wave-vit: Unifying wavelet and transformers for visual representation learning. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXV, pp. 328–345. Springer, 2022.
- Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10819–10829, 2022.
- Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 558–567, 2021.
- Volo: Vision outlooker for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18123–18133, 2022.
- Resnest: Split-attention networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2736–2746, 2022.
- Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2998–3008, 2021.
- Deepvit: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886, 2021.