HSViT: Horizontally Scalable Vision Transformer (2404.05196v2)
Abstract: Due to its lack of prior knowledge (inductive bias), the Vision Transformer (ViT) requires pre-training on large-scale datasets to perform well. Moreover, the growing number of layers and parameters in ViT models impedes their applicability to devices with limited computing resources. To mitigate these challenges, this paper introduces a novel Horizontally Scalable Vision Transformer (HSViT) scheme. Specifically, a novel image-level feature embedding is introduced to ViT; the preserved inductive bias allows the model to forgo pre-training while still outperforming prior schemes on small datasets. Besides, a novel horizontally scalable architecture is designed, facilitating collaborative model training and inference across multiple computing devices. Experimental results show that, without pre-training, HSViT achieves up to 10% higher top-1 accuracy than state-of-the-art schemes on small datasets, and improves the top-1 accuracy of existing CNN backbones by up to 3.1% on ImageNet. The code is available at https://github.com/xuchenhao001/HSViT.
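To make the abstract's first contribution concrete, here is a minimal PyTorch sketch of an image-level feature embedding, under the following reading of the abstract: each convolutional kernel yields a feature map over the whole image, each pooled feature map becomes one attention token, and self-attention therefore mixes image-level features rather than ViT-style patch tokens, preserving the convolution's inductive bias (locality, translation equivariance). All names (`ImageLevelEmbedding`, `token_grid`, the classifier head) are illustrative assumptions, not the authors' code; see the linked repository for the actual implementation.

```python
import torch
import torch.nn as nn

class ImageLevelEmbedding(nn.Module):
    """One token per conv kernel: each token summarizes the whole image
    as seen by that kernel, unlike ViT's patch-level tokens."""
    def __init__(self, in_ch=3, num_kernels=64, token_grid=8):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, num_kernels, kernel_size=3, stride=2, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(token_grid)  # each map -> token_grid x token_grid

    def forward(self, x):            # x: (B, 3, H, W)
        f = self.pool(self.conv(x))  # (B, K, g, g)
        return f.flatten(2)          # (B, K, g*g): K image-level tokens

class HSViTSketch(nn.Module):
    def __init__(self, num_kernels=64, token_grid=8, heads=4, num_classes=10):
        super().__init__()
        d = token_grid * token_grid
        self.embed = ImageLevelEmbedding(num_kernels=num_kernels, token_grid=token_grid)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.head = nn.Linear(d, num_classes)

    def forward(self, x):
        tokens = self.embed(x)                      # (B, K, d)
        out, _ = self.attn(tokens, tokens, tokens)  # attention across feature maps
        return self.head(out.mean(dim=1))           # pool tokens -> class logits

logits = HSViTSketch()(torch.randn(2, 3, 224, 224))  # -> shape (2, 10)
```

Because the conv stem, rather than learned positional embeddings, carries the spatial structure here, the model does not need pre-training-scale data to acquire locality, which is the abstract's stated rationale for strong small-dataset performance.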
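The second contribution, horizontal scalability, can be sketched under a further assumption: if the convolutional kernels and their attention computation are partitioned into independent groups ("shards"), each group can run on a separate device, with only compact per-shard summaries aggregated at the end. The sketch below simulates this on one device; the shard sizes, aggregation head, and all identifiers are hypothetical.

```python
import torch
import torch.nn as nn

class HorizontalShard(nn.Module):
    """A subset of conv kernels plus its own attention: self-contained,
    so each shard could be placed on a different device."""
    def __init__(self, kernels_per_shard=16, token_grid=8, heads=4):
        super().__init__()
        d = token_grid * token_grid
        self.conv = nn.Conv2d(3, kernels_per_shard, 3, stride=2, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(token_grid)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x):
        tokens = self.pool(self.conv(x)).flatten(2)  # (B, K/G, d)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.mean(dim=1)                       # (B, d): compact shard summary

class HSViTHorizontal(nn.Module):
    def __init__(self, num_shards=4, kernels_per_shard=16, token_grid=8, num_classes=10):
        super().__init__()
        self.shards = nn.ModuleList(
            HorizontalShard(kernels_per_shard, token_grid) for _ in range(num_shards)
        )
        self.head = nn.Linear(token_grid * token_grid * num_shards, num_classes)

    def forward(self, x):
        # Shards share no state, so each call below could execute on its own
        # device in parallel; only the small summaries cross device boundaries.
        feats = [shard(x) for shard in self.shards]
        return self.head(torch.cat(feats, dim=-1))

logits = HSViTHorizontal()(torch.randn(2, 3, 224, 224))  # -> shape (2, 10)
```

Scaling out then means adding shards (more devices) rather than stacking layers, which is what distinguishes this horizontal scheme from the usual vertical growth of ViT depth.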
Authors: Chenhao Xu, Chang-Tsun Li, Chee Peng Lim, Douglas Creighton