Adapting LLaMA Decoder to Vision Transformer (2404.06773v4)
Abstract: This work examines whether decoder-only Transformers such as LLaMA, originally designed for LLMs, can be adapted to the computer vision field. We first "LLaMAfy" a standard ViT step-by-step to align it with LLaMA's architecture, and find that directly applying a causal mask to the self-attention causes an attention-collapse issue that makes network training fail. To overcome this challenge, we propose a post-sequence class token technique that repositions the class token behind the image tokens, enabling causal self-attention to efficiently capture the entire image's information. Additionally, we develop a soft mask strategy that gradually introduces the causal mask into the self-attention at the onset of training to ease optimization (a sketch of both techniques follows this abstract). The tailored model, dubbed image LLaMA (iLLaMA), is akin to LLaMA in architecture and supports direct supervised learning. Its causal self-attention boosts computational efficiency and learns complex representations by elevating attention map ranks. iLLaMA rivals the performance of its encoder-only counterparts, achieving 75.1% ImageNet top-1 accuracy with only 5.7M parameters. Scaling the model to $\sim$310M parameters and pre-training on ImageNet-21K further improves accuracy to 86.0%. Extensive experiments demonstrate iLLaMA's reliable properties: shape-texture bias, calibration, quantization compatibility, ADE20K segmentation, and CIFAR transfer learning. We hope our study kindles fresh views on visual architectures in the wave of LLMs and inspires the development of unified multimodal models. Pre-trained models and code are available at https://github.com/techmonsterwang/iLLaMA.
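To make the two abstract-level ideas concrete, the sketch below illustrates a post-sequence class token (the class token is appended after the image tokens, so under causal attention it can still see every patch) and a soft mask that is annealed from fully bidirectional to strictly causal attention early in training. This is a minimal illustration under assumptions, not the authors' implementation: all function names, the single-head toy attention, and the linear annealing schedule are hypothetical.

```python
# Minimal sketch (not the iLLaMA codebase) of the post-sequence class token
# and a soft causal mask annealed in at the start of training.
import torch
import torch.nn.functional as F


def post_sequence_cls(image_tokens: torch.Tensor, cls_token: torch.Tensor) -> torch.Tensor:
    """Append the class token behind the image tokens: (B, N, C) -> (B, N+1, C)."""
    B = image_tokens.shape[0]
    return torch.cat([image_tokens, cls_token.expand(B, -1, -1)], dim=1)


def soft_causal_mask(seq_len: int, step: int, warmup_steps: int, device=None) -> torch.Tensor:
    """Additive attention mask that interpolates from all-zeros (bidirectional)
    to a large negative value above the diagonal (causal) over `warmup_steps`.
    The linear schedule is an assumption for illustration."""
    alpha = min(step / max(warmup_steps, 1), 1.0)  # 0 -> 1 during warm-up
    upper = torch.triu(torch.ones(seq_len, seq_len, device=device), diagonal=1)
    return upper * (-1e4 * alpha)  # -1e4 acts like -inf after softmax


def causal_self_attention(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Toy single-head self-attention with an additive mask (no projections)."""
    scores = x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5
    attn = F.softmax(scores + mask, dim=-1)
    return attn @ x


# Usage: because the class-token row is last, the causal mask still lets it
# attend to all image tokens; its output row is used for classification.
B, N, C = 2, 196, 192
tokens = post_sequence_cls(torch.randn(B, N, C), torch.zeros(1, 1, C))
mask = soft_causal_mask(tokens.shape[1], step=500, warmup_steps=1000)  # halfway annealed
out = causal_self_attention(tokens, mask)
cls_features = out[:, -1]  # (B, C)
```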