
Adapting LLaMA Decoder to Vision Transformer (2404.06773v4)

Published 10 Apr 2024 in cs.CV

Abstract: This work examines whether decoder-only Transformers such as LLaMA, originally designed for LLMs, can be adapted to the computer vision field. We first "LLaMAfy" a standard ViT step-by-step to align with LLaMA's architecture, and find that directly applying a causal mask to the self-attention brings an attention collapse issue, resulting in failure of network training. We suggest repositioning the class token behind the image tokens with a post-sequence class token technique to overcome this challenge, enabling causal self-attention to efficiently capture the entire image's information. Additionally, we develop a soft mask strategy that gradually introduces a causal mask to the self-attention at the onset of training to facilitate optimization. The tailored model, dubbed image LLaMA (iLLaMA), is akin to LLaMA in architecture and enables direct supervised learning. Its causal self-attention boosts computational efficiency and learns complex representations by elevating attention map ranks. iLLaMA rivals the performance of its encoder-only counterparts, achieving 75.1% ImageNet top-1 accuracy with only 5.7M parameters. Scaling the model to ~310M parameters and pre-training on ImageNet-21K further enhances the accuracy to 86.0%. Extensive experiments demonstrate iLLaMA's reliable properties: shape-texture bias, calibration, quantization compatibility, ADE20K segmentation, and CIFAR transfer learning. We hope our study can kindle fresh views of visual architectures in the wave of LLMs and inspire the development of unified multimodal models. Pre-trained models and code are available at https://github.com/techmonsterwang/iLLaMA.


Summary

  • The paper introduces iLLaMA, a novel adaptation of decoder-only LLaMA for computer vision that achieves 75.1% top-1 accuracy on ImageNet with only 5.7M parameters.
  • The methodology leverages architectural modifications like SwiGLU in FFNs and a hybrid positional embedding strategy to enhance model performance and efficiency.
  • The study implements a soft mask training strategy to emulate human visual focus, ensuring smoother optimization and improved representation learning.

Adapting Decoder-Only Transformers for Computer Vision: The Advent of iLLaMA

Introduction

The convergence of LLMs and computer vision holds promising avenues for redefining model architectures in vision tasks. The research introduces "image LLaMA" (iLLaMA), an adaptation of the decoder-only Transformers typically used in LLMs to visual perception. By tailoring the LLaMA architecture to image processing, iLLaMA addresses the architectural misalignment between textual and visual models, leveraging causal self-attention to enhance computational efficiency and representation learning.

Methodology

Architectural Modifications

iLLaMA's development involved several critical architectural adjustments to align with the LLaMA structure while addressing unique challenges in visual data processing; a brief code sketch of these components follows the list:

  • Feed-Forward Network (FFN) adjustments revealed that replacing MLPs with SwiGLU, at comparable computational cost, significantly boosts performance.
  • Normalization changes, replacing layer normalization (LN) with RMSNorm, showed a trade-off between complexity and accuracy across model sizes.
  • Implementing causal self-attention presented a unique set of challenges, including attention collapse. A post-sequence class token technique and a modified causal mask effectively counter these issues.
  • Positional embedding adaptations demonstrated that combining learnable positional embeddings (LPE) with rotary positional embeddings (RoPE) enhances accuracy, suggesting synergy between the two approaches.
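
The sketch below (PyTorch) shows how these pieces can fit together: RMSNorm in place of LayerNorm, a SwiGLU FFN, causal self-attention, and a class token appended after the patch tokens so that, under the causal mask, it can still attend to every image token. Module names, dimensions, and hyperparameters are assumptions chosen for illustration, not the authors' released implementation.

```python
# Minimal iLLaMA-style block sketch (assumed structure, not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Normalize by the root mean square only (no mean subtraction, no bias).
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


class SwiGLU(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_value = nn.Linear(dim, hidden_dim, bias=False)
        self.w_out = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w_out(F.silu(self.w_gate(x)) * self.w_value(x))


class CausalBlock(nn.Module):
    def __init__(self, dim, num_heads, mlp_ratio=8 / 3):
        super().__init__()
        self.norm1 = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = RMSNorm(dim)
        self.ffn = SwiGLU(dim, int(dim * mlp_ratio))

    def forward(self, x, attn_mask):
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + a
        return x + self.ffn(self.norm2(x))


def post_sequence_tokens(patch_tokens, cls_token):
    # Post-sequence class token: append the class token AFTER the image tokens,
    # so under a causal mask it can still see every patch.
    b = patch_tokens.shape[0]
    return torch.cat([patch_tokens, cls_token.expand(b, -1, -1)], dim=1)


# Usage sketch: 196 patch tokens + 1 trailing class token, additive causal mask.
tokens = post_sequence_tokens(torch.randn(2, 196, 192), torch.zeros(1, 1, 192))
mask = torch.triu(torch.full((197, 197), float("-inf")), diagonal=1)
block = CausalBlock(dim=192, num_heads=3)
out = block(tokens, mask)
cls_features = out[:, -1]  # only the last token is guaranteed to see the whole image
```

Because attention here is strictly left-to-right, only the trailing class token is guaranteed visibility over all image tokens, which is why the classification head reads the last position rather than the first.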

Training Technique Innovations

The paper also introduces a soft mask strategy aimed at stabilizing the early training phase. Inspired by human visual focus mechanisms, the approach gradually transitions from bi-directional to causal self-attention, mirroring the progressive sharpening of attentional focus. This yields a smoother optimization landscape and improves both initial training behavior and final model performance.
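
A minimal sketch of such a schedule is shown below. The specific annealing rule (the log of a linear blend between an all-ones mask and the binary causal mask, ramped over the first epochs) is an assumption made for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a soft causal-mask schedule (assumed annealing rule).
import torch


def soft_causal_mask(seq_len: int, progress: float) -> torch.Tensor:
    """Additive attention bias; `progress` goes from 0.0 to 1.0.

    progress = 0.0 -> all zeros (plain bi-directional attention)
    progress = 1.0 -> -inf above the diagonal (standard hard causal mask)
    """
    allowed = torch.tril(torch.ones(seq_len, seq_len))           # 1 = visible
    blend = (1.0 - progress) * torch.ones_like(allowed) + progress * allowed
    return torch.log(blend.clamp(min=0.0))                       # log(0) = -inf


# Example: anneal over the first 20 epochs, then keep the hard causal mask.
for epoch in range(100):
    progress = min(epoch / 20.0, 1.0)
    attn_bias = soft_causal_mask(197, progress)
    # ... pass `attn_bias` as the attention mask for this epoch's batches ...
```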

Experimental Insights

Computational Efficiency and Representation Learning

iLLaMA demonstrates notable gains in computational efficiency through its tailored causal self-attention. Furthermore, an analysis of attention map ranks provides empirical evidence of iLLaMA's capacity to learn complex image representations, underscoring its potential to capture intricate patterns with higher fidelity than its encoder-only counterparts.
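
As a rough illustration of this kind of diagnostic, the snippet below (an assumption, not the paper's analysis code) computes the numerical rank of per-head attention maps with and without a causal mask. On random features both variants are typically full rank; the interesting comparison is on a trained model's attention maps, where the paper argues the causal variant yields higher ranks.

```python
# Illustrative attention-rank diagnostic (assumed, not the official analysis).
import torch
import torch.nn.functional as F


def attention_map_rank(q, k, causal=False):
    # q, k: (batch, heads, tokens, head_dim)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    if causal:
        n = scores.shape[-1]
        future = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    attn = F.softmax(scores, dim=-1)          # (batch, heads, tokens, tokens)
    return torch.linalg.matrix_rank(attn)     # rank per (batch, head)


q, k = torch.randn(2, 3, 197, 64), torch.randn(2, 3, 197, 64)
print("bi-directional:", attention_map_rank(q, k, causal=False).float().mean().item())
print("causal:        ", attention_map_rank(q, k, causal=True).float().mean().item())
```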

Model Evaluation and Benchmarking

Extensive benchmarking across multiple datasets affirmed iLLaMA's competitiveness, showcasing remarkable performance with a significantly lower parameter count. Notably, iLLaMA achieved 75.1% top-1 accuracy on ImageNet with only 5.7M parameters. When scaled to approximately 310M parameters and pre-trained on ImageNet-21K, the model further pushed the accuracy to 86.0%.

Theoretical and Practical Implications

This research carves a niche for decoder-only Transformers in the vision domain, advocating for a paradigm shift in visual model design. The theoretical implications extend to the broader AI field, challenging prevailing norms around model architectures and prompting a reevaluation of the encoder-decoder dichotomy in model design strategies.

Future Perspectives

The advent of iLLaMA paves the way for more rigorous exploration of LLM architectures in the visual domain. Future research could study the scalability of such models, refine their optimization techniques, and extend decoder-only architectures to a broader spectrum of visual tasks.

In summary, iLLaMA stands as a pivotal development in bridging the gap between textual and visual model architectures, offering fresh perspectives on leveraging the strengths of LLMs within the field of computer vision.
