
Adapting LLaMA Decoder to Vision Transformer (2404.06773v4)

Published 10 Apr 2024 in cs.CV

Abstract: This work examines whether decoder-only Transformers such as LLaMA, originally designed for LLMs, can be adapted to the computer vision field. We first "LLaMAfy" a standard ViT step-by-step to align with LLaMA's architecture, and find that directly applying a causal mask to the self-attention brings an attention collapse issue, resulting in failure of network training. We suggest repositioning the class token behind the image tokens with a post-sequence class token technique to overcome this challenge, enabling causal self-attention to efficiently capture the entire image's information. Additionally, we develop a soft mask strategy that gradually introduces a causal mask to the self-attention at the onset of training to facilitate optimization. The tailored model, dubbed image LLaMA (iLLaMA), is akin to LLaMA in architecture and enables direct supervised learning. Its causal self-attention boosts computational efficiency and learns complex representations by elevating attention map ranks. iLLaMA rivals the performance of its encoder-only counterparts, achieving 75.1% ImageNet top-1 accuracy with only 5.7M parameters. Scaling the model to ~310M parameters and pre-training on ImageNet-21K further enhances the accuracy to 86.0%. Extensive experiments demonstrate iLLaMA's reliable properties: shape-texture bias, calibration, quantization compatibility, ADE20K segmentation, and CIFAR transfer learning. We hope our study can kindle fresh views of visual architectures in the wave of LLMs and inspire the development of unified multimodal models. Pre-trained models and code are available at https://github.com/techmonsterwang/iLLaMA.


Summary

  • The paper introduces iLLaMA, a novel adaptation of decoder-only LLaMA for computer vision that achieves 75.1% top-1 accuracy on ImageNet with only 5.7M parameters.
  • The methodology leverages architectural modifications like SwiGLU in FFNs and a hybrid positional embedding strategy to enhance model performance and efficiency.
  • The study implements a soft mask training strategy to emulate human visual focus, ensuring smoother optimization and improved representation learning.

Adapting Decoder-Only Transformers for Computer Vision: The Advent of iLLaMA

Introduction

The convergence of LLMs and computer vision holds promising avenues for redefining model architectures in vision tasks. The research introduces "image LLaMA" (iLLaMA), an adaptation of the decoder-only Transformers typically used in LLMs to visual perception. By tailoring the LLaMA architecture to image processing, iLLaMA addresses the architectural misalignment between textual and visual models, leveraging causal self-attention to enhance computational efficiency and representation learning.

Methodology

Architectural Modifications

iLLaMA's development involved several critical architectural adjustments to align with the LLaMA structure while addressing unique challenges in visual data processing; a brief code sketch of these components follows the list:

  • Feed-Forward Network (FFN) adjustments revealed that replacing MLPs with SwiGLU, at comparable computational cost, significantly boosts performance.
  • Normalization changes, replacing layer normalization (LN) with RMSNorm, showed a trade-off between complexity and accuracy across model sizes.
  • Implementing causal self-attention presented a unique set of challenges, including attention collapse. A post-sequence class token technique and a modified causal mask effectively counter these issues.
  • Positional embedding adaptations demonstrated that combining learnable positional embeddings (LPE) with rotary positional embeddings (RoPE) enhances accuracy, suggesting synergy between the two approaches.
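
The sketch below (PyTorch) shows how these pieces can fit together: RMSNorm in place of LayerNorm, a SwiGLU FFN, causal self-attention, and a class token appended after the patch tokens so that, under the causal mask, it can still attend to every image token. Module names, dimensions, and hyperparameters are assumptions chosen for illustration, not the authors' released implementation.

```python
# Minimal iLLaMA-style block sketch (assumed structure, not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Normalize by the root mean square only (no mean subtraction, no bias).
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


class SwiGLU(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_value = nn.Linear(dim, hidden_dim, bias=False)
        self.w_out = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w_out(F.silu(self.w_gate(x)) * self.w_value(x))


class CausalBlock(nn.Module):
    def __init__(self, dim, num_heads, mlp_ratio=8 / 3):
        super().__init__()
        self.norm1 = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = RMSNorm(dim)
        self.ffn = SwiGLU(dim, int(dim * mlp_ratio))

    def forward(self, x, attn_mask):
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + a
        return x + self.ffn(self.norm2(x))


def post_sequence_tokens(patch_tokens, cls_token):
    # Post-sequence class token: append the class token AFTER the image tokens,
    # so under a causal mask it can still see every patch.
    b = patch_tokens.shape[0]
    return torch.cat([patch_tokens, cls_token.expand(b, -1, -1)], dim=1)


# Usage sketch: 196 patch tokens + 1 trailing class token, additive causal mask.
tokens = post_sequence_tokens(torch.randn(2, 196, 192), torch.zeros(1, 1, 192))
mask = torch.triu(torch.full((197, 197), float("-inf")), diagonal=1)
block = CausalBlock(dim=192, num_heads=3)
out = block(tokens, mask)
cls_features = out[:, -1]  # only the last token is guaranteed to see the whole image
```

Because attention here is strictly left-to-right, only the trailing class token is guaranteed visibility over all image tokens, which is why the classification head reads the last position rather than the first.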

Training Technique Innovations

The paper also introduces a soft mask strategy aimed at stabilizing the early training phase. Inspired by human visual focus mechanisms, the approach gradually transitions from bi-directional to causal self-attention, mirroring the progressive sharpening of attentional focus. This yields a smoother optimization landscape and improves both initial training behavior and final model performance.
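
A minimal sketch of such a schedule is shown below. The specific annealing rule (the log of a linear blend between an all-ones mask and the binary causal mask, ramped over the first epochs) is an assumption made for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a soft causal-mask schedule (assumed annealing rule).
import torch


def soft_causal_mask(seq_len: int, progress: float) -> torch.Tensor:
    """Additive attention bias; `progress` goes from 0.0 to 1.0.

    progress = 0.0 -> all zeros (plain bi-directional attention)
    progress = 1.0 -> -inf above the diagonal (standard hard causal mask)
    """
    allowed = torch.tril(torch.ones(seq_len, seq_len))           # 1 = visible
    blend = (1.0 - progress) * torch.ones_like(allowed) + progress * allowed
    return torch.log(blend.clamp(min=0.0))                       # log(0) = -inf


# Example: anneal over the first 20 epochs, then keep the hard causal mask.
for epoch in range(100):
    progress = min(epoch / 20.0, 1.0)
    attn_bias = soft_causal_mask(197, progress)
    # ... pass `attn_bias` as the attention mask for this epoch's batches ...
```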

Experimental Insights

Computational Efficiency and Representation Learning

iLLaMA demonstrates notable gains in computational efficiency through its tailored causal self-attention. Furthermore, an analysis of attention map ranks provides empirical evidence of iLLaMA's capacity to learn complex image representations, underscoring its potential to capture intricate patterns with higher fidelity than its encoder-only counterparts.
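
As a rough illustration of this kind of diagnostic, the snippet below (an assumption, not the paper's analysis code) computes the numerical rank of per-head attention maps with and without a causal mask. On random features both variants are typically full rank; the interesting comparison is on a trained model's attention maps, where the paper argues the causal variant yields higher ranks.

```python
# Illustrative attention-rank diagnostic (assumed, not the official analysis).
import torch
import torch.nn.functional as F


def attention_map_rank(q, k, causal=False):
    # q, k: (batch, heads, tokens, head_dim)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    if causal:
        n = scores.shape[-1]
        future = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    attn = F.softmax(scores, dim=-1)          # (batch, heads, tokens, tokens)
    return torch.linalg.matrix_rank(attn)     # rank per (batch, head)


q, k = torch.randn(2, 3, 197, 64), torch.randn(2, 3, 197, 64)
print("bi-directional:", attention_map_rank(q, k, causal=False).float().mean().item())
print("causal:        ", attention_map_rank(q, k, causal=True).float().mean().item())
```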

Model Evaluation and Benchmarking

Extensive benchmarking across multiple datasets affirmed iLLaMA's competitiveness, showcasing remarkable performance with a significantly lower parameter count. Notably, iLLaMA achieved 75.1% top-1 accuracy on ImageNet with only 5.7M parameters. When scaled to approximately 310M parameters and pre-trained on ImageNet-21K, the model further pushed the accuracy to 86.0%.

Theoretical and Practical Implications

This research carves a niche for decoder-only Transformers in the vision domain, advocating for a paradigm shift in visual model design. The theoretical implications extend to the broader AI field, challenging prevailing norms around model architectures and prompting a reevaluation of the encoder-decoder dichotomy in model design strategies.

Future Perspectives

The advent of iLLaMA paves the way for more rigorous exploration of LLM architectures in the visual domain. Future research could study the scalability of such models, refine their optimization techniques, and extend decoder-only architectures to a broader spectrum of visual tasks.

In summary, iLLaMA stands as a pivotal development in bridging the gap between textual and visual model architectures, offering fresh perspectives on leveraging the strengths of LLMs within the field of computer vision.
