A Single Transformer for Scalable Vision-Language Modeling
Overview and Contribution
The paper introduces "SOLO," a large vision-language model (LVLM) built on a single transformer architecture, designed to address the scalability issues of existing models that couple pre-trained visual encoders with LLMs. The authors identify four core limitations in current LVLMs:
- Constrained visual capacity, since pre-trained visual encoders are typically much smaller than the LLMs they feed.
- Complicated deployment due to heterogeneous architectures.
- Complex scaling analysis involving multiple components.
- Fixed-resolution preprocessing requirements that limit the ability to handle high-resolution or irregularly shaped images.
SOLO obviates these limitations by processing image and text inputs with one unified transformer. A key contribution of the paper is the first open-source training recipe for this style of vision-language modeling, covering initialization from a pre-trained LLM, sequential pre-training, and instruction fine-tuning, all on moderate computational infrastructure (8 x A100 80GB GPUs).
Model Design and Training
The architecture is a single transformer initialized from Mistral-7B-v0.1. Images are partitioned into patches that are projected to match the transformer's input dimension, with special tokens marking the visual content in the sequence. By circumventing the constraints of pre-trained visual encoders, this design improves scalability and simplifies deployment.
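To make the patch-based input concrete, here is a minimal PyTorch sketch of how an image could be split into fixed-size patches, linearly projected into the LLM's embedding space, and bracketed by learned vision-boundary embeddings before being concatenated with text-token embeddings. The patch size, the `PatchEmbedder` module, and the boundary-token handling are illustrative assumptions rather than SOLO's exact implementation.

```python
import torch
import torch.nn as nn

PATCH = 32        # assumed square patch size in pixels; SOLO's exact value may differ
D_MODEL = 4096    # hidden size of Mistral-7B-v0.1

class PatchEmbedder(nn.Module):
    """Illustrative module: flatten raw image patches and project them into
    the same embedding space the transformer uses for text tokens."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(3 * PATCH * PATCH, D_MODEL)
        # Learnable embeddings standing in for assumed special tokens that
        # mark where visual content begins and ends in the sequence.
        self.vision_begin = nn.Parameter(torch.zeros(1, D_MODEL))
        self.vision_end = nn.Parameter(torch.zeros(1, D_MODEL))

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (3, H, W), resized/padded so H and W are multiples of PATCH
        c, h, w = image.shape
        patches = image.unfold(1, PATCH, PATCH).unfold(2, PATCH, PATCH)
        patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * PATCH * PATCH)
        vision = self.proj(patches)  # (num_patches, D_MODEL)
        return torch.cat([self.vision_begin, vision, self.vision_end], dim=0)

# Usage: the resulting embeddings would be concatenated with text-token
# embeddings and processed by the single transformer as one sequence.
embedder = PatchEmbedder()
vision_embeds = embedder(torch.rand(3, 224, 224))  # (2 + 49 patches, 4096)
```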
The training recipe spans three stages (a schematic sketch follows the list):
- Stage-1: Pre-training on ImageNet-21K to build foundational visual representations.
- Stage-2: Pre-training on web-scale data to broaden knowledge and scale up data volume.
- Stage-3: Annealing on high-quality curated datasets to transition smoothly away from noisy web data.
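The sketch below shows how the three stages could be sequenced; the dataset identifiers, learning rates, and helper functions (`build_loader`, `train_one_stage`) are hypothetical placeholders, not the paper's actual hyperparameters.

```python
# Hypothetical three-stage schedule mirroring the recipe described above.
STAGES = [
    {"name": "stage1_imagenet",  "data": ["imagenet21k"],                 "lr": 1e-4},
    {"name": "stage2_webscale",  "data": ["web_image_text", "web_text"],  "lr": 5e-5},
    {"name": "stage3_annealing", "data": ["curated_high_quality"],        "lr": 1e-5},
]

def run_recipe(model, build_loader, train_one_stage):
    """Run the stages sequentially, carrying the weights from one stage
    into the next (build_loader / train_one_stage are assumed helpers)."""
    for stage in STAGES:
        loader = build_loader(stage["data"])
        model = train_one_stage(model, loader, lr=stage["lr"])
        print(f"finished {stage['name']}")
    return model
```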
Validation studies confirm that, without the initial ImageNet pre-training stage, the model generates meaningless captions despite achieving a comparable vision-language modeling loss. This underscores the necessity of a carefully phased training approach.
Evaluation
Extensive evaluations compare SOLO with existing LVLMs across several benchmarks, including MMStar, MME, and SEED-Bench, as well as specialized datasets such as AI2D and MathVista. SOLO performs on par with established LVLMs such as LLaVA-v1.5-7B and is particularly strong on visual mathematical reasoning. Although it still trails the latest state-of-the-art (SoTA) LVLMs, SOLO offers substantial advantages in scalability and adaptability, making it a strong foundation for future development.
Implications and Future Directions
The simplicity of the unified transformer architecture points to a promising direction for future scalable AI models. Areas for further exploration include:
- Improving SOLO's foundational language abilities, without compromising its vision capabilities, by incorporating higher-quality textual datasets.
- Establishing reliable metrics that can accurately forecast downstream task performance during the pre-training phase.
- Enhancing supervised fine-tuning datasets to mitigate the risk of overfitting from repeated data exposure.
By addressing the limitations of pre-trained visual encoders, SOLO demonstrates that a unified transformer approach can maintain competitive performance while being more straightforward to scale, train, and deploy.
Conclusion
This work signifies a notable shift in vision-language modeling, presenting a scalable, unified transformer-based framework as a viable alternative to models reliant on pre-trained encoders. The extensive analysis and reproducible training recipe offer a strong foundation for future research and practical applications in scalable vision-language modeling. As the field advances, the approach and insights detailed in this paper are poised to play a critical role in shaping the next generation of AI systems.