Unveiling Encoder-Free Vision-Language Models
The paper "Unveiling Encoder-Free Vision-LLMs" presents a novel approach towards simplifying the architecture of Vision-LLMs (VLMs) by proposing an encoder-free, decoder-only structure with the introduction of the EVE model. The research addresses critical limitations present in conventional encoder-based VLMs, such as fixed image resolutions, deployment inefficiencies, and mismatched model capacities between vision and language components.
Key Contributions
- Decoder-Only Architecture: EVE eliminates the need for a separate vision encoder, integrating visual perception and language understanding within a single decoder. This shift aligns the visual pathway with the inherent structure of LLMs and offers greater flexibility in handling varying image resolutions and aspect ratios (a minimal architectural sketch follows this list).
- Training Recipes and Efficiency: The authors identify training recipes that improve convergence and performance in encoder-free settings, including:
- Bridging vision-language representations using a unified decoder.
- Employing additional supervision to bolster visual recognition capabilities, thus maintaining the visual acuity traditionally provided by vision encoders.
- Competitive Performance: Trained on only 35 million publicly available samples, EVE matches or surpasses encoder-based VLMs of similar capacity across several benchmarks. Notably, it outperforms Fuyu-8B, a counterpart encoder-free VLM whose training procedures and data sources are undisclosed.
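To ground the architectural idea, here is a minimal sketch of an encoder-free, decoder-only forward pass in PyTorch: raw pixels are patchified by a lightweight embedding layer and processed together with text tokens in one causal decoder. The class name, layer sizes, and single-convolution patch embedding are illustrative assumptions, not EVE's actual implementation.

```python
# Minimal sketch of an encoder-free, decoder-only VLM (illustrative, not EVE's code).
import torch
import torch.nn as nn

class EncoderFreeVLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_heads=8,
                 n_layers=6, patch_size=16):
        super().__init__()
        # A lightweight patch embedding replaces the deep vision encoder:
        # one strided convolution turns raw pixels into patch tokens.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch_size,
                                     stride=patch_size)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # One causal transformer stack processes image and text tokens jointly.
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, text_ids):
        # Any resolution divisible by patch_size can be patchified, which is
        # how an encoder-free design relaxes fixed-resolution constraints.
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, P, D)
        text = self.token_embed(text_ids)                              # (B, T, D)
        seq = torch.cat([patches, text], dim=1)                        # (B, P+T, D)
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        hidden = self.decoder(seq, mask=mask)
        # Predict next text tokens from the positions after the image.
        return self.lm_head(hidden[:, patches.size(1):])

model = EncoderFreeVLM()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 12)))
print(logits.shape)  # torch.Size([1, 12, 32000])
```

Because the patch embedding is just a strided convolution, the same module accepts any input whose sides are multiples of the patch size, illustrating the flexibility that a pre-trained, fixed-resolution encoder would otherwise block.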
Methodology
The approach rests on two main components:
- Patch Embedding and Aligning Layers: These lightweight layers replace a deep, pre-trained vision encoder, letting image information flow into the decoder almost losslessly. By transmitting visual signals directly and aligning them with textual labels and patch-level features, the model avoids depending on a pre-trained, fixed-resolution vision encoder.
- Three-Stage Training Procedure (a sketch of the staging follows this list):
- Stage 1: Initial alignment of the vision and language modalities, anchored by the stability of the pre-trained LLM. This step is crucial to avoid model collapse and to stabilize convergence.
- Stage 2: Large-scale generative pretraining to strengthen vision-language cohesion, ensuring the unified decoder maintains balanced capabilities across modalities.
- Stage 3: Supervised fine-tuning on vision-language and NLP-specific tasks to refine instruction following and dialogue comprehension.
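As a rough illustration of how such staging might be implemented, the sketch below switches the trainable parameter groups of the toy `EncoderFreeVLM` from the earlier example across the three stages. The freezing schedule, attribute names, and learning rates are assumptions for illustration, not the paper's exact recipe.

```python
# Illustrative stage configuration for the EncoderFreeVLM sketch above.
import torch

def configure_stage(model, stage):
    """Freeze/unfreeze parameter groups for one training stage (sketch)."""
    # Stage 1 (assumption): only the new patch embedding is trained while the
    # pre-trained language decoder stays frozen, mirroring the "LLM stability"
    # point above and guarding against collapse.
    # Stages 2-3: generative pretraining and supervised fine-tuning update
    # the whole unified decoder.
    train_llm = stage >= 2
    for p in model.patch_embed.parameters():
        p.requires_grad = True
    for module in (model.token_embed, model.decoder, model.lm_head):
        for p in module.parameters():
            p.requires_grad = train_llm
    trainable = [p for p in model.parameters() if p.requires_grad]
    # Placeholder learning rates; a real recipe would tune these per stage.
    lr = 4e-4 if stage == 1 else 2e-5
    return torch.optim.AdamW(trainable, lr=lr)

# Usage with the EncoderFreeVLM sketch above:
# optimizer = configure_stage(model, stage=1)  # align new vision layers only
# optimizer = configure_stage(model, stage=2)  # full generative pretraining
# optimizer = configure_stage(model, stage=3)  # supervised fine-tuning
```

The design choice mirrored here is that the new vision layers are trained first against a frozen language backbone, so the decoder's language ability is not disrupted before the visual inputs become meaningful.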
Performance Evaluation
EVE was evaluated on diverse benchmarks, including standardized datasets such as VQA-v2, GQA, VizWiz, and TextVQA. It delivered competitive accuracy while lifting constraints common to existing VLMs, notably fixed input image resolutions, and showed lower computational overhead and substantially better inference latency in deployment, positioning it as a practical alternative for real-world applications.
Implications and Future Directions
The EVE model opens new avenues for research in multi-modal AI, where encoder-free architectures can improve VLM flexibility, ease of deployment, and resource efficiency while matching or exceeding encoder-based benchmarks. Future work could extend the approach to broader modalities such as audio or video, leveraging the adaptability and streamlined processing of the EVE framework.
Exploring larger and more comprehensive training datasets could further strengthen both the vision and language capabilities of encoder-free models, closing any remaining performance gaps with traditional VLMs. Research could also investigate mixed-data strategies or mixture-of-experts approaches to alleviate the dilution of language skills during extensive vision-language training.
In summary, by pioneering an encoder-free structural paradigm, the paper charts a promising and efficient route for VLM development, addressing critical limitations inherent in traditional encoder-based approaches.