EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
(2502.06788v1)
Published 10 Feb 2025 in cs.CV and cs.AI
Abstract: Existing encoder-free vision-language models (VLMs) are rapidly narrowing the performance gap with their encoder-based counterparts, highlighting the promising potential for unified multimodal systems with structural simplicity and efficient deployment. We systematically clarify the performance gap between VLMs using pre-trained vision encoders, discrete tokenizers, and minimalist visual layers from scratch, deeply excavating the under-examined characteristics of encoder-free VLMs. We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones. After an in-depth investigation, we launch EVEv2.0, a new and improved family of encoder-free VLMs. We show that: (i) Properly decomposing and hierarchically associating vision and language within a unified model reduces interference between modalities. (ii) A well-designed training strategy enables effective optimization for encoder-free VLMs. Through extensive evaluation, our EVEv2.0 represents a thorough study for developing a decoder-only architecture across modalities, demonstrating superior data efficiency and strong vision-reasoning capability. Code is publicly available at: https://github.com/baaivision/EVE.
The paper presents a comprehensive study that improves upon previous encoder-free vision–language models by developing a unified decoder-only architecture that minimizes modality interference while building visual perception from scratch. The work analyzes the inherent performance gap between models using pre-trained vision encoders and those that process raw image inputs without an external encoder, and it establishes an effective training strategy to close this gap using high-quality image–text data.
The proposed model, EVEv2.0, introduces several key innovations:
1. Architectural Decomposition and Modality-Wise Sparsity
The paper proposes fully decoupling the components of the Transformer by assigning modality‐specific parameters to the attention, LayerNorm, and feed-forward modules. This “Divide-and-Conquer” (DaC) design helps to mitigate catastrophic forgetting and reduces the interference between the language and visual modalities.
Formally, given a token sequence $x = (x_1, \ldots, x_n)$ with modality indicators $u_i \in \{v, t\}$, the attention layer employs separate weight matrices $W_Q^{u_i}$, $W_K^{u_i}$, and $W_V^{u_i}$ in the computation:

$$\mathrm{ATTN}\big(x; \{\theta_{\mathrm{attn}}^{u}\}\big) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $Q_i = x_i W_Q^{u_i}$, $K_i = x_i W_K^{u_i}$, and $V_i = x_i W_V^{u_i}$, with $d_k$ denoting the key dimension. This explicit modality-aware factorization reduces conflicting gradients during joint optimization.
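To make the modality-wise routing concrete, below is a minimal PyTorch sketch, not the authors' implementation, of an attention layer with separate Q/K/V projections per modality. Multi-head splitting, causal masking, and the modality-specific LayerNorm/FFN are omitted for brevity, and the class name and `is_vision` indicator are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn


class ModalityAwareAttention(nn.Module):
    """Single-head attention with vision/text-specific Q, K, V projections."""

    def __init__(self, dim: int):
        super().__init__()
        # One projection per modality u in {"v": vision, "t": text}.
        self.q = nn.ModuleDict({"v": nn.Linear(dim, dim), "t": nn.Linear(dim, dim)})
        self.k = nn.ModuleDict({"v": nn.Linear(dim, dim), "t": nn.Linear(dim, dim)})
        self.v = nn.ModuleDict({"v": nn.Linear(dim, dim), "t": nn.Linear(dim, dim)})
        self.scale = 1.0 / math.sqrt(dim)

    def forward(self, x: torch.Tensor, is_vision: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); is_vision: (batch, seq) boolean modality indicator u_i.
        def project(proj: nn.ModuleDict) -> torch.Tensor:
            # Route each token through its modality-specific projection.
            return torch.where(is_vision.unsqueeze(-1), proj["v"](x), proj["t"](x))

        q, k, v = project(self.q), project(self.k), project(self.v)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v


# Usage: a sequence of 9 image tokens followed by 7 text tokens.
x = torch.randn(1, 16, 64)
is_vision = torch.tensor([[True] * 9 + [False] * 7])
print(ModalityAwareAttention(dim=64)(x, is_vision).shape)  # torch.Size([1, 16, 64])
```

The same routing pattern extends to the feed-forward and normalization modules, which is how the DaC design decouples the full Transformer block across modalities.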
2. Lossless Visual Encoding via Minimalist Patch Embedding
Rather than using pre-trained visual backbones, the work initializes visual perception using a patch embedding layer built completely from scratch. The embedding processes an image input $I \in \mathbb{R}^{H \times W \times 3}$ as follows:

$$x_v = \mathrm{Conv}_2\big(\mathrm{GELU}(\mathrm{Conv}_1(I))\big),$$
with Conv1 and Conv2 having strides of 16 and 2 respectively and producing high-dimensional representations. Special tokens, such as a class token and split tokens, are also introduced to signal prompt position and row boundaries.
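As a concrete illustration, here is a minimal sketch of such a from-scratch patch embedding with strides 16 and 2 (an effective 32×32 patch per token). The channel widths, kernel sizes, and the exact way the class and split tokens are inserted are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """From-scratch patch embedding: x_v = Conv2(GELU(Conv1(I)))."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        # Strides of 16 and 2 give an effective 32x32 patch per output token.
        self.conv1 = nn.Conv2d(3, dim // 2, kernel_size=16, stride=16)
        self.act = nn.GELU()
        self.conv2 = nn.Conv2d(dim // 2, dim, kernel_size=2, stride=2)
        # Special tokens: a class token signaling the prompt position and a
        # split token marking the end of each row of patches.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.split_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (batch, 3, H, W) -> feature map: (batch, dim, H/32, W/32)
        feat = self.conv2(self.act(self.conv1(image)))
        b, d, h, w = feat.shape
        rows = feat.permute(0, 2, 3, 1)                                   # (b, h, w, d)
        rows = torch.cat([rows, self.split_token.expand(b, h, 1, d)], 2)  # split token per row
        tokens = rows.reshape(b, h * (w + 1), d)
        return torch.cat([self.cls_token.expand(b, 1, d), tokens], dim=1)


# Usage: an 800x800 image gives a 25x25 grid -> 25*(25+1) + 1 = 651 tokens.
out = PatchEmbedding(dim=1024)(torch.randn(1, 3, 800, 800))
print(out.shape)  # torch.Size([1, 651, 1024])
```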
3. Multi-Stage Training Process
The training procedure is divided into several stages to progressively adapt the LLM for vision–language tasks, as sketched in code after the list:
In the first stage, only the patch embedding layer is trained (with the LLM weights frozen) using high-quality, recaptioned image–text data.
In subsequent stages, the vision layers inside the LLM are unfrozen and individually optimized, gradually increasing image resolution (from 800×800 up to 1600×1600) to better capture fine-grained details.
Finally, the entire model undergoes supervised fine-tuning on diverse question–answering and instruction-following datasets, ensuring strong cross-modal understanding and reasoning.
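A minimal sketch of how such a stage-wise freezing schedule might be wired up is given below; the submodule names (`patch_embed`, `vision_layers`, the wrapping `model`) and the three-stage split are assumptions for illustration, not the released training code.

```python
import torch.nn as nn


def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag


def configure_stage(model: nn.Module, stage: int) -> None:
    # Freeze everything, then re-enable only what the current stage trains.
    set_trainable(model, False)
    if stage == 1:
        # Stage 1: align the from-scratch patch embedding with the frozen LLM
        # on high-quality, recaptioned image-text data.
        set_trainable(model.patch_embed, True)
    elif stage == 2:
        # Stage 2: also optimize the vision layers inside the LLM, while the
        # input resolution is raised progressively (800x800 up to 1600x1600).
        set_trainable(model.patch_embed, True)
        set_trainable(model.vision_layers, True)
    else:
        # Final stage: full supervised fine-tuning on QA / instruction data.
        set_trainable(model, True)
```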
4. Empirical Evaluations and Ablation Studies
Extensive experiments compare EVEv2.0 against both encoder-based and prior encoder-free approaches on a variety of benchmarks (including tasks such as visual reasoning, OCR, and text-related challenges).
Notably, EVEv2.0 outperforms previous encoder-free models (e.g., Fuyu, EVE, SOLO) and steadily approaches the performance of encoder-based counterparts while using only 100M publicly available image–text pairs.
Ablation studies reveal that the Divide-and-Conquer design, in contrast to strategies based on re-parameterization or mixture-of-experts (MoE), achieves faster convergence and better stability. For example, in experiments comparing loss curves and task accuracy (e.g., on ScienceQA), the complete modality-specific decomposition provides a clear advantage, with average accuracy improvements that grow as the pre-training data is scaled up.
5. Data Scaling and High-Quality Captioning
The authors emphasize that a refined captioning engine based on stronger models (leveraging techniques akin to those used in advanced caption generators) substantially improves data quality and training efficiency.
The model benefits from a careful mixture of synthesized multimodal data, language-only data, and web-sourced data, which helps balance the preservation of pre-trained linguistic knowledge and the development of visual perception capabilities.
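As a rough illustration of such a mixture, the sketch below interleaves several data streams by sampling weight; the stream names and ratios are hypothetical placeholders rather than the paper's actual recipe.

```python
import random
from typing import Dict, Iterator, List


def mix_sources(sources: Dict[str, List[dict]], weights: Dict[str, float],
                seed: int = 0) -> Iterator[dict]:
    """Yield samples, picking each source with probability proportional to its weight."""
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[n] for n in names]
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        yield rng.choice(sources[name])


# Hypothetical usage with placeholder ratios (not the paper's mixture):
streams = {
    "recaptioned_image_text": [{"image": "img0.jpg", "text": "a caption"}],
    "language_only": [{"text": "plain text sample"}],
    "web_multimodal": [{"image": "img1.jpg", "text": "alt text"}],
}
sampler = mix_sources(streams, {"recaptioned_image_text": 0.6,
                                "language_only": 0.2,
                                "web_multimodal": 0.2})
print(next(sampler))
```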
In summary, the paper outlines detailed methodological improvements, namely a novel decoder-only backbone with explicit modality decoupling, an efficient patch embedding design, and a multi-stage, data-scaled training regime, which jointly lead to enhanced performance and data efficiency in encoder-free vision–language models. The granular experimental analysis and comprehensive ablation studies provide valuable insights into overcoming cross-modal interference, thereby offering a transparent and robust path toward next-generation, unified vision–language systems.