EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
(2502.06788v1)
Published 10 Feb 2025 in cs.CV and cs.AI
Abstract: Existing encoder-free vision-language models (VLMs) are rapidly narrowing the performance gap with their encoder-based counterparts, highlighting the promising potential for unified multimodal systems with structural simplicity and efficient deployment. We systematically clarify the performance gap between VLMs using pre-trained vision encoders, discrete tokenizers, and minimalist visual layers from scratch, deeply excavating the under-examined characteristics of encoder-free VLMs. We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones. After an in-depth investigation, we launch EVEv2.0, a new and improved family of encoder-free VLMs. We show that: (i) Properly decomposing and hierarchically associating vision and language within a unified model reduces interference between modalities. (ii) A well-designed training strategy enables effective optimization for encoder-free VLMs. Through extensive evaluation, our EVEv2.0 represents a thorough study for developing a decoder-only architecture across modalities, demonstrating superior data efficiency and strong vision-reasoning capability. Code is publicly available at: https://github.com/baaivision/EVE.
The paper presents a comprehensive study that improves upon previous encoder-free vision–language models by developing a unified decoder-only architecture that minimizes modality interference while building visual perception from scratch. The work analyzes the inherent performance gap between models using pre-trained vision encoders and those that process raw image inputs without an external encoder, and it establishes an effective training strategy to close this gap using high-quality image–text data.
The proposed model, EVEv2.0, introduces several key innovations:
1. Architectural Decomposition and Modality-Wise Sparsity
The paper proposes fully decoupling the components of the Transformer by assigning modality‐specific parameters to the attention, LayerNorm, and feed-forward modules. This “Divide-and-Conquer” (DaC) design helps to mitigate catastrophic forgetting and reduces the interference between the language and visual modalities.
Formally, given a token sequence $x = (x_1, \ldots, x_n)$ with modality indicators $u_i \in \{v, t\}$, the attention layer employs separate weight matrices $W_Q^{u_i}$, $W_K^{u_i}$, and $W_V^{u_i}$ in the computation:

$$\mathrm{ATTN}\big(x; \{\theta_{\mathrm{attn}}^{u}\}\big) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $Q_i = x_i W_Q^{u_i}$, $K_i = x_i W_K^{u_i}$, and $V_i = x_i W_V^{u_i}$, with $d_k$ denoting the key dimension. This explicit modality-aware factorization reduces conflicting gradients during joint optimization.
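To make the modality-wise routing concrete, below is a minimal PyTorch sketch, not the authors' implementation, of an attention layer with separate Q/K/V projections per modality. Multi-head splitting, causal masking, and the modality-specific LayerNorm/FFN are omitted for brevity, and the class name and `is_vision` indicator are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn


class ModalityAwareAttention(nn.Module):
    """Single-head attention with vision/text-specific Q, K, V projections."""

    def __init__(self, dim: int):
        super().__init__()
        # One projection per modality u in {"v": vision, "t": text}.
        self.q = nn.ModuleDict({"v": nn.Linear(dim, dim), "t": nn.Linear(dim, dim)})
        self.k = nn.ModuleDict({"v": nn.Linear(dim, dim), "t": nn.Linear(dim, dim)})
        self.v = nn.ModuleDict({"v": nn.Linear(dim, dim), "t": nn.Linear(dim, dim)})
        self.scale = 1.0 / math.sqrt(dim)

    def forward(self, x: torch.Tensor, is_vision: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); is_vision: (batch, seq) boolean modality indicator u_i.
        def project(proj: nn.ModuleDict) -> torch.Tensor:
            # Route each token through its modality-specific projection.
            return torch.where(is_vision.unsqueeze(-1), proj["v"](x), proj["t"](x))

        q, k, v = project(self.q), project(self.k), project(self.v)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v


# Usage: a sequence of 9 image tokens followed by 7 text tokens.
x = torch.randn(1, 16, 64)
is_vision = torch.tensor([[True] * 9 + [False] * 7])
print(ModalityAwareAttention(dim=64)(x, is_vision).shape)  # torch.Size([1, 16, 64])
```

The same routing pattern extends to the feed-forward and normalization modules, which is how the DaC design decouples the full Transformer block across modalities.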
2. Lossless Visual Encoding via Minimalist Patch Embedding
Rather than using pre-trained visual backbones, the work initializes visual perception using a patch embedding layer built completely from scratch. The embedding processes an image input $I \in \mathbb{R}^{H \times W \times 3}$ as follows:

$$x_v = \mathrm{Conv}_2\big(\mathrm{GELU}(\mathrm{Conv}_1(I))\big),$$
with Conv1 and Conv2 having strides of 16 and 2 respectively and producing high-dimensional representations. Special tokens, such as a class token and split tokens, are also introduced to signal prompt position and row boundaries.
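As a concrete illustration, here is a minimal sketch of such a from-scratch patch embedding with strides 16 and 2 (an effective 32×32 patch per token). The channel widths, kernel sizes, and the exact way the class and split tokens are inserted are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """From-scratch patch embedding: x_v = Conv2(GELU(Conv1(I)))."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        # Strides of 16 and 2 give an effective 32x32 patch per output token.
        self.conv1 = nn.Conv2d(3, dim // 2, kernel_size=16, stride=16)
        self.act = nn.GELU()
        self.conv2 = nn.Conv2d(dim // 2, dim, kernel_size=2, stride=2)
        # Special tokens: a class token signaling the prompt position and a
        # split token marking the end of each row of patches.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.split_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (batch, 3, H, W) -> feature map: (batch, dim, H/32, W/32)
        feat = self.conv2(self.act(self.conv1(image)))
        b, d, h, w = feat.shape
        rows = feat.permute(0, 2, 3, 1)                                   # (b, h, w, d)
        rows = torch.cat([rows, self.split_token.expand(b, h, 1, d)], 2)  # split token per row
        tokens = rows.reshape(b, h * (w + 1), d)
        return torch.cat([self.cls_token.expand(b, 1, d), tokens], dim=1)


# Usage: an 800x800 image gives a 25x25 grid -> 25*(25+1) + 1 = 651 tokens.
out = PatchEmbedding(dim=1024)(torch.randn(1, 3, 800, 800))
print(out.shape)  # torch.Size([1, 651, 1024])
```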
3. Multi-Stage Training Process
The training procedure is divided into several stages to progressively adapt the LLM for vision–language tasks, as sketched in code after the list:
In the first stage, only the patch embedding layer is trained (with the LLM weights frozen) using high-quality, recaptioned image–text data.
In subsequent stages, the vision layers inside the LLM are unfrozen and individually optimized, gradually increasing image resolution (from 800×800 up to 1600×1600) to better capture fine-grained details.
Finally, the entire model undergoes supervised fine-tuning on diverse question–answering and instruction-following datasets, ensuring strong cross-modal understanding and reasoning.
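A minimal sketch of how such a stage-wise freezing schedule might be wired up is given below; the submodule names (`patch_embed`, `vision_layers`, the wrapping `model`) and the three-stage split are assumptions for illustration, not the released training code.

```python
import torch.nn as nn


def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag


def configure_stage(model: nn.Module, stage: int) -> None:
    # Freeze everything, then re-enable only what the current stage trains.
    set_trainable(model, False)
    if stage == 1:
        # Stage 1: align the from-scratch patch embedding with the frozen LLM
        # on high-quality, recaptioned image-text data.
        set_trainable(model.patch_embed, True)
    elif stage == 2:
        # Stage 2: also optimize the vision layers inside the LLM, while the
        # input resolution is raised progressively (800x800 up to 1600x1600).
        set_trainable(model.patch_embed, True)
        set_trainable(model.vision_layers, True)
    else:
        # Final stage: full supervised fine-tuning on QA / instruction data.
        set_trainable(model, True)
```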
4. Empirical Evaluations and Ablation Studies
Extensive experiments compare EVEv2.0 against both encoder-based and prior encoder-free approaches on a variety of benchmarks (including tasks such as visual reasoning, OCR, and text-related challenges).
Notably, EVEv2.0 outperforms previous encoder-free models (e.g., Fuyu, EVE, SOLO) and steadily approaches the performance of encoder-based counterparts while using only 100M publicly available image–text pairs.
Ablation studies reveal that the Divide-and-Conquer design, in contrast to strategies based on re-parameterization or mixture-of-experts (MoE), achieves faster convergence and better stability. For example, in experiments comparing loss curves and task accuracy (e.g., on ScienceQA), the complete modality-specific decomposition provides a clear advantage, with average accuracy improvements that grow as the pre-training data is scaled up.
5. Data Scaling and High-Quality Captioning
The authors emphasize that a refined captioning engine based on stronger models (leveraging techniques akin to those used in advanced caption generators) substantially improves data quality and training efficiency.
The model benefits from a careful mixture of synthesized multimodal data, language-only data, and web-sourced data, which helps balance the preservation of pre-trained linguistic knowledge and the development of visual perception capabilities.
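As a rough illustration of such a mixture, the sketch below interleaves several data streams by sampling weight; the stream names and ratios are hypothetical placeholders rather than the paper's actual recipe.

```python
import random
from typing import Dict, Iterator, List


def mix_sources(sources: Dict[str, List[dict]], weights: Dict[str, float],
                seed: int = 0) -> Iterator[dict]:
    """Yield samples, picking each source with probability proportional to its weight."""
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[n] for n in names]
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        yield rng.choice(sources[name])


# Hypothetical usage with placeholder ratios (not the paper's mixture):
streams = {
    "recaptioned_image_text": [{"image": "img0.jpg", "text": "a caption"}],
    "language_only": [{"text": "plain text sample"}],
    "web_multimodal": [{"image": "img1.jpg", "text": "alt text"}],
}
sampler = mix_sources(streams, {"recaptioned_image_text": 0.6,
                                "language_only": 0.2,
                                "web_multimodal": 0.2})
print(next(sampler))
```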
In summary, the paper outlines detailed methodological improvements, namely a novel decoder-only backbone with explicit modality decoupling, an efficient patch embedding design, and a multi-stage, data-scaled training regime, which jointly lead to enhanced performance and data efficiency in encoder-free vision–language models. The granular experimental analysis and comprehensive ablation studies provide valuable insights into overcoming cross-modal interference, thereby offering a transparent and robust path toward next-generation, unified vision–language systems.