The paper presents a comprehensive study that improves upon previous encoder-free vision–LLMs by developing a unified decoder-only architecture that minimizes modality interference while building visual perception from scratch. The work analyzes the inherent performance gap between models using pre-trained vision encoders and those that process raw image inputs without an external encoder, and it establishes an effective training strategy to close this gap using high-quality image–text data.
The proposed model, EVEv2.0, introduces several key innovations:
1. Architectural Decomposition and Modality-Wise Sparsity
- The paper proposes fully decoupling the components of the Transformer by assigning modality‐specific parameters to the attention, LayerNorm, and feed-forward modules. This “Divide-and-Conquer” (DaC) design helps to mitigate catastrophic forgetting and reduces the interference between the language and visual modalities.
- Formally, given a token sequence $\{x_i\}_{i=1}^{N}$ in which each token carries a modality indicator $m_i \in \{\mathrm{v}, \mathrm{t}\}$, the attention layer employs separate weight matrices $W_Q^{m}$, $W_K^{m}$, and $W_V^{m}$ in the computation

  $$Q_i = x_i W_Q^{m_i}, \qquad K_i = x_i W_K^{m_i}, \qquad V_i = x_i W_V^{m_i}, \qquad \operatorname{Attn}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

  where $W_Q^{m}, W_K^{m} \in \mathbb{R}^{d \times d_k}$ and $W_V^{m} \in \mathbb{R}^{d \times d_v}$, with $d_k$ denoting the key dimension. This explicit modality-aware factorization reduces conflicting gradients during joint optimization.
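To make the modality-wise factorization concrete, here is a minimal PyTorch sketch of attention with separate per-modality Q/K/V and output projections; the class name, the two-way vision/text indexing, and the use of `scaled_dot_product_attention` are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityAwareAttention(nn.Module):
    """Modality-wise sparse attention sketch: vision and text tokens use
    separate Q/K/V and output projections, while attention itself runs
    jointly over the whole sequence."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Index 0: vision-specific parameters, index 1: text-specific parameters.
        self.qkv = nn.ModuleList([nn.Linear(dim, 3 * dim) for _ in range(2)])
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(2)])

    def forward(self, x: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D); modality: (B, N) with 0 = vision token, 1 = text token.
        B, N, D = x.shape
        qkv = torch.zeros(B, N, 3 * D, dtype=x.dtype, device=x.device)
        for m in (0, 1):
            mask = modality == m
            qkv[mask] = self.qkv[m](x[mask])  # route tokens to their own projections
        q, k, v = qkv.reshape(B, N, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        # Shared causal attention over the full sequence: softmax(QK^T / sqrt(d_k)) V.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, N, D)
        y = torch.zeros_like(x)
        for m in (0, 1):
            mask = modality == m
            y[mask] = self.proj[m](out[mask])  # modality-specific output projection
        return y
```

Routing each token through its own projections keeps the joint attention computation shared while isolating modality-specific parameters, which is the core idea behind the DaC decomposition; the same per-modality split would apply to the LayerNorm and feed-forward modules.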
2. Lossless Visual Encoding via Minimalist Patch Embedding
- Rather than using pre-trained visual backbones, the work initializes visual perception with a patch embedding layer built completely from scratch. The embedding processes an image input $I$ as

  $$\mathbf{h} = \operatorname{Conv2}\big(\operatorname{Conv1}(I)\big),$$

  with Conv1 and Conv2 having strides of 16 and 2, respectively, and producing high-dimensional representations. Special tokens, such as a class token and split tokens, are also introduced to signal prompt position and row boundaries.
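The following sketch shows what such a minimalist patch embedding might look like, assuming a GELU between the two convolutions and learnable class/split tokens appended per row; these details are assumptions for illustration, not the exact EVEv2.0 layer.

```python
import torch
import torch.nn as nn


class MinimalPatchEmbed(nn.Module):
    """From-scratch patch embedding: Conv1 (stride 16) then Conv2 (stride 2),
    plus learnable class and split tokens. The GELU nonlinearity and the exact
    token layout are illustrative assumptions."""

    def __init__(self, dim: int = 1024, in_chans: int = 3):
        super().__init__()
        self.conv1 = nn.Conv2d(in_chans, dim, kernel_size=16, stride=16)
        self.act = nn.GELU()
        self.conv2 = nn.Conv2d(dim, dim, kernel_size=2, stride=2)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))    # signals prompt position
        self.split_token = nn.Parameter(torch.zeros(1, 1, dim))  # marks each row boundary

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W) with H and W divisible by 32.
        feat = self.conv2(self.act(self.conv1(image)))           # (B, dim, H/32, W/32)
        B, D, Hp, Wp = feat.shape
        rows = feat.permute(0, 2, 3, 1)                          # (B, Hp, Wp, D)
        split = self.split_token.view(1, 1, 1, D).expand(B, Hp, 1, D)
        tokens = torch.cat([rows, split], dim=2).reshape(B, Hp * (Wp + 1), D)
        cls = self.cls_token.expand(B, 1, D)
        return torch.cat([cls, tokens], dim=1)                   # (B, 1 + Hp*(Wp+1), D)
```

Appending one split token per feature row preserves the 2D layout of the image in a flat token sequence, so arbitrary resolutions can be handled without interpolating positional embeddings.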
3. Multi-Stage Training Process
- The training procedure is divided into several stages that progressively adapt the LLM to vision–language tasks (a simplified parameter-freezing sketch follows this list):
- In the first stage, only the patch embedding layer is trained (with the LLM weights frozen) using high-quality, recaptioned image–text data.
- In subsequent stages, the vision-specific layers inside the LLM are unfrozen and optimized individually while the image resolution is gradually increased (from 800×800 up to 1600×1600) to better capture fine-grained details.
- Finally, the entire model undergoes supervised fine-tuning on diverse question–answering and instruction-following datasets, ensuring strong cross-modal understanding and reasoning.
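A simplified sketch of this stage-wise schedule is shown below; the substring checks on parameter names (`patch_embed`, `vision`) are hypothetical naming conventions used only to illustrate which parts of the model are trainable at each stage.

```python
import torch.nn as nn


def configure_stage(model: nn.Module, stage: int) -> None:
    """Toggle trainable parameters per training stage (illustrative helper,
    not EVEv2.0's actual code)."""
    for name, param in model.named_parameters():
        if stage == 1:
            # Stage 1: train only the patch embedding; the LLM stays frozen.
            param.requires_grad = name.startswith("patch_embed")
        elif stage == 2:
            # Later pre-training stages: also unfreeze the vision-specific layers.
            param.requires_grad = name.startswith("patch_embed") or "vision" in name
        else:
            # Final supervised fine-tuning: update the entire model.
            param.requires_grad = True
```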
4. Empirical Evaluations and Ablation Studies
- Extensive experiments compare EVEv2.0 against both encoder-based and prior encoder-free approaches on a variety of benchmarks (including tasks such as visual reasoning, OCR, and text-related challenges).
- Notably, EVEv2.0 outperforms previous encoder-free models (e.g., Fuyu, EVE, SOLO) and steadily approaches the performance of encoder-based counterparts while using only 100M publicly available image–text pairs.
- Ablation studies reveal that the Divide-and-Conquer design achieves faster convergence and greater stability than alternatives based on re-parameterization or mixture-of-experts (MoE). For example, in experiments comparing loss curves and task accuracy (e.g., on ScienceQA), the complete modality-specific decomposition provides a clear advantage, with average accuracy improving steadily as the scale of pre-training data grows.
5. Data Scaling and High-Quality Captioning
- The authors emphasize that a refined captioning engine based on stronger models (leveraging techniques akin to those used in advanced caption generators) substantially improves data quality and training efficiency.
- The model benefits from a careful mixture of synthesized multimodal data, language-only data, and web-sourced data, which helps balance the preservation of pre-trained linguistic knowledge and the development of visual perception capabilities.
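As a rough illustration of such a mixture, a sampling configuration might look like the following; the source names and ratios are purely hypothetical, since the paper's exact proportions are not reproduced here.

```python
import random

# Hypothetical sampling weights for the three data categories mentioned above;
# the actual proportions used by the authors may differ.
MIXTURE = {
    "recaptioned_image_text": 0.6,
    "language_only": 0.2,
    "web_sourced": 0.2,
}


def sample_source(rng: random.Random) -> str:
    """Draw one data source per training example according to the mixture weights."""
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]


rng = random.Random(0)
batch_sources = [sample_source(rng) for _ in range(8)]  # e.g., one source label per sample
```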
In summary, the paper outlines detailed methodological improvements—a novel decoder-only backbone with explicit modality decoupling, efficient patch embedding design, and a multi-stage, data-scaled training regime—that jointly lead to enhanced performance and data-scaling efficiency in encoder-free vision–LLMs. The granular experimental analysis and comprehensive ablation studies provide valuable insights into overcoming cross-modal interference, thereby offering a transparent and robust path toward next-generation, unified vision–language systems.