This paper introduces LLaVA-MORE, a family of Multimodal LLMs (MLLMs) designed to systematically evaluate the impact of different LLM backbones and visual encoders on multimodal performance. The authors highlight that, while MLLMs have advanced rapidly, fair comparisons are difficult because model components, training data, and evaluation protocols vary across works. LLaVA-MORE addresses this by applying a unified LLaVA-based training protocol consistently across architectures (Cocchi et al., 19 Mar 2025).
The core architecture follows the standard LLaVA setup: a visual encoder, a vision-language adapter (a two-layer MLP), and an LLM backbone; a minimal sketch of this wiring follows the list below. The paper explores:
- Small-scale LLMs: Phi-4-Mini (3.8B) and Gemma-2 (2B).
- Medium-scale LLMs: LLaMA-3.1 (8B), DeepSeek-R1-Distill-LLaMA (8B), and Gemma-2 (9B).
- Visual Backbones: CLIP ViT-L/14 (baseline), DINOv2 ViT-L/14 (with and without registers), SigLIP ViT-L/14, and SigLIP2 ViT-L/14.
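To make the wiring concrete, here is a minimal PyTorch sketch of a LLaVA-style composition: the vision encoder produces patch features, a two-layer MLP adapter projects them into the LLM embedding space, and the projected visual tokens are prepended to the text embeddings before the LLM forward pass. Class names, dimensions, and the `inputs_embeds` interface are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class VisionLanguageAdapter(nn.Module):
    """Two-layer MLP mapping visual features into the LLM embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (batch, num_visual_tokens, vision_dim)
        return self.proj(visual_features)


class LLaVAStyleModel(nn.Module):
    """Minimal composition: vision encoder -> MLP adapter -> LLM backbone."""

    def __init__(self, vision_encoder, adapter, llm):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.adapter = adapter
        self.llm = llm

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        # Encode the image into a sequence of patch features.
        visual_feats = self.vision_encoder(pixel_values)      # (B, N_v, D_vision)
        visual_embeds = self.adapter(visual_feats)             # (B, N_v, D_llm)
        # Prepend projected visual tokens to the text token embeddings,
        # then let the LLM attend over the joint sequence.
        inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```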
Training follows a two-stage process (a sketch of the freezing scheme follows this list):
- Pre-training: The vision-language adapter is trained to align visual features (from a frozen encoder) with the LLM's embedding space using 558k image-caption pairs (from LAION, CC3M, SBU).
- Visual Instruction Tuning: Both the adapter and the LLM are fine-tuned using high-quality visual instruction-following data to enhance conversational and reasoning abilities.
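A minimal sketch of how the two stages differ in which parameters receive gradients, assuming the `LLaVAStyleModel` composition sketched above; the helper names are hypothetical and do not mirror the released training code.

```python
def set_trainable(module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a submodule."""
    for p in module.parameters():
        p.requires_grad = trainable


def configure_stage(model, stage: str) -> None:
    """Stage 1 aligns only the adapter; stage 2 tunes adapter + LLM.
    The visual encoder stays frozen throughout."""
    set_trainable(model.vision_encoder, False)
    if stage == "pretrain":                  # 558k image-caption pairs
        set_trainable(model.adapter, True)
        set_trainable(model.llm, False)
    elif stage == "instruction_tuning":      # visual instruction-following data
        set_trainable(model.adapter, True)
        set_trainable(model.llm, True)
    else:
        raise ValueError(f"unknown stage: {stage}")
```

Calling `configure_stage(model, "pretrain")` before stage 1 and `configure_stage(model, "instruction_tuning")` before stage 2 reproduces the freezing pattern described above.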
Experiments were conducted on two groups of benchmarks:
- VQA Benchmarks: GQA, ScienceQA, TextVQA, AI2D.
- MLLM Benchmarks: POPE (object hallucination), MME (perception/cognition), MMBench (multi-domain reasoning), SEED-Bench (multimodal comprehension), MMMU (expert-level reasoning). A sketch of POPE-style scoring follows this list.
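As an illustration of how the hallucination benchmark above is typically scored, here is a small sketch of POPE-style metrics over binary yes/no answers, with "yes" treated as the positive class (the standard POPE convention); the prediction/label format is an assumption for illustration, not taken from the paper.

```python
def pope_metrics(predictions, labels):
    """Accuracy / precision / recall / F1 for binary yes-no answers."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == "yes" and l == "yes" for p, l in pairs)
    fp = sum(p == "yes" and l == "no" for p, l in pairs)
    fn = sum(p == "no" and l == "yes" for p, l in pairs)
    tn = sum(p == "no" and l == "no" for p, l in pairs)
    accuracy = (tp + tn) / max(len(pairs), 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```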
Key findings include:
- LLM Impact: Newer small-scale models such as Phi-4-Mini (3.8B) perform comparably to, or even exceed, older medium-scale models (e.g., LLaVA-1.5-7B) on several benchmarks, particularly reasoning-heavy ones (MMMU, SEED-Bench). Among medium-scale models, Gemma-2-9B excelled on the VQA benchmarks, while LLaMA-3.1-8B showed strength on MMBench.
- Visual Backbone Impact: Visual encoders pre-trained with image-text contrastive learning (CLIP, SigLIP, SigLIP2) consistently outperformed self-supervised ones (DINOv2). SigLIP and SigLIP2 variants generally yielded the best results across benchmarks, despite requiring more visual tokens due to higher input resolution (384² vs. CLIP's 336²).
- Image Resolution: Using the S² technique to increase the effective image resolution generally improved performance, especially for the smaller model (LLaVA-MORE-3.8B). However, the benefits diminished or reversed for the larger model (LLaVA-MORE-9B) on some tasks, suggesting a trade-off between resolution gains and model scale/task type; a rough sketch of the S² idea follows this list.
- Pre-training Data: The choice of pre-training data had a noticeable impact on the small-scale model (LLaVA-MORE-3.8B), with LAION-only data performing well when paired with the SigLIP2 backbone. The medium-scale model (LLaVA-MORE-9B) was less sensitive, though Recap data showed benefits for Chinese language tasks (MMB-Cn).
- No Universal Best: The results emphasize that no single combination of LLM and visual backbone excels across all tasks. Performance is highly dependent on the specific benchmark and task requirements.
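For context on the resolution finding above, here is a rough sketch of the S²-style multi-scale idea (commonly known as "Scaling on Scales"): the same frozen encoder is run on the resized full image and on higher-resolution crops, the crop features are stitched back into one grid, pooled down to the base token grid, and concatenated channel-wise, so the visual token count stays fixed while the feature dimension grows. Sizes, helper names, and the square-grid assumption are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F


def s2_features(encoder, image, base_size=336, scales=(1, 2)):
    """Rough S2-style extraction: run one encoder at several scales,
    pool each scale back to the base token grid, and concatenate
    along the channel dimension so the token count is unchanged."""
    feats_per_scale = []
    for s in scales:
        size = base_size * s
        resized = F.interpolate(image, size=(size, size),
                                mode="bilinear", align_corners=False)
        # Split the up-scaled image into s x s crops at the encoder's native resolution.
        crops = []
        for i in range(s):
            for j in range(s):
                crop = resized[:, :, i * base_size:(i + 1) * base_size,
                               j * base_size:(j + 1) * base_size]
                crops.append(encoder(crop))            # (B, N, D) patch features per crop
        grid = int(crops[0].shape[1] ** 0.5)            # assume a square patch grid
        # Stitch crop features back into one spatial grid.
        rows = []
        for i in range(s):
            row = [c.reshape(c.shape[0], grid, grid, -1) for c in crops[i * s:(i + 1) * s]]
            rows.append(torch.cat(row, dim=2))
        full = torch.cat(rows, dim=1)                   # (B, s*grid, s*grid, D)
        # Pool back to the base grid, keeping the original number of tokens.
        pooled = F.adaptive_avg_pool2d(full.permute(0, 3, 1, 2), grid)
        feats_per_scale.append(
            pooled.permute(0, 2, 3, 1).reshape(full.shape[0], grid * grid, -1))
    return torch.cat(feats_per_scale, dim=-1)           # channel-wise concat across scales
```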
The paper concludes by offering insights into designing effective MLLMs, emphasizing the competitiveness of recent small LLMs and the superiority of contrastively pre-trained visual backbones like SigLIP. The authors provide a reproducible framework and release their code and models to facilitate further research.