This paper introduces LLaVA-MORE, a family of Multimodal LLMs (MLLMs) designed to systematically evaluate the impact of different LLM backbones and visual encoders on multimodal performance. The authors highlight that, while MLLMs have advanced rapidly, fair comparisons are difficult because model components, training data, and evaluation protocols vary across works. LLaVA-MORE addresses this by applying a unified LLaVA-based training protocol consistently across architectures (Cocchi et al., 19 Mar 2025).
The core architecture follows the standard LLaVA setup: a visual encoder, a vision-language adapter (a two-layer MLP), and an LLM backbone; a minimal sketch of this wiring follows the list below. The paper explores:
- Small-scale LLMs: Phi-4-Mini (3.8B) and Gemma-2 (2B).
- Medium-scale LLMs: LLaMA-3.1 (8B), DeepSeek-R1-Distill-LLaMA (8B), and Gemma-2 (9B).
- Visual Backbones: CLIP ViT-L/14 (baseline), DINOv2 ViT-L/14 (with and without registers), SigLIP ViT-L/14, and SigLIP2 ViT-L/14.
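To make the wiring concrete, here is a minimal PyTorch sketch of a LLaVA-style composition: the vision encoder produces patch features, a two-layer MLP adapter projects them into the LLM embedding space, and the projected visual tokens are prepended to the text embeddings before the LLM forward pass. Class names, dimensions, and the `inputs_embeds` interface are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class VisionLanguageAdapter(nn.Module):
    """Two-layer MLP mapping visual features into the LLM embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (batch, num_visual_tokens, vision_dim)
        return self.proj(visual_features)


class LLaVAStyleModel(nn.Module):
    """Minimal composition: vision encoder -> MLP adapter -> LLM backbone."""

    def __init__(self, vision_encoder, adapter, llm):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.adapter = adapter
        self.llm = llm

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        # Encode the image into a sequence of patch features.
        visual_feats = self.vision_encoder(pixel_values)      # (B, N_v, D_vision)
        visual_embeds = self.adapter(visual_feats)             # (B, N_v, D_llm)
        # Prepend projected visual tokens to the text token embeddings,
        # then let the LLM attend over the joint sequence.
        inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```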
Training follows a two-stage process (a sketch of the freezing scheme follows this list):
- Pre-training: The vision-language adapter is trained to align visual features (from a frozen encoder) with the LLM's embedding space using 558k image-caption pairs (from LAION, CC3M, SBU).
- Visual Instruction Tuning: Both the adapter and the LLM are fine-tuned using high-quality visual instruction-following data to enhance conversational and reasoning abilities.
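A minimal sketch of how the two stages differ in which parameters receive gradients, assuming the `LLaVAStyleModel` composition sketched above; the helper names are hypothetical and do not mirror the released training code.

```python
def set_trainable(module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a submodule."""
    for p in module.parameters():
        p.requires_grad = trainable


def configure_stage(model, stage: str) -> None:
    """Stage 1 aligns only the adapter; stage 2 tunes adapter + LLM.
    The visual encoder stays frozen throughout."""
    set_trainable(model.vision_encoder, False)
    if stage == "pretrain":                  # 558k image-caption pairs
        set_trainable(model.adapter, True)
        set_trainable(model.llm, False)
    elif stage == "instruction_tuning":      # visual instruction-following data
        set_trainable(model.adapter, True)
        set_trainable(model.llm, True)
    else:
        raise ValueError(f"unknown stage: {stage}")
```

Calling `configure_stage(model, "pretrain")` before stage 1 and `configure_stage(model, "instruction_tuning")` before stage 2 reproduces the freezing pattern described above.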
Experiments were conducted on two groups of benchmarks:
- VQA Benchmarks: GQA, ScienceQA, TextVQA, AI2D.
- MLLM Benchmarks: POPE (object hallucination), MME (perception/cognition), MMBench (multi-domain reasoning), SEED-Bench (multimodal comprehension), MMMU (expert-level reasoning). A sketch of POPE-style scoring follows this list.
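As an illustration of how the hallucination benchmark above is typically scored, here is a small sketch of POPE-style metrics over binary yes/no answers, with "yes" treated as the positive class (the standard POPE convention); the prediction/label format is an assumption for illustration, not taken from the paper.

```python
def pope_metrics(predictions, labels):
    """Accuracy / precision / recall / F1 for binary yes-no answers."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == "yes" and l == "yes" for p, l in pairs)
    fp = sum(p == "yes" and l == "no" for p, l in pairs)
    fn = sum(p == "no" and l == "yes" for p, l in pairs)
    tn = sum(p == "no" and l == "no" for p, l in pairs)
    accuracy = (tp + tn) / max(len(pairs), 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```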
Key findings include:
- LLM Impact: Newer small-scale models such as Phi-4-Mini (3.8B) perform comparably to, or even exceed, older medium-scale models (e.g., LLaVA-1.5-7B) on several benchmarks, particularly reasoning-heavy ones (MMMU, SEED-Bench). Among medium-scale models, Gemma-2-9B excelled on the VQA benchmarks, while LLaMA-3.1-8B showed strength on MMBench.
- Visual Backbone Impact: Visual encoders pre-trained with image-text contrastive learning (CLIP, SigLIP, SigLIP2) consistently outperformed self-supervised ones (DINOv2). SigLIP and SigLIP2 variants generally yielded the best results across benchmarks, despite requiring more visual tokens due to higher input resolution (384² vs. CLIP's 336²).
- Image Resolution: Using the S² technique to increase the effective image resolution generally improved performance, especially for the smaller model (LLaVA-MORE-3.8B). However, the benefits diminished or reversed for the larger model (LLaVA-MORE-9B) on some tasks, suggesting a trade-off between resolution gains and model scale/task type; a rough sketch of the S² idea follows this list.
- Pre-training Data: The choice of pre-training data had a noticeable impact on the small-scale model (LLaVA-MORE-3.8B), with LAION-only data performing well when paired with the SigLIP2 backbone. The medium-scale model (LLaVA-MORE-9B) was less sensitive, though Recap data showed benefits for Chinese language tasks (MMB-Cn).
- No Universal Best: The results emphasize that no single combination of LLM and visual backbone excels across all tasks. Performance is highly dependent on the specific benchmark and task requirements.
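For context on the resolution finding above, here is a rough sketch of the S²-style multi-scale idea (commonly known as "Scaling on Scales"): the same frozen encoder is run on the resized full image and on higher-resolution crops, the crop features are stitched back into one grid, pooled down to the base token grid, and concatenated channel-wise, so the visual token count stays fixed while the feature dimension grows. Sizes, helper names, and the square-grid assumption are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F


def s2_features(encoder, image, base_size=336, scales=(1, 2)):
    """Rough S2-style extraction: run one encoder at several scales,
    pool each scale back to the base token grid, and concatenate
    along the channel dimension so the token count is unchanged."""
    feats_per_scale = []
    for s in scales:
        size = base_size * s
        resized = F.interpolate(image, size=(size, size),
                                mode="bilinear", align_corners=False)
        # Split the up-scaled image into s x s crops at the encoder's native resolution.
        crops = []
        for i in range(s):
            for j in range(s):
                crop = resized[:, :, i * base_size:(i + 1) * base_size,
                               j * base_size:(j + 1) * base_size]
                crops.append(encoder(crop))            # (B, N, D) patch features per crop
        grid = int(crops[0].shape[1] ** 0.5)            # assume a square patch grid
        # Stitch crop features back into one spatial grid.
        rows = []
        for i in range(s):
            row = [c.reshape(c.shape[0], grid, grid, -1) for c in crops[i * s:(i + 1) * s]]
            rows.append(torch.cat(row, dim=2))
        full = torch.cat(rows, dim=1)                   # (B, s*grid, s*grid, D)
        # Pool back to the base grid, keeping the original number of tokens.
        pooled = F.adaptive_avg_pool2d(full.permute(0, 3, 1, 2), grid)
        feats_per_scale.append(
            pooled.permute(0, 2, 3, 1).reshape(full.shape[0], grid * grid, -1))
    return torch.cat(feats_per_scale, dim=-1)           # channel-wise concat across scales
```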
The paper concludes by offering insights into designing effective MLLMs, emphasizing the competitiveness of recent small LLMs and the superiority of contrastively pre-trained visual backbones like SigLIP. The authors provide a reproducible framework and release their code and models to facilitate further research.