LLaVA-MORE: Multimodal LLM Advances
- LLaVA-MORE is a multimodal framework that systematically extends the LLaVA architecture by incorporating comparative benchmarks, sparse MoE techniques, and retrieval-augmented generation.
- It utilizes a strict two-stage training protocol—vision-language alignment followed by instruction tuning—to standardize comparisons across various LLMs and visual backbones.
- The approach robustly improves performance in tasks like VQA and medical NLE by leveraging MoE-tuning and KG-RAG modules, ensuring reproducible and scalable advances.
LLaVA-MORE denotes several interconnected advances in multimodal LLMs (MLLMs), centered on extending and systematizing the LLaVA architecture under three principal research thrusts: (1) comparative benchmarking of vision and language backbones, (2) deployment of mixture-of-experts (MoE) strategies for scaling multimodal models efficiently, and (3) domain-specific knowledge augmentation, exemplified in medical natural language explanation tasks. Models and analysis under the LLaVA-MORE umbrella provide a reproducible framework for rigorous ablation of LLM and vision encoder choices, new sparse MoE techniques for scalable inference, and pluggable retrieval-augmented generation (RAG) modules for precision in high-stakes applications.
1. Core LLaVA-MORE Architecture and Training Protocols
The architectural foundation of LLaVA-MORE preserves the canonical LLaVA pipeline: a frozen vision backbone (usually ViT-based, e.g., CLIP ViT-L/14, SigLIP, DINOv2), a “vision-to-language” adapter (typically a 2-layer MLP or a lightweight transformer), and a decoder LLM (Gemma-2, Phi-4, LLaMA-3.1, DeepSeek-R1-Distill-LLaMA, etc.) (Cocchi et al., 19 Mar 2025). Visual patch tokens are extracted, projected to the LLM embedding space via the adapter, and prepended to the tokenized prompt before Transformer decoding. Model variants differ primarily in backbone, adapter, and LLM choice.
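A minimal sketch of this adapter-and-prepend pattern follows; module names and dimensions are illustrative assumptions, not the released LLaVA-MORE code.

```python
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    """2-layer MLP projecting frozen vision-encoder patch tokens into the LLM embedding space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, vision_dim) from the frozen ViT backbone
        return self.proj(patch_tokens)

def build_multimodal_inputs(adapter: VisionAdapter,
                            patch_tokens: torch.Tensor,
                            text_embeds: torch.Tensor) -> torch.Tensor:
    """Prepend projected visual tokens to the embedded text prompt before decoding."""
    visual_embeds = adapter(patch_tokens)                  # (B, P, llm_dim)
    return torch.cat([visual_embeds, text_embeds], dim=1)  # fed to the decoder LLM
```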
Training proceeds in two strict stages:
- Stage 1 (Vision-Language Alignment): The adapter is trained (LLM frozen) on large-scale image-caption pairs so visual tokens map into the LLM’s embedding space.
- Stage 2 (Instruction Tuning): Both the adapter and (typically) the LLM are jointly optimized on GPT-4-annotated visual instruction data to ensure multimodal instruction following.
Loss is the standard auto-regressive cross-entropy over target tokens,

$$\mathcal{L} = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t}, X_v, X_{\text{instr}}\right),$$

where $X_v$ denotes the projected visual tokens and $X_{\text{instr}}$ the instruction tokens.
This strict protocol ensures comparability across backbones and LLMs, with every hyperparameter and dataset selection consistently controlled (Cocchi et al., 19 Mar 2025).
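As a rough illustration of this freezing schedule, the sketch below (hypothetical helper names) configures which components receive gradients in each stage.

```python
def set_trainable(module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(stage: int, vision_encoder, adapter, llm) -> None:
    set_trainable(vision_encoder, False)   # vision backbone stays frozen throughout
    if stage == 1:                         # Stage 1: vision-language alignment
        set_trainable(adapter, True)       # only the adapter is trained
        set_trainable(llm, False)
    elif stage == 2:                       # Stage 2: instruction tuning
        set_trainable(adapter, True)       # adapter and LLM jointly optimized
        set_trainable(llm, True)
```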
2. Systematic Backbone and LLM Comparison
LLaVA-MORE rigorously assesses the effect of LLM size, architecture, and vision pretraining strategies. The comparative evaluation includes five LLMs (Gemma-2 2B/9B, Phi-4 3.8B, LLaMA-3.1 8B, DeepSeek-R1-Distill-LLaMA 8B) and visual encoders (CLIP, DINOv2, SigLIP, SigLIP2), with input resolutions and patch token counts explicitly enumerated (Cocchi et al., 19 Mar 2025).
Unified benchmarking spans VQA (GQA, ScienceQA, TextVQA, AI2D) and instruction-following/multimodal-understanding suites (POPE, MME, MMBench, SEED-Bench, MMMU), with all models trained under identical recipes. Results indicate that:
- Small-scale LLMs (e.g., Phi-4-3.8B) can match or exceed LLaVA-1.5-7B.
- Contrastive visual backbones (SigLIP2, CLIP) consistently outperform self-supervised ones (DINOv2).
- SigLIP2 achieves ∼1 point gain over CLIP for GQA, ScienceQA, and AI2D (with higher input resolution/patch count).
A task-dependent pattern emerges: larger LLMs exhibit more robustness to alignment data and backbone type, while smaller LLMs benefit most from increased visual resolution and multi-scale processing.
3. Mixture-of-Experts Extensions: Sparse Scaling and MoE-Tuning
MoE-LLaVA (“LLaVA-MORE” in the MoE literature) introduces a sparse computation paradigm enabling efficient scaling of LVLMs (Lin et al., 29 Jan 2024). The key innovation is the interleaving of MoE layers, each comprising $E$ independent FFN experts and a learned router, with multi-head self-attention (MSA) layers. At inference, only the top-$k$ experts per token are activated, yielding the following computational profile:
- Parameters: grow roughly $E\times$ in per-layer FFN parameter count, but
- Compute (FLOPs): scales only with the $k$ activated experts (i.e., nearly constant compared to dense models of the same active width); see the short calculation below.
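The following back-of-the-envelope calculation, using illustrative layer sizes rather than the paper's exact configuration, makes this scaling behavior concrete.

```python
# With E experts but top-k routing, FFN parameters grow ~E x while per-token
# FFN compute grows only ~k x relative to a single dense FFN.
d_model, d_ff, E, k = 2560, 6912, 4, 2            # illustrative sizes
dense_ffn_params = 2 * d_model * d_ff             # up- and down-projection weights
moe_ffn_params   = E * dense_ffn_params           # all experts must be stored
active_ffn_flops = k * 2 * dense_ffn_params       # ~2 FLOPs per weight, top-k experts only
print(moe_ffn_params / dense_ffn_params)          # 4.0 -> parameters scale with E
print(active_ffn_flops / (2 * dense_ffn_params))  # 2.0 -> compute scales with k
```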
MoE-Tuning proceeds in three stages:
- MLP warm-up: train only the visual projector.
- Dense instruction tuning: finetune the full model on multimodal instruction data.
- Sparse MoE-tuning: replace each FFN with $E$ experts initialized as copies of it, freeze all but the new MoE router and expert layers, then sparsity-train only these.
The router assigns tokens to experts and is regularized with both soft and hard assignment statistics, enforcing balanced load:

$$\mathcal{L} = \mathcal{L}_{\text{CE}} + \alpha\,\mathcal{L}_{\text{aux}}, \qquad \mathcal{L}_{\text{aux}} = E \sum_{i=1}^{E} f_i\, P_i,$$

where $f_i$ is the fraction of token assignments routed to expert $i$, $P_i$ is the mean routing probability assigned to expert $i$, and $\alpha$ regulates per-expert token assignment uniformity (Lin et al., 29 Jan 2024).
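A simplified sparse MoE FFN with top-$k$ routing and the auxiliary balancing term above can be sketched as follows; this is an illustrative re-implementation under stated assumptions, not the MoE-LLaVA code, and it omits capacity limits and expert parallelism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFFN(nn.Module):
    def __init__(self, d_model: int = 1024, d_ff: int = 4096,
                 num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor):                # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)      # soft routing distribution per token
        topk_p, topk_i = probs.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                 # run only the selected experts
            for e, expert in enumerate(self.experts):
                mask = topk_i[:, slot] == e
                if mask.any():
                    out[mask] += topk_p[mask, slot, None] * expert(x[mask])
        # auxiliary load-balancing loss: L_aux = E * sum_i f_i * P_i
        E = len(self.experts)
        f = torch.zeros(E, device=x.device)
        f.scatter_add_(0, topk_i.reshape(-1),
                       torch.ones(topk_i.numel(), device=x.device))
        f = f / topk_i.numel()                         # hard assignment fraction per expert
        P = probs.mean(dim=0)                          # mean soft routing probability per expert
        aux_loss = E * torch.sum(f * P)
        return out, aux_loss
```

During training, `aux_loss` is added to the auto-regressive cross-entropy with weight $\alpha$, matching the objective above.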
Empirically, MoE-LLaVA matches or exceeds dense models (LLaVA-1.5-7B/13B) on VQA-v2, GQA, ScienceQA, and POPE hallucination, while activating only 2.2–3.6B parameters per sample. Notably, POPE hallucination F1 improves ∼1–2 points over much larger dense models. This demonstrates that, with careful tuning, LVLMs can scale to “outrageous” parameter counts without increasing real inference cost, with routers learning meaningful cross-modal expert specializations.
4. Knowledge-Augmented LLaVA-MORE: KG-RAG for Medical NLE
In knowledge-critical domains, LLaVA-MORE admits plug-and-play retrieval-augmented generation. The KG-LLaVA system integrates a cross-modal knowledge graph (KG) retrieval protocol and injects structured, domain-specific evidence as context (Hamza et al., 7 Oct 2024).
- The vision backbone (CLIP ViT-L/14, MedCLIP, or Bio-ViT-L) and projector generate embeddings for a medical image (e.g., chest X-ray).
- A knowledge graph of (finding, relation, disease) triplets is constructed from MIMIC-CXR reports using RadGraph, embedded with MedCLIP’s text encoder and indexed via FAISS.
- At inference, the image embedding retrieves the top-$k$ KG triplets by cosine similarity in CLIP/MedCLIP space.
- These evidence triplets are serialized into plain text and prepended as a prompt to the LLM decoder (LLaVA-LLM or GPT-2), which is LoRA-adapted for efficiency; the retrieval-and-prompting step is sketched below.
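A condensed sketch of this retrieval-and-prompting step is shown below; the FAISS usage is standard, while variable names and the prompt template are illustrative assumptions (the actual system embeds RadGraph-extracted triplets with MedCLIP's text encoder, per Hamza et al., 7 Oct 2024).

```python
import faiss
import numpy as np

def build_kg_index(triplet_embeddings: np.ndarray) -> faiss.Index:
    """Index (finding, relation, disease) triplet embeddings for cosine-similarity search."""
    emb = np.ascontiguousarray(triplet_embeddings, dtype=np.float32)
    faiss.normalize_L2(emb)                      # cosine similarity via inner product
    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(emb)
    return index

def retrieve_and_prompt(index: faiss.Index, triplets, image_embedding: np.ndarray,
                        question: str, k: int = 5) -> str:
    """Retrieve the top-k triplets for an image embedding and prepend them to the prompt."""
    query = np.ascontiguousarray(image_embedding, dtype=np.float32).reshape(1, -1)
    faiss.normalize_L2(query)
    _, ids = index.search(query, k)              # top-k triplets by cosine similarity
    # triplets: list of (finding, relation, disease) string tuples aligned with the index
    evidence = "; ".join(" ".join(triplets[i]) for i in ids[0])
    return f"Relevant findings: {evidence}\nQuestion: {question}\nAnswer:"
```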
Across the main instantiations (KG-LLaVA, Med-XPT, Bio-LLaVA), the approach produces:
- Enhanced diagnostic and explanation performance: AUC 83.0 and CIDEr 62.2 (KG-LLaVA), a near-tripling over baseline models.
- Maintained privacy: only abstracted KG facts, not patient/image data, are indexed or retrieved.
This methodology robustly outperforms both uni-modal and vanilla RAG baselines, substantiating the value of knowledge graph augmentation in visual-language scientific explanation (Hamza et al., 7 Oct 2024).
5. Quantitative Results and Benchmark Comparisons
Controlled benchmarking in LLaVA-MORE yields several robust findings across model scale and design (Cocchi et al., 19 Mar 2025, Lin et al., 29 Jan 2024, Hamza et al., 7 Oct 2024):
| Configuration | GQA | ScienceQA | POPE (F1) | CIDEr (MIMIC-NLE) |
|---|---|---|---|---|
| LLaVA-1.5-7B (CLIP) | 62.4 | 69.0 | 85.6 | – |
| Phi-4-3.8B (CLIP) | 62.1 | 71.3 | 85.9 | – |
| Gemma-2-9B (SigLIP2) | 63.4 | 71.8 | 86.5 | – |
| MoE-LLaVA-2.7B×4-Top2 | 62.6 | 43.7 | 88.5 | – |
| KG-LLaVA (medical) | – | – | – | 62.2 |
| Bio-LLaVA (medical) | – | – | – | 46.7 |
| Med-XPT (medical) | – | – | – | 62.7 |
*All numbers as reported in their respective sources; “–” indicates not measured for that entry.
Notably, in multimodal NLE generation (MIMIC-NLE), KG-LLaVA achieves 83.0 AUC and 62.2 CIDEr, up from 66.4/37.9 (RATCHET baseline) and 2.4/17.4 (DPT), nearly tripling explanation quality in this medical setting (Hamza et al., 7 Oct 2024).
6. Design Insights and Future Directions
The LLaVA-MORE body of work establishes several design principles for the next generation of multimodal foundation models:
- MoE sparsification (with a multi-stage schedule) enables parameter scaling at constant FLOPs, but requires careful alignment and load balancing to avoid performance collapse (Lin et al., 29 Jan 2024).
- Visual backbone selection (contrastive vs. self-supervised, resolution, patch count) and data curation (alignment corpus, instruction quality) significantly impact performance, mainly for smaller LLMs (Cocchi et al., 19 Mar 2025). No single configuration is optimal across all tasks.
- Domain knowledge augmentation via KG-RAG protocols yields large, measurable gains in explainability and factual correctness for knowledge-intensive tasks, and can be realized in a privacy-preserving, reusable manner (Hamza et al., 7 Oct 2024).
- Explicit separation of vision-language alignment from instruction tuning, and rigorous control over training protocol and data, are necessary for apples-to-apples benchmarking.
- Reproducibility and systematic evaluation are central; the full LLaVA-MORE codebase and evaluation pipelines are open-source for the community.
A plausible implication is that future LLaVA-MORE models will span increasingly diverse modalities (video, audio), benefit from expanded expert mixtures, and see more extensive integration of retrieval and KG protocols for scientific reasoning and explanation.
References
- "LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning" (Cocchi et al., 19 Mar 2025)
- "MoE-LLaVA: Mixture of Experts for Large Vision-LLMs" (Lin et al., 29 Jan 2024)
- "LLaVA Needs More Knowledge: Retrieval Augmented Natural Language Generation with Knowledge Graph for Explaining Thoracic Pathologies" (Hamza et al., 7 Oct 2024)