Visually-Grounded Language Models (VLMs)
- Visually-Grounded Language Models are architectures that integrate visual encoders, fusion modules, and language models to process images and text concurrently.
- They leverage modular designs and large-scale, multimodal datasets to support tasks like retrieval, captioning, and question answering.
- Despite their strengths in multimodal applications, VLMs face challenges in fine-grained spatial reasoning, visual grounding reliability, and image classification tasks.
Visually-Grounded Language Models (VLMs) are a class of machine learning architectures that jointly process visual and linguistic information, enabling models to reason about and generate language conditioned on visual context. These models have driven significant advances in multimodal AI, supporting applications in retrieval, captioning, question answering, referential dialogue, robotics, scientific reasoning, and more. The field has evolved rapidly, consolidating approaches from computer vision, natural language processing, and cognitive science into large-scale multimodal systems with tangible real-world impact.
1. Core Architecture and Taxonomy
VLMs consist of three principal components: a visual encoder, an image–text fusion mechanism, and a language decoder (or encoder–decoder) (Ghosh et al., 2024). The visual encoder transforms images (or videos) into dense embeddings, commonly using Vision Transformers (ViT) or ResNet variants. Fusion modules, such as MLPs, Q-Formers, or cross-attention layers, aggregate and align visual and linguistic representations. The language backbone (LLM) consumes the fused embeddings to generate text or, in recent architectures, multimodal outputs.
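The three-stage flow described above can be sketched in a few lines. This is a toy illustration only: the function names (`visual_encoder`, `mlp_projector`, `build_llm_input`), the random weights, and all dimensions are hypothetical stand-ins, and a real system would use a pretrained ViT, a trained fusion module, and an actual LLM.

```python
import random

def linear(x, weight):
    """Multiply vector x (length n_in) by weight (n_in rows, n_out columns)."""
    n_out = len(weight[0])
    return [sum(xi * row[j] for xi, row in zip(x, weight)) for j in range(n_out)]

def make_weight(n_in, n_out, rng):
    """Random weights; a placeholder for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]

def visual_encoder(patches, weight):
    """Toy stand-in for a ViT/ResNet: one dense embedding per image patch."""
    return [linear(p, weight) for p in patches]

def mlp_projector(visual_tokens, weight):
    """Toy fusion module: map visual embeddings into the LLM embedding space."""
    return [[max(0.0, v) for v in linear(t, weight)] for t in visual_tokens]

def build_llm_input(visual_tokens, text_tokens):
    """Concatenate projected visual tokens with text token embeddings."""
    return visual_tokens + text_tokens

rng = random.Random(0)
patch_dim, vis_dim, llm_dim = 6, 4, 8          # assumed toy sizes
patches = [[rng.random() for _ in range(patch_dim)] for _ in range(3)]
text_tokens = [[rng.random() for _ in range(llm_dim)] for _ in range(5)]

w_enc = make_weight(patch_dim, vis_dim, rng)
w_proj = make_weight(vis_dim, llm_dim, rng)

visual_tokens = mlp_projector(visual_encoder(patches, w_enc), w_proj)
sequence = build_llm_input(visual_tokens, text_tokens)
print(len(sequence), len(sequence[0]))  # 8 8
```

The language backbone would then consume `sequence` as its input embeddings. Q-Former or cross-attention fusion would replace `mlp_projector` with a module that attends over the visual tokens rather than projecting each one independently.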
Recent VLMs are categorized by their input–output modality (Ghosh et al., 2024):
- Vision-Language Understanding (VLU): Dual-encoders trained with contrastive or alignment losses for retrieval and classification (e.g., CLIP, GLIP, VLMo, ImageBind).
- Text Generation with Multimodal Input: Models producing text (answers, captions, dialogue) from image or video input; they interleave a vision encoder with an LLM via fusion modules (e.g., Flamingo, BLIP-2, LLaVA, GPT-4V).
- Multimodal Input/Output ("Any-to-Any" Models): Architectures that both consume and generate arbitrary modalities, sequencing through latent representations and decoding with task-specific diffusion models or transformers (e.g., CoDi, NExT-GPT, Gemini).
These architectures are typically trained on large-scale image–text (or video–text) corpora with objectives such as contrastive InfoNCE loss, cross-entropy for next-token prediction, and auxiliary image–text matching (Ghosh et al., 2024).
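As an illustration of the contrastive objective mentioned above, the following is a minimal sketch of a symmetric InfoNCE loss over a batch of paired image and text embeddings. The temperature value of 0.07 and the toy 2-dimensional embeddings are assumptions chosen for the example, not values taken from the cited survey.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: pair k of images and texts is the positive."""
    n = len(image_embs)
    logits = [[cosine(i, t) / temperature for t in text_embs] for i in image_embs]
    # Image-to-text direction: each row should peak on the diagonal.
    i2t = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    # Text-to-image direction: each column should peak on the diagonal.
    t2i = sum(cross_entropy_row([logits[r][k] for r in range(n)], k)
              for k in range(n)) / n
    return 0.5 * (i2t + t2i)

images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[0.9, 0.1], [0.1, 0.9]]
texts_shuffled = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce(images, texts_aligned)
loss_shuffled = info_nce(images, texts_shuffled)
print(loss_aligned < loss_shuffled)  # True
```

The loss is low when matching pairs are the most similar entries in their row and column, which is the alignment property that dual-encoder retrieval relies on.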
2. Foundational Capabilities and Limitations
VLMs excel in zero-shot and few-shot image–text retrieval, flexible captioning, and open-domain question answering (Ghosh et al., 2024, Chrupała, 2021). Large-scale pretraining yields emergent cross-modal reasoning that extends beyond English and beyond text-only input, with models such as GPT-4V (Atuhurra et al., 2024) and InstructBLIP achieving state-of-the-art performance on benchmarks like ScienceQA and Flickr30K (Ghosh et al., 2024).
However, VLMs continue to struggle with:
- Fine-Grained Spatial and Compositional Reasoning:
  - Models underperform on compositional VQA such as WinogroundVQA, achieving group accuracies well below chance (e.g., BLIP: 1.5%) (Pandey, 2023).
  - Spatial relation understanding remains deficient; pipeline methods that explicitly decouple noun-phrase grounding from relation inference achieve significantly higher accuracy (top-1 54.04%) compared to holistic ITM baselines (Rajabi et al., 2023).
  - Semantic composition in VLMs is poorly developed compared to text-only LLMs; efforts such as Syntactic Neural Module Distillation and cross-modal attention congruence regularization (CACR) have improved but not closed this gap (Pandey, 2023).
- Grounding Reliability:
  - Visual grounding remains shallow in video-based VQA, with Acc@GQA as low as 16% despite high answer accuracy (up to ∼70%) (Xiao et al., 2023). Models often succeed using linguistic shortcuts rather than visual evidence.
  - Logo recognition reveals strong susceptibility to "semantic entanglement" in the visual projector, resulting in spurious production of brand names for textless logos and highly persistent hallucination even under input perturbations (Li et al., 14 Oct 2025).
  - Zero-shot, closed-source VLMs struggle with reliable evidence localization on domain documents (e.g., 1.2% Strict Safety on clinical referrals for Gemini 2.5 Flash) until fine-tuned for grounding (Abioye et al., 25 May 2026).
- Image Classification Deficiency:
  - Despite sharing vision backbones, VLMs underperform CLIP and EVA-G by significant margins on standard benchmarks (e.g., GPT-4V ImageNet top-1: 60.6% vs. CLIP-L: 74.8%) (Zhang et al., 2024). The primary factor is the lack of class exposure during multimodal data curation and instruction tuning; fine-tuning on classification data restores performance to parity.
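The gap between answer accuracy and grounded accuracy in video QA can be made concrete with a small sketch. It assumes that a prediction counts as grounded when the predicted temporal segment has an intersection-over-prediction (IoP) of at least 0.5 with the annotated segment, and that Acc@GQA requires a correct answer and a grounded segment together. The record fields and the toy data are hypothetical.

```python
def intersection_over_prediction(pred, gold):
    """IoP for temporal segments given as (start, end) in seconds."""
    overlap = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    length = pred[1] - pred[0]
    return overlap / length if length > 0 else 0.0

def qa_and_grounded_accuracy(records, threshold=0.5):
    """Return (Acc@QA, Acc@GQA) under the assumptions stated above."""
    n = len(records)
    correct = sum(r["answer_correct"] for r in records)
    grounded = sum(
        r["answer_correct"]
        and intersection_over_prediction(r["pred_segment"], r["gold_segment"]) >= threshold
        for r in records
    )
    return correct / n, grounded / n

# Hypothetical predictions: three correct answers, only one of them grounded.
records = [
    {"answer_correct": True,  "pred_segment": (0, 10),  "gold_segment": (2, 8)},
    {"answer_correct": True,  "pred_segment": (0, 20),  "gold_segment": (15, 18)},
    {"answer_correct": True,  "pred_segment": (30, 40), "gold_segment": (0, 5)},
    {"answer_correct": False, "pred_segment": (1, 3),   "gold_segment": (1, 3)},
]

acc_qa, acc_gqa = qa_and_grounded_accuracy(records)
print(acc_qa, acc_gqa)  # 0.75 0.25
```

A model that answers from linguistic priors shows exactly this pattern: many correct answers whose predicted evidence does not overlap the annotated segment.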
3. Training Regimes, Datasets, and Supervision
VLM performance is deeply linked to the diversity, scale, and balance of multimodal datasets (Ghosh et al., 2024, Chrupała, 2021, Zhang et al., 2024). The field is characterized by several large-scale benchmarks:
- Image–Text: COCO, Flickr30K, ImageNet, Visual Genome, PlantVillage, LogoDet-3K, among others (Chrupała, 2021, Mahmood et al., 9 Apr 2026, Li et al., 14 Oct 2025).
- Video–Text: YouCook2, NExT-GQA, HowTo100M, Video-R1-260K (Xiao et al., 2023, Zhang et al., 6 Apr 2026).
- Speech–Image: Places Audio, SpokenCOCO, Synthetically Spoken COCO/STAIR (Chrupała, 2021, Higy et al., 2020).
- Multilingual and Domain-Specific: Recent work has expanded VLM evaluation to Japanese, Swahili, Urdu, and specialized domains such as medicine (RAPTOR+) and agriculture (AgriChain) (Atuhurra et al., 2024, Abioye et al., 25 May 2026, Mahmood et al., 9 Apr 2026).
Supervision is heterogeneous:
- Contrastive/Retrieval Losses: Standard for aligning modalities (CLIP, BLIP, AVLNet).
- Autoregressive, Cross-Entropy Losses: Dominant in sequence generation tasks (captioning, VQA, referential dialogue) (Ghosh et al., 2024, Willemsen et al., 2023).
- Auxiliary Supervision: Additional objectives such as CACR (Pandey, 2023), cross-modal self-supervised grounding for video (Xiao et al., 2023), and fine-tuning on expert-verified chain-of-thought for interpretability (Mahmood et al., 9 Apr 2026).
- Multimodal RL: RL-based post-training with Group Relative Policy Optimization (GRPO), which is crucial for video understanding when paired with data curation that filters for visually grounded examples (Zhang et al., 6 Apr 2026).
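To illustrate the core idea behind Group Relative Policy Optimization, the sketch below computes group-relative advantages: several responses are sampled for one prompt, and each reward is normalized by the mean and standard deviation of its group. The use of the population standard deviation, the `eps` constant, and the binary rewards are assumptions for this example; the full algorithm also involves a clipped policy-ratio objective that is not shown here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four sampled answers to one video question
# (1.0 = correct, 0.0 = incorrect).
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print([round(a, 3) for a in mixed])
print(uniform)  # [0.0, 0.0, 0.0, 0.0]
```

When every sample in a group receives the same reward, the advantages are all zero and the prompt contributes no learning signal, so the composition of the training data affects how much of each batch is informative.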
4. Grounding, Interpretability, and Reliability
Grounding—the explicit linkage between outputs and perceptual evidence—is central to VLM trustworthiness across scientific, clinical, and decision-critical settings.
- Explicit Visual Grounding: Architectures such as A3VLM (robotics) and compositional ranking pipelines for spatial reasoning (grounding module + relation module) demonstrate that modular, object-centric or two-stage approaches yield higher performance and interpretability (Huang et al., 2024, Rajabi et al., 2023).
- Grounding Evaluation: Extraction of bounding boxes or rationales and human audits are now routine. RAPTOR+ uses Strict Safety (value correctness + IoU≥0.5) to ensure output localizes verifiable evidence (Qwen3-VL-8B improves from 6.2% to 60.6% Strict Safety after fine-tuning) (Abioye et al., 25 May 2026).
- Reasoning Transparency: VLMs can produce self-critique scores (AgriChain), translation and cultural rationales in multilingual captioning, and chain-of-thought rationales aligned with expert annotation (Atuhurra et al., 2024, Mahmood et al., 9 Apr 2026). Supervising on such rationales significantly increases both accuracy and human-trust alignment (e.g., CoT-supervised AgriChain-VL3B achieves 73.1% accuracy, +8.1pp over label-only fine-tuning) (Mahmood et al., 9 Apr 2026).
- Failure Modes: Attention analyses and experiments on logo hallucination (Li et al., 14 Oct 2025), video QA grounding (Xiao et al., 2023), and clinical extraction (Abioye et al., 25 May 2026) have pinpointed embedding subspaces and projectors as loci of spurious grounding. Fine-grained ablations, projector disentanglement, and OCR-guided decoding yield measurable mitigation.
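A metric of the Strict Safety kind described above, which requires both a correct value and a localized box with IoU of at least 0.5, can be sketched as follows. Treating value correctness as exact string match is an assumption made here, as are the field names and the toy data; the RAPTOR+ evaluation may define these details differently.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def reading_accuracy(preds, golds):
    """Fraction of fields whose extracted value matches (exact match assumed)."""
    return sum(p["value"] == g["value"] for p, g in zip(preds, golds)) / len(golds)

def strict_safety(preds, golds, iou_threshold=0.5):
    """Fraction of fields with a correct value AND a sufficiently overlapping box."""
    hits = sum(
        p["value"] == g["value"] and box_iou(p["box"], g["box"]) >= iou_threshold
        for p, g in zip(preds, golds)
    )
    return hits / len(golds)

# Hypothetical form fields: the second value is right but poorly localized,
# the third is well localized but wrong.
golds = [
    {"value": "2024-01-05", "box": (0, 0, 10, 10)},
    {"value": "urgent",     "box": (5, 5, 15, 15)},
    {"value": "left lung",  "box": (20, 20, 30, 30)},
]
preds = [
    {"value": "2024-01-05", "box": (0, 0, 10, 10)},
    {"value": "urgent",     "box": (0, 0, 10, 10)},
    {"value": "right lung", "box": (20, 20, 30, 30)},
]

reading = reading_accuracy(preds, golds)
strict = strict_safety(preds, golds)
print(f"{reading:.3f} {strict:.3f}")  # 0.667 0.333
```

Because the strict metric is a conjunction, it can never exceed reading accuracy, which matches the pattern of high reading scores alongside much lower Strict Safety.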
5. Multilingual, Conversational, and Domain-Specific VLMs
VLMs are rapidly extending beyond English and general domains:
- Multilingual Evaluation: New datasets test four-language capabilities (English, Japanese, Swahili, Urdu) with up to 94.8% accuracy in vision-only tasks, though performance dips in translation-challenged domains (e.g., Swahili, 83.6%) (Atuhurra et al., 2024). Rationales facilitate the identification of model biases and translation errors.
- Visually Grounded Dialogue: Pipelines that pair a fine-tuned causal LM for referent description with zero-shot VLM grounding improve reference resolution in visually grounded dialogue by up to +0.13 accuracy over mention-only baselines (Willemsen et al., 2023).
- Specialized Domains: RAPTOR+ demonstrates that task-specialized fine-tuning enables extraction and strict evidence-localization on clinical documents (e.g., 96.1% Reading Accuracy, 60.6% Strict Safety) (Abioye et al., 25 May 2026). AgriChain shows expert-verified reasoning supervision enables accurate, calibrated, and interpretable disease diagnosis at scale in agricultural images (Mahmood et al., 9 Apr 2026).
6. Best Practices, Open Challenges, and Future Directions
The consensus emerging from recent surveys (Ghosh et al., 2024, Chrupała, 2021) and in-depth studies is:
- Data Quality and Curation: Filtering for visual grounding (VidGround) yields disproportionate gains relative to more complex post-training algorithms (Zhang et al., 6 Apr 2026). High-quality, class-balanced, and multilingual resources are essential for robust performance (Zhang et al., 2024).
- Hybrid and Modular Approaches: Explicit grounding modules, LLM routers for tool selection, and integration of external components such as OCR demonstrably improve cost, performance, and reliability over monolithic, black-box pipelines (Cooper et al., 2024, Li et al., 14 Oct 2025).
- Interpretability and Auditing: Architectures and loss functions that allow for extraction of attention maps, rationales, or bounding boxes support rigorous human evaluation and debugging. Model outputs must be auditable and align with human expert reasoning for deployment in safety-critical workflows (Abioye et al., 25 May 2026, Mahmood et al., 9 Apr 2026).
- Continual and Multitask Learning: Lifelong learning, unlearning, parameter-efficient tuning (e.g., LoRA), and plug-and-play expert modules are key research frontiers for scaling VLMs to new domains and modalities (Ghosh et al., 2024).
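As a concrete view of the parameter-efficient tuning mentioned above, the sketch below shows the low-rank update used by LoRA: a frozen weight matrix W is adapted as W + (alpha / r) · B · A, where A and B are small trainable matrices of rank r. The matrix sizes, the rank, and the `alpha` value are assumptions for this example, and plain Python lists stand in for tensors.

```python
import random

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A, with r the adapter rank."""
    r = len(a)                      # A has shape (r, d_in)
    delta = matmul(b, a)            # B has shape (d_out, r)
    scale = alpha / r
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

rng = random.Random(0)
d_out, d_in, rank = 8, 8, 2         # assumed toy sizes

w = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen
a = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(rank)]     # trainable
b = [[0.0 for _ in range(rank)] for _ in range(d_out)]                   # trainable, zero init

merged = lora_weight(w, a, b, alpha=4.0)
full_params = d_out * d_in
lora_params = rank * (d_in + d_out)
print(merged == w, full_params, lora_params)  # True 64 32
```

With B initialized to zero the adapted layer starts out identical to the frozen one, and only `lora_params` values are trained. At realistic layer sizes the reduction is far larger than in this toy case.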
Outstanding open problems include compositional semantics mirroring human language in VLMs (Pandey, 2023), deep spatial and temporal visual reasoning (Rajabi et al., 2023, Xiao et al., 2023), low-resource language scalability (Atuhurra et al., 2024), differential grounding for explainability (Li et al., 4 Mar 2026), and robust symbolic vision–language alignment in the face of adversarial or ambiguous input (Li et al., 14 Oct 2025). The integration of causal reasoning, counterfactuals, and embodied perception remains in its infancy.
7. Representative Performance and Key Benchmarks
| Task/Domain | Top Models | Zero-Shot/FT Accuracy (Top-1/F1/Other) | Notable Findings |
|---|---|---|---|
| ScienceQA-IMG | GPT-4V, InstructBLIP | 85%, 63.1% (zero-shot) | PaLM-E, LLaVA: ~70–80% |
| ImageNet Classification | CLIP-L, LLaVA1.5 | 74.8%, 22.8% (zero-shot LLaVA); 84.4% (FT) | Data coverage is primary bottleneck (Zhang et al., 2024) |
| Plant Disease Diagnosis | AgriChain-VL3B | 73.1% (test); macro-F1 0.466 | +17.3pp over Gemini 2.5 Pro |
| Clinical Form Extraction | RAPTOR+ | Reading 96.1%, Strict Safety 60.6% | Zero-shot VLMs: <8% grounding, require FT |
| VideoQA (NExT-GQA) | FrozenBiLM + NG+ | Acc@QA 70.8%, Acc@GQA 17.5% | Grounding lags answer accuracy by ~53 pp |
| Visual Spatial Reasoning | MLP+Rerank Pipeline | Top-1 54.04% vs. MDETR-GQA 45.63% | Explicit grounding gains >8pp |
Further, multilingual VLMs evaluated on four-language tasks achieve up to 94.8% object recognition accuracy (GPT-4V/English), with rationales providing interpretability and facilitating cross-lingual auditing (Atuhurra et al., 2024).
References
- (Chrupała, 2021) "Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques"
- (Pandey, 2023) "Semantic Composition in Visually Grounded Language Models"
- (Rajabi et al., 2023) "Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models"
- (Xiao et al., 2023) "Can I Trust Your Answer? Visually Grounded Video Question Answering"
- (Willemsen et al., 2023) "Resolving References in Visually-Grounded Dialogue via Text Generation"
- (Ghosh et al., 2024) "Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions"
- (Zhang et al., 2024) "Why are Visually-Grounded Language Models Bad at Image Classification?"
- (Atuhurra et al., 2024) "Constructing Multilingual Visual-Text Datasets Revealing Visual Multilingual Ability of Vision Language Models"
- (Huang et al., 2024) "A3VLM: Actionable Articulation-Aware Vision Language Model"
- (Li et al., 14 Oct 2025) "Vision LLMs Map Logos to Text via Semantic Entanglement in the Visual Projector"
- (Eppel, 8 Jan 2026) "Coding the Visual World: From Image to Simulation Using Vision LLMs"
- (Li et al., 4 Mar 2026) "DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-LLMs"
- (Zhang et al., 6 Apr 2026) "Watch Before You Answer: Learning from Visually Grounded Post-Training"
- (Mahmood et al., 9 Apr 2026) "AgriChain Visually Grounded Expert Verified Reasoning for Interpretable Agricultural Vision LLMs"
- (Abioye et al., 25 May 2026) "RAPTOR+: A Visually Grounded Vision-Language Framework to Improve Clinical Trust and Auditability in Automated Cancer Referral Processing"
- (Cooper et al., 2024) "Rethinking VLMs and LLMs for Image Classification"
- (Higy et al., 2020) "Textual Supervision for Visually Grounded Spoken Language Understanding"
VLMs have established themselves as indispensable for multimodal machine reasoning, yet substantial challenges in compositionality, grounding, generalization, and explainability persist. The next generation of research is focused on modularity, interpretability, domain adaptation, and robust evaluation to fulfill the promise of deeply human-like multimodal intelligence.