GLaMM: Grounding Large Multimodal Models
- The paper introduces a unified architecture that explicitly grounds linguistic expressions to specific image regions using specialized tokens and mask decoders.
- GLaMM integrates a global image encoder, a region encoder, and a frozen SAM module to achieve accurate pixel-level segmentation and strong region-localization performance.
- The system demonstrates improved caption quality and segmentation performance, setting new benchmarks on grounded vision-language tasks.
Grounding Large Multimodal Model (GLaMM) refers to a class of large-scale neural architectures that unify language and vision, equipped with the explicit capability to ground linguistic expressions to pixel- or region-level visual evidence within images (and, by extension, other modalities). In contrast to prior LMMs, which produced free-form language without localization, GLaMM systems not only generate text but also output structured visual grounding—delineating exactly where referenced objects, phrases, or regions are found in the input. This paradigm shift enables finely grounded visual dialogue, precise referring expression segmentation, region-level captioning, and supports a range of downstream applications where spatially resolved understanding is critical.
1. GLaMM Architecture and Grounding Mechanisms
GLaMM introduces a unified, modular pipeline for vision-language grounding (Rasheed et al., 2023). The architecture integrates the following key components:
- Global Image Encoder: A frozen CLIP ViT-H/14 backbone extracts holistic image features, projected into the LLM’s embedding space.
- Region Encoder: Using feature pyramids from intermediate CLIP layers, RoIAlign, and a 2-layer MLP, it enables explicit encoding of user-specified or inferred bounding boxes.
- LLM: An autoregressive LLM (Vicuna-7B) receives projected visual tokens (global or regional) and text tokens, generating language sequences interleaved with two special grounding tokens: <bbox> (regions) and <SEG> (segmentation masks).
- Grounding Image Encoder: A frozen Segment Anything Model (SAM) encoder extracts high-resolution pixel features.
- Pixel Decoder: Conditioned on the LLM’s final <SEG>-token embedding and SAM features, a mask decoder produces binary masks aligned to the grounded phrase.
The data flow supports: (i) holistic image prompts; (ii) injection of region-of-interest via RoIAlign and <bbox> tokens; (iii) natural language instructions interleaved with <SEG> for phrase-level segmentation. At each output step, if a <SEG> token appears, its embedding is decoded into a mask, establishing seamless correspondence between text and visual regions.
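A minimal PyTorch-style sketch of this data flow is shown below. The tiny stand-in encoders, toy transformer LLM, dot-product pixel decoder, dimensions, and the <SEG> token id are all illustrative assumptions rather than the authors' implementation; the sketch only shows how <SEG> embeddings are routed into a mask decoder.

```python
# Illustrative sketch of the GLaMM-style data flow (not the authors' code).
# All modules, dimensions, and the <SEG> token id are hypothetical placeholders.
import torch
import torch.nn as nn

class GLaMMSketch(nn.Module):
    def __init__(self, vis_dim=256, llm_dim=512, seg_token_id=32000):
        super().__init__()
        self.seg_token_id = seg_token_id
        # Stand-ins for the frozen CLIP global encoder and SAM grounding encoder.
        self.global_encoder = nn.Conv2d(3, vis_dim, kernel_size=14, stride=14)
        self.grounding_encoder = nn.Conv2d(3, 256, kernel_size=16, stride=16)
        # Vision-to-language projector and a toy transformer standing in for the LLM.
        self.projector = nn.Linear(vis_dim, llm_dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True), num_layers=2)
        self.text_embed = nn.Embedding(32001, llm_dim)
        # Pixel decoder: maps a <SEG> embedding onto SAM features to produce mask logits.
        self.seg_proj = nn.Linear(llm_dim, 256)

    def forward(self, image, token_ids):
        vis = self.global_encoder(image).flatten(2).transpose(1, 2)    # (B, N, vis_dim)
        vis_tokens = self.projector(vis)                               # (B, N, llm_dim)
        txt_tokens = self.text_embed(token_ids)                        # (B, T, llm_dim)
        hidden = self.llm(torch.cat([vis_tokens, txt_tokens], dim=1))  # (B, N+T, llm_dim)
        sam_feat = self.grounding_encoder(image)                       # (B, 256, H', W')
        masks = []
        for b in range(image.shape[0]):
            # Wherever a <SEG> token appears, decode its embedding into a mask.
            seg_pos = (token_ids[b] == self.seg_token_id).nonzero(as_tuple=True)[0]
            for p in seg_pos:
                q = self.seg_proj(hidden[b, vis_tokens.shape[1] + p])  # (256,)
                masks.append(torch.einsum("c,chw->hw", q, sam_feat[b]))
        return masks

# Smoke test with random inputs and one <SEG> token in the sequence.
model = GLaMMSketch()
img = torch.randn(1, 3, 224, 224)
ids = torch.tensor([[1, 2, 32000, 3]])
out = model(img, ids)
print(len(out), out[0].shape)
```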
2. Dataset Construction and Grounded Supervision
To furnish GLaMM with dense grounding supervision, the GranD (Grounding-anything Dataset) corpus was curated: 11M images from SA-1B, yielding 810M segmented regions covering 7.5M unique concepts (Rasheed et al., 2023). The pipeline involves:
- Ensemble object detection (CO-DETR, EVA-02, OWL-ViT, POMP) with class-agnostic NMS and attribute extraction (GRiT, GPT4RoI, depth via MiDaS).
- Scene parsing, phrase grounding (spaCy, MDETR).
- Scene-graph assembly and dense, in-context LLM-based captioning (Vicuna-13B).
- Additional LLM-based refinement to enrich annotations with contextual and factual detail.
A high-quality subset (GranD_f) includes manual and GPT-4–augmented data from Flickr30K, RefCOCOg, and PSG. Fine-tuning on this subset yields only minor gains in mask quality, suggesting the robustness of automated large-scale annotation.
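The annotation stages above can be summarized schematically as a chain of processing steps. Every function in the following sketch is a hypothetical placeholder for the actual tools named in the text (CO-DETR/EVA-02/OWL-ViT ensembles, GRiT, MiDaS, spaCy/MDETR, Vicuna-13B); it illustrates the pipeline's structure only, not its real code.

```python
# Schematic sketch of a GranD-style automated annotation pipeline.
# All functions are hypothetical placeholders returning dummy data.

def detect_objects(image):
    # Stand-in for an ensemble of detectors followed by class-agnostic NMS.
    return [{"box": (10, 20, 120, 200), "label": "person", "score": 0.92}]

def add_attributes_and_depth(image, objects):
    # Stand-in for region captioners (attributes) and monocular depth estimation.
    for obj in objects:
        obj["attributes"] = ["standing"]
        obj["depth"] = 3.4  # illustrative value
    return objects

def build_scene_graph(objects):
    # Stand-in for relation extraction and phrase grounding.
    return {"nodes": objects, "edges": []}

def dense_caption(scene_graph):
    # Stand-in for in-context LLM captioning over the scene graph.
    labels = ", ".join(n["label"] for n in scene_graph["nodes"])
    return f"An image containing: {labels}."

def annotate(image):
    objects = add_attributes_and_depth(image, detect_objects(image))
    graph = build_scene_graph(objects)
    return {"objects": objects, "graph": graph, "caption": dense_caption(graph)}

print(annotate(image=None))
```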
3. Training Paradigms, Objective Functions, and Design Choices
GLaMM optimizes a composite objective encompassing language modeling and pixel-level segmentation:

$$\mathcal{L} = \mathcal{L}_{\text{text}} + \lambda_{\text{mask}}\,\mathcal{L}_{\text{mask}},$$

where $\mathcal{L}_{\text{text}}$ is the standard cross-entropy loss for autoregressive generation and $\mathcal{L}_{\text{mask}}$ is a combination of pixel-wise binary cross-entropy and Dice losses for mask prediction.
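Under these definitions, a minimal PyTorch sketch of the composite loss might look as follows; the specific loss weights and tensor shapes are assumptions for illustration, not values taken from the paper.

```python
# Minimal sketch of the composite objective: language-modeling cross-entropy
# plus a mask loss combining per-pixel BCE and Dice. Weights are hypothetical.
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    probs = torch.sigmoid(logits).flatten(1)
    target = target.flatten(1)
    inter = (probs * target).sum(-1)
    union = probs.sum(-1) + target.sum(-1)
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def glamm_loss(text_logits, text_labels, mask_logits, mask_targets,
               w_mask=1.0, w_bce=2.0, w_dice=0.5):
    # Autoregressive language-modeling loss (labels of -100 are ignored).
    l_text = F.cross_entropy(text_logits.flatten(0, 1), text_labels.flatten(),
                             ignore_index=-100)
    # Pixel-level segmentation loss: per-pixel BCE plus Dice.
    l_mask = (w_bce * F.binary_cross_entropy_with_logits(mask_logits, mask_targets)
              + w_dice * dice_loss(mask_logits, mask_targets))
    return l_text + w_mask * l_mask

# Smoke test with random tensors.
tl = torch.randn(2, 5, 100)          # (batch, seq, vocab)
ty = torch.randint(0, 100, (2, 5))   # token targets
ml = torch.randn(2, 1, 32, 32)       # mask logits
mt = torch.randint(0, 2, (2, 1, 32, 32)).float()
print(glamm_loss(tl, ty, ml, mt))
```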
Important design insights (Kang et al., 11 Aug 2025) include:
- Prediction Format: Integer coordinates ([x₁, y₁, x₂, y₂]) normalized to [0, 100] outperform decimal or location-token schemes for box output (see the sketch after this list).
- Losses: One-hot cross entropy on box/mask tokens induces monotonic token embedding structure (ρ ≈ 0.64 vs. 0.47 under Gaussian).
- Conversation Structure: Deduplicating repeated ground-truth boxes across QA rounds and limiting each conversation to 3 rounds yield superior grounding accuracy on RefCOCO benchmarks.
- Data Mixing: Pure visual grounding (VG) instruction tuning, rather than multitask mixing with VQA, provides better spatial localization, given equal compute.
- Training Regimen: Four epochs of box/mask supervision give the best trade-off between convergence and performance.
These design principles generalize across backbones and serve as a blueprint for future GLaMM-style systems.
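As a concrete illustration of the prediction format described above, the following sketch converts an absolute-pixel box into the normalized integer form; the function names and the rendering template are hypothetical.

```python
# Minimal sketch of the integer box-coordinate format: corners normalized to
# [0, 100] and emitted as integers inside the model's text stream.

def box_to_ints(box, img_w, img_h):
    """Convert an absolute-pixel box (x1, y1, x2, y2) to normalized integers."""
    x1, y1, x2, y2 = box
    return [round(100 * x1 / img_w), round(100 * y1 / img_h),
            round(100 * x2 / img_w), round(100 * y2 / img_h)]

def render_box(box, img_w, img_h):
    """Render the box as it might appear in the generated token sequence."""
    return "[{}, {}, {}, {}]".format(*box_to_ints(box, img_w, img_h))

print(render_box((64, 32, 512, 384), img_w=640, img_h=480))  # -> [10, 7, 80, 80]
```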
4. Evaluation Protocols, Metrics, and Empirical Results
Evaluation of GLaMM centers on the Grounded Conversation Generation (GCG) setting, where models must generate captions with each phrase bracketed and followed by a <SEG> token with a corresponding decoded mask (Rasheed et al., 2023). Benchmarks and corresponding metrics include:
- Caption quality: METEOR, CIDEr.
- Mask quality: class-agnostic mask-IoU, AP50.
- Region grounding: mask recall, counting a ground-truth phrase as recalled when mask IoU > 0.5 and BERT-based text similarity > 0.5 (sketched after this list).
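A minimal sketch of the mask-recall criterion, assuming a simple matching over predicted (phrase, mask) pairs and using a placeholder text-similarity function in place of BERT similarity:

```python
# Illustrative mask-recall computation: a ground-truth phrase counts as
# recalled if some prediction matches it with mask IoU > 0.5 and text
# similarity > 0.5. Matching details and the similarity stub are assumptions.
import numpy as np

def mask_iou(a, b):
    a, b = a.astype(bool), b.astype(bool)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def text_similarity(p, q):
    # Placeholder for a BERT-based similarity score in [0, 1].
    return 1.0 if p.lower() == q.lower() else 0.0

def mask_recall(gt, pred, iou_thr=0.5, sim_thr=0.5):
    hits = 0
    for g_phrase, g_mask in gt:
        if any(mask_iou(g_mask, p_mask) > iou_thr and
               text_similarity(g_phrase, p_phrase) > sim_thr
               for p_phrase, p_mask in pred):
            hits += 1
    return hits / max(len(gt), 1)

# Toy example with 4x4 masks.
m = np.zeros((4, 4)); m[:2, :2] = 1
print(mask_recall([("a dog", m)], [("a dog", m.copy())]))  # -> 1.0
```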
GLaMM achieves METEOR 16.2, CIDEr 47.2, AP50 30.8, mIoU 66.3, and mask recall 41.8 on the GranD validation set, exceeding prior region- and mask-level LMMs (BuboGPT, Kosmos-2, LISA) retrained on the same data. For referring-expression segmentation, GLaMM attains 83.2% on RefCOCO testA, surpassing both LISA and specialized state-of-the-art models.
Ablation studies reveal:
- Removing the region encoder impairs fine-grained, object-targeted queries.
- Dropping the pixel decoder reduces mask recall to less than 10%, underscoring the necessity of dense spatial supervision.
Qualitative inspection confirms accurate grounding of attributes, stuff types, object parts, and spatial relationships, validating GLaMM’s suitability for fine-grained visual-language reasoning.
5. Extensions: Multi-Image, Spatio-Temporal, and Domain-specific Grounding
GLaMM’s architectural paradigm has been adapted for advanced grounding challenges:
- Multi-Image Grounding: Migician (Li et al., 10 Jan 2025) integrates cross-modal transformer attention with fully connected grounding heads and demonstrates scalable free-form multi-image reasoning, outperforming Qwen2-VL-72B on the MIG-Bench suite by a 24.94% absolute gain in accuracy.
- Spatio-Temporal Grounding: SpaceVLLM (Wang et al., 18 Mar 2025) introduces spatio-temporal aware queries and a query-guided space decoder, jointly optimizing for spatial localization (GIoU, L1 loss) and temporal segmentation. This architecture covers both video and image modalities and excels on STVG and REC tasks.
- Medical Imaging: MedMO (Deria et al., 6 Feb 2026) leverages a ViT–adapter–LLM design with a multi-stage pipeline (contrastive alignment, instruction tuning, RL with factuality plus box-level GIoU rewards), and achieves +40.4 IoU over prior medical MLLMs for disease localization.
- Document and Text-rich Image Grounding: TRIG (Li et al., 7 Apr 2025) and PostAlign (Wu et al., 22 Jun 2025) extend visual grounding to documents and complex text-rich images, combining instruction tuning, late-stage grounding layers, negative-rejection tokens, and selective reasoning heads. This mitigates hallucination and improves interpretability and the correlation between answers and their spatial support.
6. Emergent and Frozen-grounding Approaches
Recent work demonstrates that pixel-level linguistic-visual grounding can emerge even without direct supervision (Cao et al., 2024, Wu et al., 2024). Main findings:
- Attend-and-Segment: Directly mining decoder cross-attention maps from standard LMMs produces token-attention fields that localize noun phrases; coupled with post-processing (e.g., SAM), these yield effective segmentation masks (sketched after this list). Performance on the GranD GCG test set is competitive (mask recall 44.2, higher METEOR than GLaMM).
- Diffusion-based Visual Encoders: Replacing CLIP encoders with stable diffusion U-Nets (DiffLMM) further enhances implicit localization.
- Frozen LMMs: F-LMM (Wu et al., 2024) shows that grounding abilities can be realized from the inherent attention pattern of LMMs, with only lightweight mask decoders and no finetuning of the main model—thereby perfectly preserving instruction-following performance.
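A minimal sketch of the attend-and-segment idea follows: aggregate decoder cross-attention over layers, heads, and the tokens of a noun phrase, upsample the result to image resolution, and use the peak location as a point prompt for SAM. Tensor shapes, the aggregation scheme, and the prompt-extraction step are illustrative assumptions, not the published method's exact procedure.

```python
# Illustrative attend-and-segment: turn phrase-level cross-attention into a
# coarse localization map and a point prompt. Shapes and aggregation are assumed.
import torch
import torch.nn.functional as F

def phrase_attention_map(cross_attn, phrase_token_idx, img_hw):
    """cross_attn: (layers, heads, text_len, n_img_tokens) attention weights."""
    # Average over layers, heads, and the phrase's tokens.
    attn = cross_attn[:, :, phrase_token_idx, :].mean(dim=(0, 1, 2))  # (n_img_tokens,)
    side = int(attn.numel() ** 0.5)
    attn = attn.reshape(1, 1, side, side)
    # Upsample the token-level field to image resolution and normalize to [0, 1].
    attn = F.interpolate(attn, size=img_hw, mode="bilinear", align_corners=False)[0, 0]
    return (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)

def peak_point(attn_map):
    """Return the (x, y) of the strongest response, usable as a SAM point prompt."""
    idx = attn_map.flatten().argmax().item()
    y, x = divmod(idx, attn_map.shape[1])
    return x, y

# Toy example: 4 layers, 8 heads, 12 text tokens, 24x24 image tokens.
ca = torch.rand(4, 8, 12, 576)
amap = phrase_attention_map(ca, phrase_token_idx=torch.tensor([3, 4]), img_hw=(224, 224))
print(amap.shape, peak_point(amap))
```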
These approaches highlight that instruction-driven and weakly supervised pretraining can enable substantial grounding, reducing the need for labor-intensive per-example box/mask annotation.
7. Limitations, Challenges, and Future Directions
Notwithstanding their empirical strength, GLaMM architectures face several open challenges:
- Label Noise and Dataset Bias: Automated annotation pipelines, while scalable, introduce noisy supervision (Rasheed et al., 2023). Mitigating such noise remains an open problem, with hybrid (human-in-the-loop) and self-supervised correction as candidate remedies (Li et al., 10 Jan 2025).
- Scaling to Video and 3D: Existing systems are predominantly image-based; expansion to temporal or multi-view 3D data requires architectural adaptation (Wang et al., 18 Mar 2025).
- Relational and Abstract Grounding: Composite and relational expressions (“between,” “holding,” “left of”) remain difficult, as standard attention/segmentation routes may not encode inter-object relations robustly (Cao et al., 2024).
- Conversational Trade-offs: Unconstrained fine-tuning can degrade generalist conversational ability (Wu et al., 2024); freezing strategies or modular decoders offer an appealing remedy at the cost of reduced flexibility.
- Efficiency and Deployability: Training at scale is resource-intensive (Rasheed et al., 2023) and inference with high-resolution masks or video sequences presents efficiency challenges.
Open directions include integrating richer spatial objectives in pretraining, enhancing negative-sample rejection to suppress hallucinations, extending to audio/point cloud modalities via learned query patterns, and leveraging self-supervised emergent grounding from raw multimodal corpora.
In summary, GLaMM defines a new generation of large-scale multimodal systems where every object, attribute, or phrase can be pinpointed at the pixel or region level, tightly coupling language, perception, and spatial reasoning. This capability is foundational to robust vision-language dialogue, explanation, and interaction in open-world, multi-modal settings (Rasheed et al., 2023, Cao et al., 2024, Wu et al., 2024, Kang et al., 11 Aug 2025, Wang et al., 18 Mar 2025, Li et al., 10 Jan 2025, Deria et al., 6 Feb 2026, Li et al., 7 Apr 2025).