Grounded Captioning
- Grounded captioning is a task that generates natural language descriptions with explicit spatial or spatio-temporal grounding for each entity in an image or video.
- It employs methodologies such as two-stage pipelines, one-stage architectures, graph-based consensus, and controllable frameworks to align textual output with visual inputs.
- Applications span image and video captioning, where verifiable localization and robust evaluation metrics support downstream vision–language reasoning.
Grounded captioning refers to the task of generating natural language descriptions for images or videos in which each textual reference to an object (or in some frameworks, attribute or relation) is explicitly associated with its spatial or spatio-temporal location in the visual input. This association, or “grounding,” is typically realized via bounding boxes, segmentation masks, or similar localization constructs, enabling the model not only to describe but also to localize every groundable entity mentioned in the caption. Grounded captioning is central to verifiable multi-modal understanding, as it mitigates hallucination (mentioning unobserved entities), enhances interpretability, and facilitates downstream vision–language reasoning.
1. Foundational Problem Statement and Motivation
Grounded captioning operates at the intersection of two classical computer vision tasks: visual localization (object detection, segmentation) and image/video captioning. The core definition goes beyond “captioning” by requiring region- or object-level grounding, which provides:
- Disambiguation: By associating text spans (typically noun phrases) with explicit visual support, grounded captioning forces the model to resolve referential ambiguity and avoid generic or hallucinated content (Deng et al., 4 Feb 2025).
- Comprehensive scene modeling: Region-level grounding encourages coverage of both objects (“things”) and amorphous background elements (“stuff”) (Deng et al., 4 Feb 2025).
- Interpretability and auditability: External users can inspect model predictions for both language and localization, enabling objective evaluation and failure analysis (Zhou et al., 2020).
In video captioning, grounded approaches are further required to provide temporally consistent tracks (object tubes) for each referenced entity, introducing additional complexity (Kazakos et al., 2024, Kazakos et al., 13 Mar 2025).
2. Core Methodological Paradigms
2.1 Two-Stage Detection–Captioning Pipelines
The canonical approach is a two-stage bottom-up pipeline: (1) extract region features using a detector (e.g., Faster-RCNN), (2) generate captions while leveraging region-level attention for grounding (Zhou et al., 2020, Variš et al., 2020). Attention weights are used at generation time to associate words with regions, but this can yield overfitting to detection proposals and insufficient exploitation of inter-object context (Cai et al., 2023).
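As a rough illustration of this paradigm, the sketch below shows a minimal, generic decoder step (not the implementation of any cited paper) that reuses word-level attention over detector regions as the grounding signal; module names and dimensions are illustrative assumptions.

```python
# Minimal sketch of the two-stage idea: region features come from an off-the-shelf
# detector (e.g., Faster-RCNN), and the attention distribution over regions at each
# decoding step doubles as the grounding prediction for the emitted word.
import torch
import torch.nn as nn


class AttendGroundDecoderStep(nn.Module):
    def __init__(self, region_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.rnn = nn.LSTMCell(hidden_dim + region_dim, hidden_dim)
        self.query = nn.Linear(hidden_dim, region_dim)
        self.word_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_emb, regions, state):
        """word_emb: (B, hidden); regions: (B, R, region_dim); state: (h, c)."""
        h, c = state
        # Dot-product attention of the current hidden state over region features.
        scores = torch.einsum("bd,brd->br", self.query(h), regions)
        alpha = scores.softmax(dim=-1)                      # (B, R) attention weights
        context = torch.einsum("br,brd->bd", alpha, regions)
        h, c = self.rnn(torch.cat([word_emb, context], dim=-1), (h, c))
        logits = self.word_head(h)                          # next-word distribution
        grounding = alpha.argmax(dim=-1)                    # region index used as grounding
        return logits, grounding, (h, c)
```

Because the same attention that conditions word prediction is read off as the localization, any bias toward a few salient proposals directly degrades grounding, which motivates the alternatives below.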
2.2 Top-Down and One-Stage Architectures
To mitigate the entanglement with, and overfitting to, region proposals, recent frameworks encode the raw image directly via patch embeddings (e.g., from a Vision Transformer) and employ decoder architectures that link language generation to spatial attention distributions at full image resolution. The “Top-Down” model introduces a Recurrent Grounding Module (RGM), which sharpens grounding maps at each decoding step by conditioning on the previous localization, and an explicit relation module for capturing relational context (Cai et al., 2023).
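The sketch below isolates the recurrent-refinement idea; it is a hedged approximation, not the exact RGM of Cai et al. (2023), and all shapes and the log-prior bias are illustrative choices.

```python
# Sketch of recurrent grounding-map refinement over ViT patch features: the map is
# re-estimated at each decoding step, biased by the previous step's map, so that
# localization can sharpen as the caption unfolds.
import torch
import torch.nn as nn


class RecurrentGroundingStep(nn.Module):
    def __init__(self, patch_dim=768, hidden_dim=512):
        super().__init__()
        self.query = nn.Linear(hidden_dim, patch_dim)
        self.gru = nn.GRUCell(patch_dim, hidden_dim)

    def forward(self, patches, lang_state, prev_map):
        """patches: (B, P, patch_dim); lang_state: (B, hidden); prev_map: (B, P)."""
        # Score each patch with the current language state, biased by the previous
        # grounding map (used as a log-prior; uniform map on the first step).
        scores = torch.einsum("bd,bpd->bp", self.query(lang_state), patches)
        ground_map = (scores + prev_map.log().clamp(min=-1e4)).softmax(dim=-1)
        context = torch.einsum("bp,bpd->bd", ground_map, patches)
        lang_state = self.gru(context, lang_state)
        return ground_map, lang_state
```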
2.3 Graph-Based and Consensus Approaches
Scene graph–driven frameworks extract and align visual and linguistic graphs to construct consensus representations that encode only the semantically consistent nodes and edges (entities, attributes, relations) supported by both modalities. Adversarial alignment is used to ensure congruence between visual and textual semantics, with the result injected into the caption decoder to regularize generation and further mitigate hallucination (Zhang et al., 2021, Zhang et al., 2021).
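As a toy illustration of the consensus idea, the snippet below keeps only scene-graph triplets supported by both modalities; the learned adversarial alignment of the cited papers is replaced here by exact string matching, purely for exposition.

```python
# Toy consensus construction: retain triplets present in both the visual scene
# graph (from a detector/relation predictor) and the linguistic graph (parsed
# from the caption). Real systems learn this cross-modal matching.
def consensus_graph(visual_triplets, textual_triplets):
    """Each triplet is (subject, relation, object), e.g. ('dog', 'on', 'sofa')."""
    return set(visual_triplets) & set(textual_triplets)


visual = {("dog", "on", "sofa"), ("lamp", "next to", "sofa"), ("cat", "on", "rug")}
textual = {("dog", "on", "sofa"), ("cat", "on", "rug")}
print(consensus_graph(visual, textual))  # prints the two triplets present in both graphs
```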
2.4 Explicit Control and Controllability
Controllable grounded captioning frameworks accept user-specified sequences or sets of regions as control signals. The decoder is compelled to generate chunks of text explicitly tied (and often ordered/aligned) to these inputs, via a hybrid LSTM with gating for chunk transitions and sentinel vectors to distinguish between visual and non-visual words (Cornia et al., 2018, Basioti et al., 2024).
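A minimal sketch of the chunk-transition gating follows, assuming an ordered list of user-specified control regions; the gate, threshold, and shapes are illustrative rather than the exact mechanism of the cited works.

```python
# Sketch of control-signal conditioning: at each step a learned gate decides
# whether to stay on the current user-specified region or shift to the next one,
# keeping generated chunks aligned with the ordered control sequence.
import torch
import torch.nn as nn


class ChunkShiftGate(nn.Module):
    def __init__(self, hidden_dim=512, region_dim=2048):
        super().__init__()
        self.gate = nn.Linear(hidden_dim + region_dim, 1)

    def forward(self, h, control_regions, ptr):
        """h: (B, hidden); control_regions: (B, K, region_dim); ptr: (B,) long."""
        current = control_regions[torch.arange(h.size(0)), ptr]   # (B, region_dim)
        shift_prob = torch.sigmoid(self.gate(torch.cat([h, current], dim=-1))).squeeze(-1)
        # Advance the pointer when the gate fires (capped at the last control region).
        new_ptr = torch.where(shift_prob > 0.5, ptr + 1, ptr)
        new_ptr = new_ptr.clamp(max=control_regions.size(1) - 1)
        return shift_prob, new_ptr
```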
2.5 Weakly and Unsupervised Training
In the absence of region–phrase supervision, weakly supervised methods leverage only image–caption pairs. Strategies include cyclical reconstruction (localize–reconstruct cycles) (Ma et al., 2019), distributed multi-head attention to cover complete objects (“partial grounding” alleviation) (Chen et al., 2021), and knowledge distillation from image-text matching networks that can provide pseudo-alignment without explicit box supervision (Zhou et al., 2020).
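A minimal sketch of the localize–reconstruct cycle is given below, assuming precomputed noun-phrase and region embeddings in a shared dimension; the MSE reconstruction term stands in for the cycle-consistency objective, and all names are illustrative.

```python
# Sketch of weakly supervised cycle training: phrase queries attend over regions
# (localization), and a reconstructor must re-predict the phrase embeddings from
# the attended regions, so the only supervision is the image-caption pair itself.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalizeReconstruct(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.to_query = nn.Linear(dim, dim)
        self.reconstruct = nn.Linear(dim, dim)

    def forward(self, phrase_emb, region_feats):
        """phrase_emb: (B, N, dim) noun-phrase embeddings; region_feats: (B, R, dim)."""
        attn = torch.einsum("bnd,brd->bnr", self.to_query(phrase_emb), region_feats)
        attn = attn.softmax(dim=-1)                         # soft localization per phrase
        grounded = torch.einsum("bnr,brd->bnd", attn, region_feats)
        recon = self.reconstruct(grounded)                  # reconstruct phrases from regions
        cycle_loss = F.mse_loss(recon, phrase_emb)
        return attn, cycle_loss
```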
3. Model Components, Losses, and Training Strategies
| Component | Purpose in Grounded Captioning | Representative Papers |
|---|---|---|
| Visual Feature Encoder | Extract region or patch features | (Cai et al., 2023, Zhou et al., 2020) |
| Caption Decoder (LSTM/Transformer) | Generate words and attend over visual encodings | (Cai et al., 2023, Deng et al., 4 Feb 2025) |
| Attention/Localization Module | Assign localization at each step | (Ma et al., 2019, Chen et al., 2021) |
| Relational/Consensus Module | Encode inter-entity relations and global context | (Zhang et al., 2021, Zhang et al., 2021) |
| Cycle Consistency/Reconstruction | Regularize attention/localization in absence of boxes | (Ma et al., 2019) |
| Weakly-Supervised/Distillation Losses | Provide pseudo-alignment signals | (Zhou et al., 2020) |
The predominant training objective is a composite of language modeling loss (cross-entropy) and grounding/localization losses (cross-entropy or contrastive), with additional terms for consensus/graph alignment or multi-label classification for relations (Cai et al., 2023, Zhang et al., 2021). Reinforcement learning with caption–grounding combined rewards is also deployed for improved alignment (Cornia et al., 2018).
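A hedged sketch of such a composite objective is shown below, assuming word-level region annotations where available; the weighting and the masked cross-entropy form are illustrative choices, not the exact losses of the cited papers.

```python
# Composite objective sketch: language-modeling cross-entropy plus a grounding term
# that asks the attention distribution to select the annotated region for each
# grounded word (masked where no box annotation exists).
import torch.nn.functional as F


def grounded_captioning_loss(word_logits, word_targets, attn, region_targets,
                             grounded_mask, lambda_ground=1.0):
    """word_logits: (B, T, V); word_targets: (B, T) gold word ids;
    attn: (B, T, R) attention over regions; region_targets: (B, T) gold region ids;
    grounded_mask: (B, T) float, 1 where the word has a region annotation."""
    # Standard captioning term: next-word cross-entropy.
    lm_loss = F.cross_entropy(word_logits.flatten(0, 1), word_targets.flatten())
    # Grounding term: negative log-likelihood of the annotated region under the
    # attention distribution, averaged over annotated positions only.
    ground_nll = F.nll_loss(attn.clamp(min=1e-8).log().flatten(0, 1),
                            region_targets.flatten(), reduction="none")
    ground_loss = (ground_nll * grounded_mask.flatten()).sum() / grounded_mask.sum().clamp(min=1)
    return lm_loss + lambda_ground * ground_loss
```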
4. Evaluation Protocols and Benchmarking
Grounded captioning benchmarks require evaluation of both language and localization quality, typically on datasets with region–phrase (image) or frame–object–phrase (video) annotations.
- Captioning Metrics: BLEU, METEOR, CIDEr, ROUGE, SPICE, CAPTURE (Deng et al., 4 Feb 2025, Oliveira et al., 19 Feb 2025).
- Grounding Metrics: F1_all (joint object and localization), F1_loc (localization only), AP50 (average precision at IoU ≥ 0.5), mIoU (mean intersection over union), Recall (Cai et al., 2023, Kazakos et al., 2024, Deng et al., 4 Feb 2025); a minimal IoU-based sketch follows this list.
- Hallucination: CHAIR_i/s indices track the frequency of hallucinated (not present) objects in output (Zhang et al., 2021, Zhang et al., 2021).
- Controllability Metrics: Needleman–Wunsch chunk alignment, set-IoU (Hungarian-matched noun coverage), distinct-n (for diversity), length-precision, GRUEN (well-formedness) (Cornia et al., 2018, Basioti et al., 2024).
- Datasets: Flickr30k Entities, COCO Entities, COCONut-PanCap, GroundCap, iGround (for video), HowToGround (large-scale pretrain), GROC (video), MSC (marine), Visual Genome (Deng et al., 4 Feb 2025, Oliveira et al., 19 Feb 2025, Kazakos et al., 13 Mar 2025, Kazakos et al., 2024, Truong et al., 6 Aug 2025, Yin et al., 2019).
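The snippet below is a compact sketch of the localization side of evaluation, assuming boxes in (x1, y1, x2, y2) format: a prediction counts as correct when its IoU with the annotated box for the same phrase reaches 0.5, and F1_loc is computed from the resulting precision and recall. Exact matching protocols differ per benchmark.

```python
# IoU and a simple F1_loc computation for one caption's phrase-box predictions.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)


def f1_loc(pred_boxes, gold_boxes, thr=0.5):
    """pred_boxes/gold_boxes: dicts mapping phrase -> box for one caption."""
    hits = sum(1 for p, box in pred_boxes.items()
               if p in gold_boxes and iou(box, gold_boxes[p]) >= thr)
    prec = hits / max(len(pred_boxes), 1)
    rec = hits / max(len(gold_boxes), 1)
    return 2 * prec * rec / max(prec + rec, 1e-9)
```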
5. Advances in Video and Multimodal Grounded Captioning
Recent progress extends grounded captioning to the video domain, requiring temporally consistent object “tubes” for all captioned entities and capturing temporal objectness (disappearance, occlusion, re-appearance) (Kazakos et al., 2024, Kazakos et al., 13 Mar 2025). Key architectural innovations include:
- Dual visual encoders for global and per-frame features (e.g., CLIP-L, SAM), with temporal adapters for efficient temporal modeling (Kazakos et al., 2024).
- LLMs as backbone decoders for generating captions and tagging noun phrases with temporal spans (Kazakos et al., 2024, Kazakos et al., 13 Mar 2025).
- Joint optimization of language modeling, spatial localization (gIoU/L1 loss), and temporal objectness (binary cross-entropy) (Kazakos et al., 2024, Kazakos et al., 13 Mar 2025); see the sketch after this list.
- Large-scale pre-training: pipelines for automatic dense annotation (HowToGround1M) followed by fine-tuning on high-quality manual data (iGround) yield state-of-the-art results across ActivityNet-Entities, VidSTG, and GROC (Kazakos et al., 13 Mar 2025).
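A hedged sketch of the joint video objective follows: per-frame box regression combines generalized IoU and L1 terms, and a binary “temporal objectness” head predicts whether the referenced object is visible in each frame. The weights are illustrative, and torchvision's generalized_box_iou_loss is used here as one possible gIoU implementation rather than the cited papers' exact code.

```python
# Joint loss sketch for grounded video captioning (language-modeling term omitted).
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss


def video_grounding_loss(pred_boxes, gold_boxes, objectness_logits, visible,
                         w_giou=2.0, w_l1=5.0, w_obj=1.0):
    """pred_boxes/gold_boxes: (N, 4) xyxy boxes on frames where the object is annotated;
    objectness_logits/visible: (T,) per-frame visibility prediction and label."""
    giou = generalized_box_iou_loss(pred_boxes, gold_boxes, reduction="mean")
    l1 = F.l1_loss(pred_boxes, gold_boxes)
    obj = F.binary_cross_entropy_with_logits(objectness_logits, visible.float())
    return w_giou * giou + w_l1 * l1 + w_obj * obj
```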
6. Datasets, Annotation Protocols, and Evaluation Tools
The proliferation of large-scale region-level and panoptic grounding datasets has significantly shaped the field. Notable contributions include:
- COCONut-PanCap: Panoptic masks for both “thing” and “stuff” classes, covering >13 regions per image, with detailed (mean 203 words/image) region-aware captions (Deng et al., 4 Feb 2025).
- GroundCap: Movie-derived set with persistent instance IDs for object/action tracking and explicit tag structure for groundable text, enabling verifiable reference consistency and action-object linking (Oliveira et al., 19 Feb 2025).
- CIC-BART-SSA: AMR-based graph augmentation for systematically filling the data gap in low-coverage, highly focused groundable regions, supporting fine-grained controllability (Basioti et al., 2024).
- Marine and specialized domains: MSC (marine wildlife) exemplifies grounded captioning in challenging, domain-specific scenarios with per-frame mask refinement and domain expert validation (Truong et al., 6 Aug 2025).
- Video: iGround (manual), HowToGround (auto), GROC (manual video) provide spatio-temporal coverage for video-level grounding (Kazakos et al., 13 Mar 2025, Kazakos et al., 2024).
Metric innovations such as gMETEOR (the harmonic mean of grounding F1 and METEOR) have also been introduced to capture localization and language quality jointly (Oliveira et al., 19 Feb 2025).
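Spelling out the stated definition, gMETEOR = 2 · F1 · METEOR / (F1 + METEOR), so a caption scores highly only when both its grounding and its language quality are high.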
7. Limitations, Open Challenges, and Future Directions
Despite substantial innovation, several unresolved challenges persist:
- Complex relations and scene-level grounding: Existing models often ground only objects or noun phrases, with less robust support for predicates, attributes, or complex relational graph structures, particularly in video (Zhang et al., 2021, Cai et al., 2023).
- Partial object grounding and coverage: Standard attention mechanisms can yield partial groundings; distributed/multi-head or cycle-consistency strategies help but do not universally solve coverage (Chen et al., 2021, Ma et al., 2019).
- Annotation bottlenecks and scalability: Human annotation is costly; synthetic and LLM-based augmentation pipelines, as used in COCONut-PanCap and HowToGround, provide scale but may require further validation for high-fidelity tasks (Deng et al., 4 Feb 2025, Kazakos et al., 13 Mar 2025).
- Zero-shot and open-vocabulary grounding: The ability to generalize grounding to unseen categories or complex attributes remains a core ambition (Deng et al., 4 Feb 2025, Basioti et al., 2024).
- Unified segmentation/captioning and cross-modal transfer: Joint panoptic-grounded captioning/segmentation and transfer to downstream tasks (VQA, referring expression segmentation, text-to-image generation) are active areas of research (Deng et al., 4 Feb 2025).
- Temporal consistency and multi-frame reasoning: In video, stable object tracking, handling occlusions, and attribute persistence across frames are substantial open issues (Kazakos et al., 2024, Kazakos et al., 13 Mar 2025).
Anticipated future directions include reinforcement learning–driven VLAM refinement (Cai et al., 2023), integration of richer scene graphs or external knowledge, extension to longer narratives and multi-granularity or part-level grounding, and adaptation to out-of-domain scenarios (egocentric, medical, underwater) (Deng et al., 4 Feb 2025, Truong et al., 6 Aug 2025).
Grounded captioning has evolved from auxiliary attention visualization in neural captioners to a formalized, multi-component field with rigorous evaluation, large-scale benchmarks, explicit methodological advances, and clear pathways to further integration of spatial, semantic, and temporal grounding in multi-modal artificial intelligence.