Multi-Granularity Captioning

Updated 16 April 2026

Multi-Granularity Captioning is a vision-language approach that generates textual descriptions at multiple levels—from coarse scene summaries to fine-grained part details.
It leverages methods such as scene graph decomposition, multi-encoder transformers, and mask-based fusion to control semantic and spatial granularity.
These techniques improve caption diversity and grounding accuracy, and they support practical applications in accessibility, content moderation, and interactive visual search.

Multi-granularity captioning denotes a class of vision-language methods that generate textual or structured descriptions of visual data (images or video) at multiple semantic or spatial resolutions, as required by downstream tasks such as dense captioning, grounded reasoning, or user-guided semantic exploration. Descriptions range from coarse image-level or scene-level summaries, through object-level or segmented-region descriptions, to fine-grained part, attribute, or event-level narratives. Producing them at several levels makes generation controllable, diverse, and grounded, and lets the output match the varied demands of real-world tasks and users. Recent work emphasizes mechanisms for explicit granularity control, fusion of multi-scale representations, grounding via masks or superpixels, and scalable inference-time adaptation.

1. Principles and Taxonomy of Multi-Granularity Captioning

Multi-granularity captioning unifies several related tasks under the requirement that outputs can be produced at multiple, variable granularity levels, often according to explicit user instruction or a task-defined schema. The main axes of variation are:

Spatial Granularity: From entire-scene (global, panoptic), through per-object or per-region (instance, part), down to attribute- or pixel-level descriptions. Mask-based and superpixel-based methods provide flexible region reference (Hua et al., 2024, Senior et al., 11 Mar 2025).
Semantic Granularity: Captions may target high-level scene types, object identities, interactions/relations, or detailed properties (color, pose, action).
Temporal Granularity (in video): One caption per video (coarse), per semantically segmented event, or per fine-grained temporal sub-event (Shin et al., 2016, Yan et al., 2022).
Coverage Granularity: Dense captioning (many short phrases per region) versus narrative captioning (few long sentences per overall scene or event).

This paradigm subsumes and generalizes prior tasks such as dense captioning, region-based (referring, compositional) captioning, attribute-aware captioning, and video event captioning. The design challenge is to enable controllability of granularity, signal distinct levels (often by user instruction or prompt tokens), and fuse multi-scale features for generalization.

2. Methodological Frameworks for Multi-Granularity

Modern architectures enabling multi-granularity include scene-graph-based selection (Zhong et al., 2020), transformer-based multi-encoder fusion (Zhao et al., 2019), mask-aware and superpixel-based encodings (Hua et al., 2024, Senior et al., 11 Mar 2025), and inference-time scalable pipelines (Xing et al., 24 Jun 2025).

Scene Graph Decomposition

Captioning via scene graph decomposition samples overlapping semantic subgraphs at various "scales" (single objects, object pairs, and larger relational groupings) using one-hop neighbor sampling in the detected scene graph. Each subgraph is independently scored and decoded, yielding diverse, granular, and grounded captions that collectively describe the image at multiple semantic levels (Zhong et al., 2020).
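To make the idea of one-hop subgraphs concrete, the following is a minimal sketch, not the Sub-GC implementation. It assumes an undirected scene graph given as a hypothetical list of (subject, object) pairs, and it enumerates candidate subgraphs exhaustively instead of sampling them. The function name `one_hop_subgraphs` and the `max_neighbors` parameter are illustrative choices.

```python
from itertools import combinations


def one_hop_subgraphs(edges, max_neighbors=2):
    """Enumerate small subgraphs around each node of an undirected scene graph.

    edges: iterable of (subject, object) pairs (hypothetical detector output).
    Returns a set of frozensets of node names: each seed node alone, plus the
    seed together with up to `max_neighbors` of its one-hop neighbors.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    subgraphs = set()
    for seed, nbrs in neighbors.items():
        subgraphs.add(frozenset([seed]))  # single-object "scale"
        for k in range(1, max_neighbors + 1):
            for group in combinations(sorted(nbrs), k):
                subgraphs.add(frozenset((seed, *group)))  # pairs and larger groups
    return subgraphs


# Toy scene graph: man-horse, man-hat, horse-field.
edges = [("man", "horse"), ("man", "hat"), ("horse", "field")]
subgraphs = one_hop_subgraphs(edges)
for sg in sorted(subgraphs, key=lambda s: (len(s), sorted(s))):
    print(sorted(sg))
```

In a full system each enumerated subgraph would then be scored and decoded into its own caption; here the output only shows which object groupings become caption candidates. Note that "man" and "field" never form a pair on their own, because they are not one-hop neighbors.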

Multi-Encoder Transformer Architectures

Approaches such as informative image captioning (Zhao et al., 2019) use multiple parallel encoders (for image regions, object labels, and fine-grained web entities), fusing their outputs via gated multi-headed attention. The architecture allows for learned control over the degree of fine-grained label inclusion, regulated both during training via coverage regression and at inference via "coverage boost" parameters.
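The coverage mechanism can be illustrated with a small sketch. It assumes one simple reading of "coverage": the fraction of candidate labels that appear in the caption. The helper names (`coverage_rate`, `coverage_mse`, `boosted_coverage`) and the clipping to [0, 1] are assumptions made for illustration; the exact definitions in Zhao et al. (2019) may differ.

```python
def coverage_rate(caption_tokens, labels):
    """Fraction of candidate labels that appear among the caption tokens."""
    if not labels:
        return 0.0
    tokens = set(caption_tokens)
    return sum(1 for label in labels if label in tokens) / len(labels)


def coverage_mse(predicted, target):
    """Squared error between a predicted and a target coverage rate."""
    return (predicted - target) ** 2


def boosted_coverage(predicted, boost):
    """Scale the coverage control signal at inference, clipped to [0, 1]."""
    return max(0.0, min(1.0, predicted * boost))


caption = "a golden retriever chases a ball in the park".split()
labels = ["retriever", "ball", "frisbee", "park"]

target = coverage_rate(caption, labels)  # 3 of the 4 labels are mentioned
loss = coverage_mse(0.5, target)         # regression penalty for predicting 0.5
print(target, loss, boosted_coverage(0.5, 1.5))
```

Under these assumptions, raising the boost factor above 1 asks the decoder to mention more of the fine-grained labels, while a factor below 1 asks for a more generic caption.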

Mask- and Superpixel-Based Region Captioning

FINECAPTION (Hua et al., 2024) leverages arbitrary, user-provided or detected segmentation masks at any granularity, encoding them jointly with low- and high-resolution image streams into an LLM using channel-wise fusion. SuperCap further generalizes detector-free region representation via multi-scale superpixels, allowing the model to flexibly attend to coarse or fine-grained visual patches for caption generation (Senior et al., 11 Mar 2025). In all such systems, cross-attention with token-level fusion at the decoder enables the model to dynamically select the spatial/semantic level at which to ground a given caption phrase or token.

Instruction-Driven and Inference-Scalable Systems

Instruction-driven multimodal interfaces such as MGLMM (Zhou et al., 2024) allow user prompts to specify target granularity, which is reflected both in the segmentation mask set (panoptic versus part-level) and in the granularity of the returned captions. Inference-time scalable pipelines (e.g., ScaleCap (Xing et al., 24 Jun 2025)) iteratively refine captions by asking targeted questions for additional object- or position-level details, explicitly controlling the richness (granularity) of the final output via a budget parameter. Sentence-level visual grounding is enforced via contrastive probability scoring.
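The budget-controlled refinement loop and the contrastive grounding check can be sketched as follows. This is a toy illustration under stated assumptions, not the ScaleCap code: `answer_fn` and `score_fn` stand in for model calls, the per-token log-probabilities are invented numbers, and the threshold value is arbitrary.

```python
def contrastive_score(logp_with_image, logp_text_only):
    """Mean per-token gain in log-probability from conditioning on the image."""
    gains = [w - t for w, t in zip(logp_with_image, logp_text_only)]
    return sum(gains) / len(gains)


def refine_caption(base_caption, questions, answer_fn, score_fn, budget, threshold):
    """Ask at most `budget` questions; keep answers whose score passes `threshold`."""
    sentences = [base_caption]
    for question in questions[:budget]:
        detail = answer_fn(question)
        if score_fn(detail) > threshold:  # drop weakly grounded sentences
            sentences.append(detail)
    return " ".join(sentences)


# Hypothetical stand-ins for the question-answering model and its log-probs.
answers = {
    "What color is the dog?": "The dog is brown.",
    "What is behind the dog?": "A red car is parked behind it.",
    "What is the weather?": "It is sunny.",
}
logps = {  # sentence -> (log-probs with image, log-probs without image)
    "The dog is brown.": ([-0.2, -0.1, -0.3, -0.4], [-0.9, -0.8, -1.0, -2.0]),
    "A red car is parked behind it.": ([-1.0, -1.1, -0.9], [-1.0, -1.0, -1.0]),
    "It is sunny.": ([-0.3, -0.2, -0.5], [-1.5, -1.0, -1.8]),
}


def score(sentence):
    return contrastive_score(*logps[sentence])


questions = list(answers)
short = refine_caption("A dog stands on a street.", questions, answers.get, score, 2, 0.5)
long = refine_caption("A dog stands on a street.", questions, answers.get, score, 3, 0.5)
print(short)
print(long)
```

In this toy run the sentence about the red car is equally likely with and without the image, so it is treated as ungrounded and discarded at either budget. Raising the budget from 2 to 3 adds one more grounded detail to the caption.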

3. Training Objectives, Losses, and Granularity Control

Losses across systems are tailored to incentivize both fidelity and compliance with granularity constraints:

Cross-Entropy Sequence Loss: Ubiquitous in decoder architectures, minimized over all generated tokens per sentence or per region/subgraph (Zhong et al., 2020, Hua et al., 2024, Senior et al., 11 Mar 2025).
Coverage Regression Losses: Used for controlling inclusion of object/web-entity labels in output, with optional MSE losses guiding the network to match desired coverage rates (Zhao et al., 2019).
Region/Mask Supervision: Losses such as binary cross-entropy and Dice loss align generated segmentation masks with ground-truth regions, especially in segmentation-captioning models (Zhou et al., 2024). IoU thresholds control positive/negative subgraph labeling for training region-level scorers (Zhong et al., 2020).
Reinforcement Learning: Some video captioning methods employ reward-driven RL fine-tuning (e.g., CIDEr-based rewards, metric-weighted discriminative cross-entropy, discrepant-reward RL) to further drive performance on content- or coverage-based evaluation metrics (Yan et al., 2022).
Contrastive Filtering: ScaleCap uses contrastive sentence rating to discard hallucinated or weakly grounded captions based on the difference in decoding probability with and without image input, thus ensuring that retained sentences are visually evidenced (Xing et al., 24 Jun 2025).
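As an illustration of the region/mask supervision term in the list above, here is a minimal sketch of a combined binary cross-entropy and Dice loss over a flattened mask. The equal weighting, the smoothing constant, and the function names are assumptions for this example; the cited systems may weight or normalize the terms differently.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """One minus the smoothed Dice overlap between prediction and target."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def mask_loss(probs, targets, w_bce=1.0, w_dice=1.0):
    """Weighted sum of the two terms (weights are illustrative)."""
    return w_bce * bce_loss(probs, targets) + w_dice * dice_loss(probs, targets)


target = [1.0, 0.0, 1.0, 0.0]  # flattened ground-truth mask
good = [1.0, 0.0, 1.0, 0.0]    # perfect prediction
bad = [0.0, 1.0, 0.0, 1.0]     # fully inverted prediction

print(mask_loss(good, target), mask_loss(bad, target))
```

The cross-entropy term penalizes each pixel independently, while the Dice term measures overlap over the whole region, which keeps small masks from being dominated by background pixels.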

Explicit granularity control is realized through input prompts (instruction keywords), inference parameters (scale budget, coverage boosting), or architectural features (gate activations, region proposal filtering).

4. Datasets, Benchmarks, and Quantitative Results

Multi-granularity methods rely on datasets annotated at various scales or with region-level detail:

MGSCData (Zhou et al., 2024): 10k images with mask and caption pairs at both panoptic and fine-grained levels, generated via a three-stage GPT-4o-assisted annotation pipeline. Enables benchmarking for MGSC and related tasks.
CapOnImage2M (Gao et al., 2022): 2.1M e-commerce images with an average of 4.8 spatially localized captions per image, supporting dense and multi-location captioning.
CompositionCap (Hua et al., 2024): Over 200k region-attribute pairs annotated for attribute-aware, region-dense, and global (whole-image) granularities.
MS-COCO (Karpathy split), Flickr30K Entities, MSR-VTT, MSVD: Standard image/video captioning benchmarks extended for multi-granularity evaluation in various works.

Reported results indicate that multi-granularity models improve both coverage/diversity and downstream utility. For example, Sub-GC increases 1-gram diversity by 29% over prior art (Zhong et al., 2020), FINECAPTION-8B achieves a CIDEr of 127.95 for regional captioning (Hua et al., 2024), SuperCap achieves 136.9 CIDEr on COCO Karpathy (Senior et al., 11 Mar 2025), and ScaleCap-enhanced captions yield large gains in VQA tasks when used as text-only inputs at increased scale budget (Xing et al., 24 Jun 2025). Unified data formats and multi-task learning further enable robust transfer across tasks.

5. Applications and User-Centric Adaptation

Practical use cases for multi-granularity captioning span automated content moderation, data annotation, accessibility, VQA, graphic design, and interactive search. Instruction-driven frameworks (Zhou et al., 2024, Xing et al., 24 Jun 2025) allow users to directly modulate the level of description according to task needs or interface constraints—generating summaries, object listings, or fine-grained descriptions on demand. Dense, localized captioning supports e-commerce and social-media image enhancement (Gao et al., 2022). Fine-grained, mask-based referencing aids compositional reasoning and retrieval, especially for multimodal LLMs (Hua et al., 2024).

6. Current Limitations and Open Directions

Despite strong empirical gains, challenges persist. Existing models can be computationally expensive because they process multiple region crops (superpixels, masks) and decode iteratively (Senior et al., 11 Mar 2025, Xing et al., 24 Jun 2025). Linguistic hallucination and insufficient visual grounding remain obstacles, addressed in part by contrastive filtering (Xing et al., 24 Jun 2025). Optimal fusion of multi-resolution features, scale-invariant architectures, and harmonization between region-level and global features are active research areas. Dataset biases and annotation bottlenecks motivate automated, LLM-assisted pipelines, but these raise questions about annotation quality and domain generalization (Hua et al., 2024, Zhou et al., 2024). Mechanisms for explicit cross-resolution attention and learned fusion weights are proposed as future enhancements (Senior et al., 11 Mar 2025).

7. Comparative Summary Table

Approach/Model	Granularity Axis	Control Mechanism	Key Technical Features
Sub-GC (Zhong et al., 2020)	Semantic/subgraph	Subgraph selection, NMS, scoring	Scene graph decomposition, GCN, LSTM
MGLMM (Zhou et al., 2024)	Segments/masks	Prompt-driven, USCDF markers	CLIP+Vicuna+SAM, unified format
FINECAPTION (Hua et al., 2024)	Mask/attribute	Mask-based input, LLM cross-attn	HL/LL encodings, attribute captions
SuperCap (Senior et al., 11 Mar 2025)	Superpixel-scale	Multi-resolution attention	SLIC, VLM region features
ScaleCap (Xing et al., 24 Jun 2025)	Iterative detail	Inference budget (N), contrastive	Heuristic Q&A, contrastive filtering

Each model provides a distinct yet complementary instantiation of multi-granularity control, diverse region referencing, and granularity-aware caption generation.

Across models and tasks, multi-granularity captioning has emerged as a foundational paradigm for flexible, scalable, and controllable vision-language understanding, offering explicit mechanisms for detail modulation, semantic diversity, and grounded localization across spatial, semantic, and temporal domains. The integration of multi-resolution features, compositional reasoning, and instruction-driven adaptation represents a robust foundation for further research and practical deployment in multimodal AI systems.