Tree-of-Captions: Hierarchical Image Captioning

Updated 17 January 2026
  • Tree-of-Captions is a hierarchical approach to image captioning that organizes image features and textual descriptions into tree structures for detailed and coherent narrative generation.
  • It integrates bottom-up visual parsing and tree-structured semantic prototype aggregation to capture multi-scale context and compositional relationships.
  • Advanced methods such as Monte Carlo Tree Search enable top-down sequential planning, enhancing caption coherence while reducing redundancy.

Tree-of-Captions refers to hierarchical architectures and search paradigms for image captioning that organize image representations, semantic prototypes, or candidate textual descriptions into tree-structured forms to improve detail, compositionality, and narrative coherence. In recent research, the term encompasses three lines of work: bottom-up hierarchical visual parsing for semantic feature enrichment (Yao et al., 2019), progressive cross-modal aggregation using tree-structured semantic prototypes (Zeng et al., 2022), and top-down MDP-based planning over candidate caption trees (Zhang et al., 25 Oct 2025). These methods advance image captioning by enforcing explicit multi-level context modeling, structured attention, and refined sequential search over possible natural-language utterances.

1. Hierarchical Visual Parsing and Encoding

Tree-of-Captions architectures frequently begin with a multi-level decomposition of the input image. In “Hierarchy Parsing for Image Captioning” (HIP) (Yao et al., 2019), the image is processed using Faster R-CNN (ResNet-101 backbone) for object region detection, followed by Mask R-CNN to obtain binary foreground instance masks. For each detected region $r_i$, a corresponding instance $m_i$ is created by multiplying the region’s feature map with its segmentation mask, then re-extracting features using another Faster R-CNN. The resulting hierarchy is structured as follows: the root node represents the entire image; the intermediate layer consists of region nodes; the leaf layer comprises instances, each attached to its parent region.

To model possible region-to-region nesting (multi-scale subregions), regions are sorted by area, and each region $r_i$ is attached to another region (its parent node) when their intersection-over-union (IoU) exceeds a threshold $\epsilon = 0.1$; otherwise, $r_i$ attaches to the root. Each instance $m_i$ is a leaf of its corresponding region $r_i$, establishing an explicit three-level tree over the visual domain.
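The attachment rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Node` class, the `build_hierarchy` function, the `(x1, y1, x2, y2)` box format, and the tie-breaking rule (attach to the already-placed region with the highest IoU) are all assumptions made for the example. Only the area-based ordering and the threshold of 0.1 come from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def area(b: Box) -> float:
    """Area of a box; degenerate boxes count as zero."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Node:
    name: str
    box: Optional[Box] = None  # None for the root and for instance leaves
    children: List["Node"] = field(default_factory=list)


def build_hierarchy(regions: Dict[str, Box], eps: float = 0.1) -> Node:
    """Build a root -> regions -> instances tree from named region boxes."""
    root = Node("image")
    placed: List[Node] = []
    # Largest regions first, so candidate parents exist before sub-regions.
    ordered = sorted(regions.items(), key=lambda kv: area(kv[1]), reverse=True)
    for name, box in ordered:
        node = Node(name, box)
        candidates = [p for p in placed if iou(p.box, box) > eps]
        # Assumption: among qualifying parents, pick the highest-IoU one.
        parent = max(candidates, key=lambda p: iou(p.box, box)) if candidates else root
        parent.children.append(node)
        placed.append(node)
        # Each region carries its masked instance as a leaf.
        node.children.append(Node(f"{name}/instance"))
    return root
```

For example, with a large "table" box that overlaps a smaller "plate" box at IoU 0.25 and a disjoint "dog" box, `build_hierarchy` places "table" and "dog" under the root and "plate" under "table".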

2. Tree-Structured Semantic Prototypes in Captioning

The Progressive Tree-Structured Prototype Network (PTSN) (Zeng et al., 2022) introduces hierarchical modeling in the textual semantic space for image captioning. The approach begins by clustering word embeddings (nouns, verbs, adjectives) into coarse semantic groups using K-means. Level $\ell = 1$ prototypes are formed by clustering the embeddings $\{x_c\}$ into $F_1$ clusters; higher-level prototypes are constructed recursively by running K-means over the centroids of the previous level. This produces a tree $\{p_{\ell,i}\}$ of semantic prototypes, capturing hierarchical relationships among concepts.
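The recursive clustering can be illustrated with a small pure-Python sketch. It is only a toy: `kmeans`, `build_prototype_tree`, the random initialization, the fixed iteration count, and the `sizes` argument (the number of clusters per level, starting with $F_1$) are hypothetical names and choices for this example, and real systems would cluster high-dimensional word embeddings with an optimized library.

```python
import random
from typing import List

Vec = List[float]


def _dist2(a: Vec, b: Vec) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(vectors: List[Vec]) -> Vec:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(points: List[Vec], k: int, iters: int = 20, seed: int = 0) -> List[Vec]:
    """Plain Lloyd's algorithm; returns k centroids."""
    if k > len(points):
        raise ValueError("k must not exceed the number of points")
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets: List[List[Vec]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: _dist2(p, centroids[c]))
            buckets[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [_mean(b) if b else centroids[i] for i, b in enumerate(buckets)]
    return centroids


def build_prototype_tree(word_vecs: List[Vec], sizes: List[int]) -> List[List[Vec]]:
    """Level 1 clusters the word vectors; each later level clusters the
    centroids of the level before it."""
    levels: List[List[Vec]] = []
    current = word_vecs
    for k in sizes:
        current = kmeans(current, k)
        levels.append(current)
    return levels
```

Calling `build_prototype_tree(vectors, [4, 2])` on twelve 2-D vectors returns two levels, with four prototypes at level 1 and two at level 2.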

Progressive aggregation is performed via stacked cross-attention blocks, each aligning image grid features $G_\ell$ to a prototype set $P_\ell$. This sequence narrows the semantic scope from coarse to fine, as each attention block restricts focus to finer concept candidates, culminating in a visual memory that is semantically structured. Caption decoding proceeds with a standard Transformer, using the progressively refined visual features as memory. Training uses cross-entropy followed by self-critical sequence training (RL) with a CIDEr reward.
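One way to read a single aggregation block is as generic scaled dot-product cross-attention with a residual connection, where grid features act as queries and prototypes as keys and values. The form below is an assumed, simplified sketch: the projection matrices $W_Q^{(\ell)}$, $W_K^{(\ell)}$, $W_V^{(\ell)}$, the dimension $d$, and the output symbol $\tilde G_\ell$ are introduced here for illustration, and the actual PTSN block may include additional components such as normalization or feed-forward layers.

```latex
A_\ell = \operatorname{softmax}\!\left(
  \frac{\bigl(G_\ell W_Q^{(\ell)}\bigr)\bigl(P_\ell W_K^{(\ell)}\bigr)^{\top}}{\sqrt{d}}
\right),
\qquad
\tilde G_\ell = G_\ell + A_\ell \, P_\ell W_V^{(\ell)}
```

Under this reading, the output $\tilde G_\ell$ of one block serves as the grid features fed to the next block, which attends over the next prototype set.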

3. Top-Down Semantic Refinement as Tree Search

In “Top-Down Semantic Refinement for Image Captioning” (TDSR) (Zhang et al., 25 Oct 2025), image captioning is recast as a hierarchical refinement planning problem. The generation process is formalized as a Markov Decision Process $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R})$, with states $s_t$ as partial captions, actions $a_t$ as token extensions, and deterministic transitions $s_{t+1} = s_t \oplus a_t$. The reward function $R(s_T)$ penalizes redundancy and incentivizes caption depth, combining CLIP/VLM-based quality scores, a length incentive $\alpha \log(1 + |s_T|)$, and n-gram repetition penalties.
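A reward of this shape can be sketched as follows. The additive combination, the weight `beta`, the bigram default, and the exact definition of the repetition penalty are assumptions for illustration; the source only states which three ingredients are combined. The `quality` argument stands in for a CLIP/VLM-based score computed elsewhere.

```python
import math
from collections import Counter
from typing import List


def repetition_penalty(tokens: List[str], n: int = 2) -> float:
    """Fraction of n-grams that duplicate an earlier n-gram (0.0 if none)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


def caption_reward(tokens: List[str], quality: float,
                   alpha: float = 0.1, beta: float = 1.0, n: int = 2) -> float:
    """Assumed form: quality + alpha * log(1 + |s_T|) - beta * repetition."""
    length_bonus = alpha * math.log(1 + len(tokens))
    return quality + length_bonus - beta * repetition_penalty(tokens, n)
```

With equal quality scores, "a dog on a couch" scores higher than "a dog a dog a dog": the second caption is one token longer, but three of its five bigrams are repeats, so the penalty outweighs the length bonus.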

A Monte Carlo Tree Search (MCTS) algorithm is used for sequential planning: nodes represent partial captions, branches correspond to possible next tokens or region-specific prompts, and leaves correspond to candidate completed captions. Visual-guided parallel expansion is achieved by identifying $k$ salient regions and batching VLM queries for region-focused prompts, producing policy vectors $p^{(i)}$ and corresponding value predictions $v^{(i)}_{\mathrm{vlm}}$. Simulations use a lightweight transformer-based value network to estimate expected rewards efficiently. Combined value propagation and adaptive early stopping keep the computational cost manageable and let the refinement depth adapt to the image.
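The search loop can be illustrated with a compact, generic MCTS over partial captions. This is a sketch under stated assumptions rather than the TDSR algorithm: it uses a PUCT-style selection rule (a common choice when a policy prior and a value estimate are available), runs sequentially instead of batching region prompts, and omits adaptive early stopping. The callbacks `policy`, `value`, and `is_terminal` are hypothetical stand-ins for the VLM prior, the value network, and the termination test.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

Caption = Tuple[str, ...]


@dataclass
class Node:
    state: Caption
    prior: float = 1.0
    visits: int = 0
    value_sum: float = 0.0
    children: Dict[str, "Node"] = field(default_factory=dict)

    def q(self) -> float:
        """Mean backed-up value; 0.0 for unvisited nodes."""
        return self.value_sum / self.visits if self.visits else 0.0


def puct(parent: Node, child: Node, c: float) -> float:
    """Exploitation term plus a prior-weighted exploration bonus."""
    return child.q() + c * child.prior * math.sqrt(parent.visits) / (1 + child.visits)


def mcts_next_token(root_state: Caption,
                    policy: Callable[[Caption], Dict[str, float]],
                    value: Callable[[Caption], float],
                    is_terminal: Callable[[Caption], bool],
                    n_sim: int = 50, c: float = 1.0) -> str:
    """Return the most-visited next token after n_sim simulations."""
    if is_terminal(root_state):
        raise ValueError("root state is already terminal")
    root = Node(root_state)
    for _ in range(n_sim):
        node, path = root, [root]
        # 1) Selection: descend while the current node has children.
        while node.children:
            parent = node
            node = max(parent.children.values(), key=lambda ch: puct(parent, ch, c))
            path.append(node)
        # 2) Expansion: deterministic transition s' = s + (token,).
        if not is_terminal(node.state):
            for tok, p in policy(node.state).items():
                node.children[tok] = Node(node.state + (tok,), prior=p)
        # 3) Evaluation: value estimate instead of a full rollout.
        v = value(node.state)
        # 4) Backup: update statistics along the visited path.
        for visited in path:
            visited.visits += 1
            visited.value_sum += v
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```

With a uniform toy policy over a three-token vocabulary and a value function that rewards captions whose second token is "dog", the search started from `("a",)` concentrates its visits on the "dog" branch.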

4. Feature Aggregation and Message Passing Mechanisms

Bottom-up aggregation in HIP (Yao et al., 2019) employs a Tree-LSTM to encode hierarchical dependencies, utilizing forget gates and input gating across children nodes. At each node $j$, the hidden state $h_j$ and cell state $c_j$ are recursively updated, integrating the sum of the children's hidden states $\widetilde{h}_j$. The root’s hidden output $h_{\text{root}}$ aggregates global image context, and enriched features $r_i^h$, $m_i^h$ propagate up the hierarchy. Message passing yields multi-level features used in top-down attention LSTM decoders.
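The description matches the standard Child-Sum Tree-LSTM formulation, reproduced here for reference. Treat it as the generic update rather than HIP's exact parameterization: $x_j$ (the input feature at node $j$), $C(j)$ (the set of children of $j$), and the weight names $W^{(\cdot)}$, $U^{(\cdot)}$, $b^{(\cdot)}$ are notation assumed for this sketch.

```latex
\begin{aligned}
\widetilde{h}_j &= \sum_{k \in C(j)} h_k \\
i_j &= \sigma\!\left(W^{(i)} x_j + U^{(i)} \widetilde{h}_j + b^{(i)}\right) \\
f_{jk} &= \sigma\!\left(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}\right), \quad k \in C(j) \\
o_j &= \sigma\!\left(W^{(o)} x_j + U^{(o)} \widetilde{h}_j + b^{(o)}\right) \\
u_j &= \tanh\!\left(W^{(u)} x_j + U^{(u)} \widetilde{h}_j + b^{(u)}\right) \\
c_j &= i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k \\
h_j &= o_j \odot \tanh(c_j)
\end{aligned}
```

Each child receives its own forget gate $f_{jk}$, which lets a parent keep or discard information from individual sub-regions or instances, while the input, output, and update gates see only the summed child state.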

PTSN (Zeng et al., 2022) progressively fuses prototype semantics into visual grid features, cross-attending over textual prototype sets at each level. This mechanism constrains generation to semantically coherent candidate tokens, facilitating composition.

In TDSR (Zhang et al., 25 Oct 2025), MCTS-driven expansions propagate value estimates and search statistics; adaptive mechanisms control the number of tree exploration iterations per step, balancing detail and efficiency.

5. Decoding, Attention, and Caption Generation

In HIP (Yao et al., 2019), a two-layer top-down attention LSTM decoder replaces raw image input with tree-enriched features $(I^h, \bar{r}, \bar{m})$. At each decoding step, the model concatenates word embeddings, prior hidden states, and global plus local features, computes attention weights over region/instance vectors $v_i = [r_i^h; r_i; m_i]$, and produces attended features $\hat{v}_t$. Decoding proceeds with region-aware attention, followed by next-word prediction via softmax over the output LSTM hidden states.
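For reference, the attention step in a standard top-down (Up-Down style) decoder takes the additive form below, applied here to the concatenated vectors $v_i$. The parameters $w_a$, $W_{va}$, $W_{ha}$ and the attention-LSTM state $h_t^{1}$ follow the usual Up-Down notation and are assumptions of this sketch rather than symbols defined in the text above.

```latex
\begin{aligned}
a_{i,t} &= w_a^{\top} \tanh\!\left(W_{va}\, v_i + W_{ha}\, h_t^{1}\right) \\
\alpha_t &= \operatorname{softmax}(a_t) \\
\hat{v}_t &= \sum_{i} \alpha_{i,t}\, v_i
\end{aligned}
```

The attended feature $\hat{v}_t$ is then passed, together with $h_t^{1}$, to the second (language) LSTM layer that predicts the next word.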

PTSN (Zeng et al., 2022) uses Transformer-based decoding over progressively aggregated visual memory, with caption probabilities $p_\theta(s_t \mid s_{1:t-1}, I)$ obtained via softmax projections of the decoder outputs.

TDSR (Zhang et al., 25 Oct 2025) selects next-word expansions via MCTS, leveraging VLM priors and value network predictions. Caption growth proceeds in a hierarchical, region-conditioned manner, terminating upon reaching an end-of-sequence token or adaptive search threshold.

6. Quantitative Performance and Ablation Analysis

HIP (Yao et al., 2019) improves COCO captioning metrics: Up-Down+HIP achieves CIDEr-D=127.2, compared to 120.1 for the baseline; GCN-LSTM+HIP reaches 130.6 on the Karpathy split, outperforming prior approaches. Ablations reveal that joint use of region, instance, and hierarchical features yields the best captioning scores.

PTSN (Zeng et al., 2022) reports CIDEr scores that were state of the art at the time of publication: 144.2 for single models and 146.5 for ensembles on the Karpathy split, with 141.4 (c5) and 143.9 (c40) on the MSCOCO online test server.

TDSR (Zhang et al., 25 Oct 2025) reports substantial gains. On DetailCaps, it raises LLaVA-1.5’s CAPTURE score from 49.99 to 66.7 and Qwen2.5-VL’s from 64.7 to 72.2; on COMPOSITIONCAP, CIDEr improves from 86.5 to 124.2 (LLaVA-1.5) and from 120.3 to 129.4 (Qwen2.5-VL), indicating enhanced compositional generalization. POPE evaluations show significant hallucination suppression under the adversarial and random settings, outperforming contemporary baselines.

Efficiency ablations confirm that TDSR’s MCTS planner reduces VLM call frequency by an order of magnitude with only marginal loss in captioning quality; disabling parallel expansion or adaptive stopping increases latency substantially.

7. Significance and Implications

Tree-of-Captions frameworks explicitly encode hierarchical context (through bottom-up structured visual parsing, tree-based semantic prototype aggregation, or top-down MCTS-driven search), yielding captions with greater detail, compositionality, and robustness. These paradigms combine multi-scale visual analysis with sequential natural-language planning, addressing the limited global coherence and local precision of flat captioning models. Their reported gains on benchmarks support the value of hierarchical structure in both feature representation and generative search for image captioning. A plausible implication is that further integration of tree-based planning with high-capacity VLMs and compositional reward modeling will continue to advance the quality and reliability of scene description systems.
