Tree of Captions: Hierarchical Image Captioning
- Tree of Captions is a paradigm that decomposes image captioning into a hierarchical structure by decoding components like noun phrases, regions, and prototypes.
- It employs models such as phi-LSTM, HIP, and tree-structured prototype networks to generate content-rich captions with improved metrics over conventional methods.
- This approach offers robustness, plug-and-play integration, and versatility across applications, ranging from natural scene description to procedural grammar modeling.
The term "Tree of Captions" denotes a class of image captioning models and frameworks that explicitly represent and decode the hierarchical structure of linguistic or visual information in images. Rather than treating captions as linear sequences, these approaches decompose the process into structured subunits—such as noun phrases, semantic prototypes, or image regions—and assemble them in tree-like fashion to form comprehensive descriptions. This paradigm is observed across multiple lines of research, encompassing hierarchical neural models, prototype-based transformer architectures, and theoretically principled captioning frameworks. Collectively, these systems leverage the recursive or compositional dependencies in images and language to enable richer, more accurate, and more flexible captions.
1. Hierarchical Neural Architectures: Phrase-Based and Tree-LSTM Models
Initial advancements in the "Tree of Captions" methodology focused on modeling the temporal and syntactic hierarchy inherent in natural language. The phrase-based hierarchical LSTM (phi-LSTM) model (Tan et al., 2017) embodies this principle by introducing a two-level decoding structure:
- Phrase Decoder (Lower Level): Receives a CNN-encoded image and generates variable-length noun phrases (NPs) describing objects and attributes. Each NP is encoded as a compositional vector, specifically the final hidden state of the LSTM over the phrase sequence.
- Abbreviated Sentence (AS) Decoder (Upper Level): Operates on a mixed sequence comprising words and NP vectors. The decoder determines at each time step whether to insert a word or an NP (based on binary phrase-indication signals), eventually constructing a full caption by splicing together the outputs.
This approach results in a directed tree-like structure, with phrases corresponding to branches and the sentence root assembling these elements. The inference stage employs dual beam search—first for NP candidates, then for sentence assembly—ensuring both diversity and semantic coverage.
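A minimal sketch of the two-level decoding is shown below; it is a PyTorch illustration with assumed dimensions and a simplified image-conditioning scheme (the phrase LSTM's final hidden state serves as the NP vector, and the AS decoder consumes a mixed word/NP sequence), not the authors' exact implementation:

```python
# Hedged sketch of phi-LSTM-style two-level decoding (assumed shapes and names).
import torch
import torch.nn as nn

class PhraseEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, phrase_tokens):                 # (B, T_phrase) token ids
        x = self.embed(phrase_tokens)
        _, (h, _) = self.lstm(x)
        return h[-1]                                   # (B, hidden_dim) NP vector

class AbbreviatedSentenceDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.phrase_proj = nn.Linear(hidden_dim, embed_dim)  # map NP vectors into token space
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, word_tokens, phrase_vecs, is_phrase):
        # word_tokens: (B, T) ids; phrase_vecs: (B, T, hidden_dim);
        # is_phrase: (B, T, 1) binary indicator selecting NP slots over words.
        words = self.embed(word_tokens)
        phrases = self.phrase_proj(phrase_vecs)
        mixed = torch.where(is_phrase.bool(), phrases, words)
        h0 = torch.tanh(self.img_proj(img_feat)).unsqueeze(0)   # image initialises the state
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(mixed, (h0, c0))
        return self.out(out)                           # per-step vocabulary logits
```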
Empirically, phi-LSTM demonstrates superior or competitive performance on Flickr8k, Flickr30k, and MS-COCO, particularly in higher-order BLEU, METEOR, ROUGE-L, and SPICE scores. The model's ability to generate novel, content-rich captions arises from decoupling phrase and sentence modeling, which increases word-content diversity and reduces redundancy.
2. Hierarchy Parsing and Tree-LSTM Feature Integration
The Hierarchy Parsing (HIP) architecture (Yao et al., 2019) advances the tree-based paradigm by applying hierarchical parsing directly to the image encoder, leveraging visual structure rather than textual structure alone. HIP constructs a three-level tree:
- Instance Level: Mask R-CNN segments instance-level foreground objects within the detected regions.
- Region Level: Faster R-CNN detects contextual regions, which are connected into a tree according to Intersection over Union (IoU) criteria (see the sketch after this list).
- Image Level: The root node represents the entire image, integrating region and instance children.
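One plausible (assumed) reading of the IoU-based construction is that each region is attached as a child of the larger region with which it overlaps most, and otherwise to the image root; HIP's exact criterion may differ. The sketch below only illustrates the region-tree idea:

```python
# Hypothetical region-tree construction by IoU (not HIP's exact rule).
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def build_region_tree(boxes, iou_thresh=0.5):
    """Return a parent index per region (-1 means the image root)."""
    # Visit regions from largest to smallest so parents are always larger regions.
    order = sorted(range(len(boxes)),
                   key=lambda i: -(boxes[i][2] - boxes[i][0]) * (boxes[i][3] - boxes[i][1]))
    parents = {i: -1 for i in range(len(boxes))}
    for rank, i in enumerate(order):
        candidates = [(iou(boxes[i], boxes[j]), j) for j in order[:rank]]
        best = max(candidates, default=(0.0, -1))
        if best[0] >= iou_thresh:
            parents[i] = best[1]
    return parents

# Example: two smaller regions nested inside a larger one.
boxes = [(0, 0, 100, 100), (10, 10, 60, 60), (50, 50, 95, 95)]
print(build_region_tree(boxes, iou_thresh=0.2))   # {0: -1, 1: 0, 2: 0}
```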
A Tree-LSTM network then propagates and refines information in a bottom-up fashion, aggregating node features through input, output, and forget gates operating on child states. The output is a multi-level, context-enriched feature set supplied to neural captioning decoders (attention-based LSTM or GCN-LSTM).
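A minimal child-sum Tree-LSTM cell, with assumed dimensions, illustrates this bottom-up aggregation: child hidden states are summed to drive the input and output gates, while each child receives its own forget gate.

```python
# Minimal child-sum Tree-LSTM cell sketch (dimensions and naming assumed).
import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.W_iou = nn.Linear(in_dim, 3 * hid_dim)
        self.U_iou = nn.Linear(hid_dim, 3 * hid_dim, bias=False)
        self.W_f = nn.Linear(in_dim, hid_dim)
        self.U_f = nn.Linear(hid_dim, hid_dim, bias=False)

    def forward(self, x, child_h, child_c):
        # x: (in_dim,) node feature; child_h, child_c: (num_children, hid_dim)
        h_sum = child_h.sum(dim=0)
        i, o, u = torch.chunk(self.W_iou(x) + self.U_iou(h_sum), 3, dim=-1)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.W_f(x).unsqueeze(0) + self.U_f(child_h))  # one forget gate per child
        c = i * u + (f * child_c).sum(dim=0)
        h = o * torch.tanh(c)
        return h, c
```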
Integration yields notable empirical improvements: CIDEr-D increases from 120.1% to 127.2%, further enhanced to 130.6% with graph convolutional networks (GCNs). The GCN augments region/instance features by modeling semantic relations across nodes, amplifying descriptive accuracy. HIP generalizes as a plug-and-play feature refiner for arbitrary captioning models, illustrating the effectiveness of hierarchical visual parsing.
3. Tree-Structured Prototype Networks and Hierarchical Semantic Alignment
The Progressive Tree-Structured Prototype Network (PTSN) (Zeng et al., 2022) extends tree-based principles to transformer-based, end-to-end captioning systems. Instead of isolated concept embeddings, PTSN models the semantic landscape as a tree of prototypes:
- Tree-Structured Prototype Embeddings (TSP): Words are clustered hierarchically (e.g., via k-means), producing coarse-to-fine semantic prototypes (Z₁, Z₂, …, Z_L) that capture relations among concept words.
- Progressive Aggregation (PA): Grid features from vision transformers (e.g., Swin Transformer) are progressively fused with tree-level prototypes through stacked Cross-modal Multi-head Attention (CMA) blocks. Features are refined from global (coarse) to local (fine), ensuring alignment with textual hierarchy.
This enables contextually guided word prediction: the tree structure restricts candidate words according to semantic proximity, reducing error propagation from misaligned concept grids. PTSN achieves state-of-the-art CIDEr scores (144.2% single-model; 146.5% ensemble) on the MS-COCO Karpathy split, outperforming two-stage and concept-based models due to effective semantic alignment.
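The tree-structured prototype idea can be approximated by repeatedly clustering word embeddings into progressively finer levels. The sketch below uses scikit-learn k-means with illustrative level sizes and a stand-in vocabulary; it is not PTSN's exact procedure:

```python
# Hedged sketch: coarse-to-fine prototype levels Z_1..Z_L via repeated k-means.
import numpy as np
from sklearn.cluster import KMeans

def build_prototype_tree(word_embeddings, level_sizes=(16, 64, 256)):
    """Return a list of prototype matrices, coarsest level first."""
    levels = []
    for k in level_sizes:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(word_embeddings)
        levels.append(km.cluster_centers_)      # (k, dim) prototypes for this level
    return levels

# Usage with random stand-in embeddings for a 1000-word concept vocabulary.
vocab = np.random.randn(1000, 300).astype(np.float32)
Z = build_prototype_tree(vocab)
print([z.shape for z in Z])                     # [(16, 300), (64, 300), (256, 300)]
```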
4. Information-Theoretic Foundations and Recursive Caption Merging
A formal information-theoretic framework further conceptualizes captioning as encoding in a latent semantic space (Chen et al., 2024). Good captions are defined by balancing three objectives, formalized below:
- Task Sufficiency: Maximizing mutual information between captions and important image semantic units.
- Minimal Redundancy: Reducing caption entropy (brevity).
- Human Interpretability: Minimizing statistical divergence from natural language.
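Under notation assumed here (caption C, image X, semantic units S, entropy H, and a divergence D from a natural-language prior p_text), the three criteria can be read as a single objective of roughly the form

$$
\max_{p(C \mid X)} \; \underbrace{I(C; S)}_{\text{sufficiency}} \;-\; \lambda_1 \underbrace{H(C)}_{\text{redundancy}} \;-\; \lambda_2 \underbrace{D\!\left(p(C) \,\|\, p_{\text{text}}\right)}_{\text{interpretability}},
$$

where λ₁, λ₂ ≥ 0 weight brevity and fluency against informativeness; the paper's exact symbols and weighting may differ.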
The Pyramid of Captions (PoCa) method actualizes a "tree of captions" by integrating local and global information. The process entails:
- Splitting the image into patches and generating separate local captions.
- Generating a global caption for the whole image.
- Merging local and global captions via a text-only LLM, with theoretical guarantees that the combined semantic error is no larger than that of the global caption alone (via concavity and Jensen’s inequality).
This recursive merging mechanism—splitting, captioning, and reassembling—can be viewed as forming a tree where each node (caption piece) contributes to the final narrative. The system is highly adaptable, with model-agnostic tuning across diverse datasets and architectures, and empirical results demonstrate reduced semantic error and increased descriptive precision.
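A model-agnostic sketch of this recursive split-caption-merge loop is given below; caption_model, merge_llm, and split are placeholder callables rather than a specific API:

```python
# Illustrative sketch of PoCa-style hierarchical captioning.
from typing import Callable, List

def pyramid_caption(image, caption_model: Callable, merge_llm: Callable,
                    split: Callable, depth: int = 1) -> str:
    """Recursively caption patches and fold them into the global caption."""
    global_caption = caption_model(image)
    if depth == 0:
        return global_caption
    local_captions: List[str] = [
        pyramid_caption(patch, caption_model, merge_llm, split, depth - 1)
        for patch in split(image)            # e.g. a 2x2 grid of patches
    ]
    # The text-only LLM reconciles local detail with the global description;
    # per the source, the merged semantic error is bounded by that of the global caption.
    return merge_llm(global_caption, local_captions)
```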
5. Procedural and Domain-Specific Tree Mapping: L-System Captioning
"Tree of Captions" methodologies are also applied beyond natural images in purely structural contexts such as plant modeling (Magnusson et al., 2023). The L-System Captioning approach reframes tree reconstruction as a sequence-to-sequence mapping:
- Image ➔ CNN Encoder ➔ LSTM Decoder: An input image of tree topology is described as an L-System word, a string generated by a formal grammar G = (V, ω, P, π, δ, f) that encodes the tree structure, where V includes the turtle-graphics symbols {F, +, -, [, ]} and the productions define branching.
- Heuristic Constraints: Syntactic rules (no cancelling rotations, no empty brackets) yield clean bidirectional mappings.
- End-to-End Training: A synthetic dataset is generated and the sequence decoder is optimized with cross-entropy, achieving 80% syntactic correctness under these rules, with perplexity 1.129 and bits-per-character 0.403.
This approach obviates intermediate surface or point cloud reconstruction and can infer species-specific grammars from output L-System words, enabling procedural generation entirely from neural captioning.
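To make the formalism concrete, the toy sketch below derives an L-System word from an illustrative production and checks the bracket and rotation constraints; the grammar shown is an assumption for illustration, not taken from the paper:

```python
# Toy L-System derivation and constraint check over the alphabet {F, +, -, [, ]}.
def derive(axiom: str, productions: dict, steps: int) -> str:
    """Apply the production rules in parallel to every symbol, `steps` times."""
    word = axiom
    for _ in range(steps):
        word = "".join(productions.get(ch, ch) for ch in word)
    return word

def is_well_formed(word: str) -> bool:
    """Balanced brackets, no empty brackets, no directly cancelling rotations."""
    depth = 0
    for ch in word:
        depth += ch == "["
        depth -= ch == "]"
        if depth < 0:
            return False
    return depth == 0 and "[]" not in word and "+-" not in word and "-+" not in word

rules = {"F": "F[+F]F[-F]F"}                 # an illustrative branching production
word = derive("F", rules, 2)
print(word, is_well_formed(word))
```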
6. Analysis, Practical Implications, and Extensions
Collectively, "Tree of Captions" approaches capture hierarchical dependencies at multiple levels—syntactic (phrase-based), semantic (concept and prototype-based), visual (region-instance trees), and procedural (structural grammars). These models offer several practical advantages:
- Richness and Novelty: Decoupled or tree-oriented encoding injects diverse content by focusing on units (phrases, prototypes, regions) before global assembly.
- Robustness and Accuracy: Hierarchical modeling mitigates error accumulation and aligns local detail with global context.
- Flexibility and Pluggability: Architectures such as HIP and PoCa integrate into existing neural captioning models, enhancing descriptive performance without structural redesign.
- Theoretical Justification: Formal frameworks provide quantifiable objectives and error guarantees, informing the optimal trade-off between brevity, informativeness, and fluency.
- Domain Adaptation: Hierarchical captioning generalizes to specialized domains (scientific figure description, plant modeling).
These contributions establish the "Tree of Captions" paradigm as a pivotal framework for advancing the expressiveness, precision, and semantic richness of image captioning systems, with broad implications for multimodal understanding, accessibility, and scientific communication.
Table: Key Dimensions of "Tree of Captions" Approaches
| Model / Framework | Hierarchy Type | Primary Decoding Elements |
|---|---|---|
| phi-LSTM (Tan et al., 2017) | Phrase/Sentence | Noun Phrases, Sentence |
| HIP (Yao et al., 2019) | Visual Instance/Region/Image | Feature Tree Nodes |
| PTSN (Zeng et al., 2022) | Semantic Prototypes | Hierarchical Concept Clusters |
| PoCa (Chen et al., 2024) | Global/Local Recursive | Caption Patches, LLM Fusion |
| L-System Captioning (Magnusson et al., 2023) | Procedural Grammar | L-System Word (Structure) |
This organization clarifies the structural diversity of tree-based captioning systems and their respective operational domains.