
COCO-Tree: Neurosymbolic Compositional Reasoning for VLMs

Updated 20 October 2025
  • COCO-Tree is a compositional reasoning framework that enhances vision-language models by structuring multi-object and relational semantics explicitly.
  • It integrates LLM-generated concept trees with beam search to overcome challenges like attribute binding and role misinterpretation.
  • Benchmark results show 5–10% improvement in compositionality on tasks such as Winoground and EqBench, while providing interpretable reasoning paths.

COCO-Tree refers to a compositional reasoning framework for vision-language models (VLMs), introduced to address longstanding deficits in VLMs regarding the systematic understanding of multi-object, multi-attribute, and relational semantics within images. Whereas conventional VLMs excel at recognizing object identities and generating captions, they frequently fail on tasks that require nuanced comprehension of composite concepts—such as relational order, attribute binding, and multi-object interactions. COCO-Tree is distinguished by its integration of a hierarchical neurosymbolic reasoning process, instantiated as a “concept tree” learned from LLMs, and a fusion mechanism that adaptively combines linguistic and visual evidence scores for interpretable prediction and improved compositional generalization (Sinha et al., 13 Oct 2025).

1. Problem Statement and Motivation

Modern open-source VLMs (e.g., LLaVA, InstructBLIP, Qwen, InternVL) show persistent weaknesses in compositional reasoning. This manifests as misinterpretation of role reversals (e.g., subject/object confusion), misbinding of attributes (e.g., color assignment errors), and failures to reason over relational structure. Existing remedies—such as improved prompting, chain of thought, and post-hoc LLM enhancement—are often resource-intensive or lack interpretable traces. COCO-Tree is designed to overcome these limitations, providing both reasoning gains and explicit rationales for predictions by marrying VLMs’ image grounding with LLMs’ linguistic knowledge.

2. System Architecture and Core Algorithms

COCO-Tree comprises two principal computational stages: standard vision-language inference (System-1), supplemented by a hierarchical neurosymbolic reasoning process (System-2). The workflow proceeds as follows, with a schematic sketch after the list:

  • System-1 (VLM Inference): For each image-caption pair $(I, C)$, the base VLM computes an initial alignment score $f(I, C)$ encoding how well the visual content matches the textual input.
  • System-2 (Neurosymbolic Reasoning):
  1. Semantic Morphological Decomposition (SMD): An LLM segments $C$ into a set of entities $E$ via the function $F_{SMD}$, extracting functional units (subjects, objects, relational phrases).
  2. Recursive Concept Exploration (RCE): For each entity $e \in E$, the LLM generates candidate binary visual concepts through $F_{RCE}(n, C, S)$, constructing the concept tree $T_C$ recursively to depth $L$ (with splitting factor $S$).
  3. Composite Vision-Language Scoring: Every node $n^l$ at tree level $l$ receives $C_S(n^l) = \alpha\, L_S(n^l, C) + (1-\alpha)\, V_S(I, n^l)$, where $L_S$ (linguistic relevance) is derived from LLM entailment scoring and $V_S$ (visual relevance) comes from the VLM output. The coefficient $\alpha$ controls modality mixing (typically 0.5–0.6).
  • Dynamic Path Selection: Reasoning paths $p = \{e, n^1, \ldots, n^l\}$ are scored and selected by two procedures:
    • SRCH_max (greedy, selecting the best child by $C_S$)
    • SRCH_Beam (beam search, maintaining the top-$k$ candidate paths at each level and ultimately picking the path with the best aggregate score)
  • Final Prediction Fusion: The result is computed as $O = \beta\, f(I, C) + (1-\beta)\, W_p$, where $W_p$ is the score along the selected concept reasoning path and $\beta$ tunes the influence of the VLM relative to the concept tree.
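
The composite scoring and fusion equations above can be summarized in a short sketch. The code below is an illustrative outline, not the authors' implementation; the scorer callables, node structure, and default values for $\alpha$ and $\beta$ are assumptions made for exposition.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical scoring callables; in practice these would wrap an LLM
# entailment prompt and a VLM alignment score.
LinguisticScorer = Callable[[str, str], float]   # (concept, caption) -> L_S
VisualScorer = Callable[[object, str], float]    # (image, concept)  -> V_S

@dataclass
class ConceptNode:
    concept: str                       # binary visual concept, e.g. "snake near bird's beak"
    children: List["ConceptNode"] = field(default_factory=list)
    score: float = 0.0                 # composite score C_S

def composite_score(node: ConceptNode, image, caption: str,
                    L_S: LinguisticScorer, V_S: VisualScorer,
                    alpha: float = 0.55) -> float:
    """C_S(n) = alpha * L_S(n, C) + (1 - alpha) * V_S(I, n)."""
    return alpha * L_S(node.concept, caption) + (1 - alpha) * V_S(image, node.concept)

def fuse_prediction(vlm_alignment: float, path_score: float, beta: float = 0.5) -> float:
    """O = beta * f(I, C) + (1 - beta) * W_p, blending System-1 and System-2 evidence."""
    return beta * vlm_alignment + (1 - beta) * path_score
```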

3. Beam Search-Enhanced Reasoning

COCO-Tree’s reasoning is notably structured around a beam search-inspired mechanism. Rather than committing to the locally optimal child at each decision node (the greedy strategy), SRCH_Beam explores multiple high-scoring paths through the tree simultaneously. This mitigates errors from early, non-global choices and enables the model to handle ambiguity in composite captions more robustly. Beam width and tree depth are tunable parameters. Empirical benchmarks confirm that beam search variants of COCO-Tree offer consistent 5–10% compositionality performance gains over greedy or vanilla VQAScore approaches.
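
As a rough illustration of the SRCH_Beam idea, the sketch below expands a concept tree level by level and retains the top-$k$ partial paths by composite score. It reuses the ConceptNode type from the sketch in Section 2; the mean-based path aggregation is an assumption for illustration, not necessarily the paper's exact aggregate scorer.

```python
from typing import List

def beam_search_paths(root: "ConceptNode", beam_width: int = 3,
                      max_depth: int = 3) -> List[List["ConceptNode"]]:
    """Expand the concept tree level by level, keeping the top-k partial paths
    ranked by an aggregate of their composite scores (mean, as an illustrative choice)."""
    beams: List[List["ConceptNode"]] = [[root]]
    for _ in range(max_depth):
        candidates: List[List["ConceptNode"]] = []
        for path in beams:
            leaf = path[-1]
            if not leaf.children:            # keep completed paths in the candidate pool
                candidates.append(path)
                continue
            for child in leaf.children:      # extend the path with each child concept
                candidates.append(path + [child])
        # Rank by aggregate path score W_p and retain only the top beam_width paths.
        candidates.sort(key=lambda p: sum(n.score for n in p) / len(p), reverse=True)
        beams = candidates[:beam_width]
    return beams
```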

4. Benchmark Evaluation and Results

COCO-Tree is evaluated on four compositionality benchmarks:

  • Winoground: Tasks that require differentiating reversed subject-object relationships in nearly identical images.
  • EqBench: Sensitive to subtle semantic distinctions (action, location, role).
  • ColorSwap: Assesses binding between adjectives and the correct referents subject to word order.
  • SugarCrepe: Focuses on adversarially constructed captions for fine-grained reasoning.

Experiments across seven open-source VLMs (LLaVA-1.5, LLaVA-1.6, Qwen-7B, InternVL-8B, InstructBLIP-XXL, among others) demonstrate consistent improvements: compositionality group/text/image/overall scores increase by 5–10 percentage points, verified by Wilcoxon signed-rank significance testing ($p < 0.01$ for several models), particularly for complex multi-object binding and relational tasks.
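
For readers reproducing such comparisons, a paired significance test of this kind can be computed from per-example scores; the snippet below is a generic illustration using scipy.stats.wilcoxon with hypothetical values, not the authors' evaluation code.

```python
from scipy.stats import wilcoxon

# Paired per-example compositionality scores for a single VLM,
# with and without COCO-Tree (hypothetical example values).
baseline_scores = [0.41, 0.55, 0.38, 0.62, 0.47, 0.50, 0.44, 0.58]
cocotree_scores = [0.49, 0.61, 0.45, 0.66, 0.53, 0.57, 0.46, 0.64]

# One-sided test: does COCO-Tree yield systematically higher scores?
stat, p_value = wilcoxon(cocotree_scores, baseline_scores, alternative="greater")
print(f"Wilcoxon statistic={stat:.3f}, p={p_value:.4f}")
```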

5. Interpretability and Rationale Generation

A salient feature is the explicit interpretability of COCO-Tree’s reasoning path. Each concept tree node represents a linguistically transparent entity or relation (e.g., “snake near bird’s beak”, “consuming a snake”). Logical operation semantics are preserved: nodes can be conjunctive (AND rules) or disjunctive (OR rules). Reasoning paths (e.g., “consuming a snake” AND “snake near bird’s beak” $\Rightarrow$ “bird eats snake”) deliver rationale explanations, which GPT-4 and human annotators validate for plausibility. This supports human-understandable verification and debugging.
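
One way to make these logical semantics concrete is to represent each node with an explicit connective and evaluate a rationale bottom-up. The sketch below is a hypothetical representation, not the paper's data structure; the node fields and the `holds` helper are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RationaleNode:
    concept: str                       # human-readable concept, e.g. "snake near bird's beak"
    op: str = "AND"                    # "AND" (conjunctive) or "OR" (disjunctive)
    satisfied: bool = True             # whether the concept was judged visually present
    children: List["RationaleNode"] = field(default_factory=list)

def holds(node: RationaleNode) -> bool:
    """A leaf holds iff its concept was verified; an internal node
    combines its children with its logical connective."""
    if not node.children:
        return node.satisfied
    child_truths = [holds(c) for c in node.children]
    return all(child_truths) if node.op == "AND" else any(child_truths)

# Example rationale: "consuming a snake" AND "snake near bird's beak" => "bird eats snake"
rationale = RationaleNode(
    concept="bird eats snake", op="AND",
    children=[RationaleNode("consuming a snake"), RationaleNode("snake near bird's beak")],
)
print(holds(rationale))  # True if both supporting concepts were verified
```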

6. Limitations and Comparative Analysis

COCO-Tree is compared with prior methods including scene graph extraction (CCoT) and LLM-based post-processing pipelines. Main advantages:

  • Resource-efficient: avoids exhaustive LLM inference or complex external modules.
  • Hierarchical, not flat: enables multi-level semantic composition.
  • Interpretability: explicit, inspectable reasoning traces.

Potential limitations:

  • Increased computation at inference due to tree construction and scoring across multiple candidate paths.
  • Occasional hallucination when the LLM introduces concept nodes that are not visually present; this is mitigated by the balanced composite scoring.
  • Dependency on LLM and VLM calibration parameters ($\alpha$, $\beta$, beam width).

7. Extensions and Future Directions

Research directions include:

  • Optimizing neurosymbolic structures and fusion strategies to further enhance compositional performance and reduce runtime.
  • Extending to broader VLM domains (e.g., safety-critical image analysis, industrial inspection, medical imaging).
  • Dynamic adaptation of beam width and selection heuristics based on input complexity.
  • Robustness and generalization studies with more diverse multimodal datasets.
  • Leveraging future generations of LLMs for richer, more accurate morphological decomposition and concept mining.

COCO-Tree establishes a two-stage, neurosymbolic framework for compositional reasoning in vision-language models, enabling interpretable, hierarchical reasoning and demonstrable gains on key multimodal benchmarks. Through concept tree construction, composite scoring, and dynamic beam search-based path selection, COCO-Tree sets a paradigm for resource-efficient, explainable, and compositional vision-language understanding (Sinha et al., 13 Oct 2025).
