Question-Guided Chain-of-Captions (QG-CoC)

Updated 10 November 2025
  • The paper introduces a novel pipeline that decomposes complex queries into targeted sub-questions, generating localized captions to improve multi-image reasoning.
  • The methodology integrates question decomposition, sequential caption generation, and answer synthesis using MLLMs to maintain chain continuity.
  • Empirical evaluations demonstrate significant accuracy gains on benchmarks like MUIR and MMIU, validating the effectiveness of structured caption chaining.

Question-Guided Chain-of-Captions (QG-CoC) is a class of methods for multimodal reasoning that orchestrates a structured, question-driven chain of localized captions to facilitate fine-grained perception and inference, particularly in multi-image input regimes for Multimodal LLMs (MLLMs). QG-CoC frameworks systematically decompose complex queries into targeted sub-questions, use these as guides to elicit focused visual captions from each input (image or region), and then integrate these captions and answers in a multi-stage reasoning process. This approach unifies and extends principles from question-guided captioning, chain-of-thought (CoT) prompting, and visual question answering—offering marked improvements over traditional captioning and single-image CoT strategies, especially for tasks involving multi-image synthesis, comparison, and detailed visual reasoning (Kao et al., 5 Nov 2025, Uehara et al., 2024).

1. Motivation and Conceptual Foundations

The challenge addressed by QG-CoC is twofold: the need for (1) fine-grained, task-relevant perception across multiple, disparate images and (2) structured, explicit integration of visual clues for multi-step reasoning. State-of-the-art MLLMs, such as LLaVA, Qwen-VL, GPT-4o, and Gemini-1.5, while highly performant at single-image perception and language understanding, exhibit deficiencies when required to: (a) extract localized details (counting, object identity, spatial features) across more than one image, and (b) integrate these details into holistic multi-image logic chains for complex queries. Existing methods that apply CoT or naive captioning per image often lose critical information, whether through over-conciseness, over-generalization, or failure to propagate relevant cues between images and sub-tasks. QG-CoC was developed to address these limitations by enforcing a decomposition-caption-integration pipeline, structurally aligning each step with a sub-aspect of the original query (Kao et al., 5 Nov 2025).

2. Formal Methodology and Pipeline

Given a set of images \{I_1, \ldots, I_n\} and a user question Q, QG-CoC implements the following pipeline:

  1. Question Decomposition

    • The input question Q is decomposed by prompting the MLLM:

    (q_1, \ldots, q_m) = \mathrm{Decompose}(Q)

  • Sub-questions correspond to sub-aspects (object, relation, attribute, action) necessary for answering Q.
  2. Question-Guided Caption Generation (Chain Construction)

    • For each sub-question q_j and each image I_k:

    c_{j,k} = \mathrm{CaptionModel}(I_k, q_j, c_{j,1:k-1})

  • c_{j,1} is conditioned on (I_1, q_j); for k > 1, c_{j,k} is additionally conditioned on the prior captions for that sub-question, i.e. c_{j,1:k-1}, enforcing chain continuity.
  • The process is repeated for all sub-questions.
  3. Sub-Question Answering and Integration

    • For each q_j, the MLLM is prompted:

    "Sub-question: qj; Captions: cj,1,...,cj,n. Provide a concise answer."\text{"Sub-question: }q_j\text{; Captions: }c_{j,1},...,c_{j,n}\text{. Provide a concise answer."}

    yielding answers a_j.
    • Final answer synthesis invokes:

    "Given: 1. q1a1,... qmam, answer the original question Q."\text{"Given: 1. }q_1 \rightarrow a_1, ...\text{ }q_m \rightarrow a_m\text{, answer the original question }Q."

This pipeline is executed strictly zero-shot (no parameter updates), with decoding parameters: temperature = 0, context window ≤ 2048 tokens, and no specialized tokens or finetuning (Kao et al., 5 Nov 2025).
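Putting the three stages together, the end-to-end computation can be summarized in one display. This is a restatement of the pipeline above, with \mathrm{Answer} and \mathrm{Synthesize} used as shorthand labels for the two answer-generation prompts rather than operators named in the paper:

\begin{align*}
(q_1, \ldots, q_m) &= \mathrm{Decompose}(Q) \\
c_{j,k} &= \mathrm{CaptionModel}(I_k, q_j, c_{j,1:k-1}), \qquad j = 1, \ldots, m, \; k = 1, \ldots, n \\
a_j &= \mathrm{Answer}(q_j, c_{j,1:n}) \\
\hat{a} &= \mathrm{Synthesize}\big(Q, \{(q_j, a_j)\}_{j=1}^{m}\big)
\end{align*}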

3. Core Algorithms and Model Architecture

The canonical QG-CoC method is conceptualized as an orchestration of text prompts—no specialized model architecture is required. However, related implementations (such as (Uehara et al., 2024)) incorporate architectural innovations for single-image chains:

  • Image Backbone: Pretrained vision encoder (e.g., CLIP ViT-Large) for global/regional features. In dual-input designs, both global image and masked RoI representations are encoded.
  • Q-Former Adapter: A trainable Transformer adapter, receiving image embeddings and supplying "query tokens" that encourage region-aware grounding, mirroring BLIP-2.
  • Text Decoder: LLM (e.g., LLaMA-2-chat-13B), receiving both Q-Former outputs and projected image features. Generates sequences including reasoning steps, uncertainty scalars, and, when needed, explicit question/answer pairs as part of the chain.
  • Formal Generation Factorization:

\begin{align*}
P(C_{1:T}, Q_{1:T-1}, A_{1:T-1} \mid I) ={}& P(C_1 \mid I) \prod_{i=1}^{T-1} \Big[ P(Q_i \mid I, C_{1:i}) \cdot P(A_i \mid I, C_{1:i}, Q_i) \\
& \qquad \cdot P(C_{i+1} \mid I, C_{1:i}, Q_{1:i}, A_{1:i}) \Big]
\end{align*}

where C_i are captions ("reasoning steps"), Q_i are generated questions, and A_i their answers.
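This factorization corresponds to an interleaved generation loop: produce an initial caption, then repeatedly generate a question, answer it, and emit the next reasoning step conditioned on everything so far. A minimal Python sketch of that loop follows; the generate wrapper, prompt wording, and fixed chain length T are illustrative assumptions, not the authors' implementation.

def chain_of_captions(image, T, generate):
    # `generate(prompt, image)` is an assumed wrapper returning one decoded string from the MLLM.
    captions = [generate("Give a first grounded reasoning step (caption) for this image.", image)]  # C_1 ~ P(C_1 | I)
    questions, answers = [], []
    for i in range(1, T):
        context = " ".join(captions)
        q_i = generate(f"Reasoning so far: {context} Ask the next clarifying question.", image)            # Q_i
        a_i = generate(f"Reasoning so far: {context} Question: {q_i} Answer it from the image.", image)    # A_i
        c_next = generate(f"Reasoning so far: {context} Q/A: {q_i} -> {a_i} State the next step.", image)  # C_{i+1}
        questions.append(q_i)
        answers.append(a_i)
        captions.append(c_next)
    return captions, questions, answers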

4. Prompt Engineering and Inference Procedure

All prompt templates leverage zero-shot instructions with deterministic decoding. The inference process (for multi-image QG-CoC) is as follows:

# Stage 1: decompose the original question Q into sub-questions (zero-shot prompt).
prompt1 = f"You are given a question: '{Q}'. Decompose it into a numbered list of clear sub-questions..."
sub_questions = MLLM(prompt1)  # parsed into a list [q_1, ..., q_m]
# Stage 2: question-guided caption chain; one caption per (sub-question, image) pair.
# (Prior captions c_{j,1:k-1} can be appended to the prompt to enforce chain continuity.)
captions = {}
for j, q_j in enumerate(sub_questions, 1):
    for k in range(1, n + 1):
        prompt2 = f"Here is Image {k}: <Image {k}>. Sub-question: '{q_j}'. Provide a detailed caption..."
        captions[(j, k)] = MLLM(prompt2)
# Stage 3: answer each sub-question from its chain of captions c_{j,1:n}.
answers = []
for j, q_j in enumerate(sub_questions, 1):
    chain = "; ".join(captions[(j, k)] for k in range(1, n + 1))
    prompt3 = f"Sub-question: '{q_j}'. Captions: {chain}. Provide a concise answer."
    answers.append(MLLM(prompt3))
# Stage 4: synthesize the final answer from all (sub-question, answer) pairs.
qa_pairs = " ".join(f"{j}. {q} -> {a}" for j, (q, a) in enumerate(zip(sub_questions, answers), 1))
prompt4 = f"Sub-questions and answers: {qa_pairs}. Based on these, answer the original question: '{Q}'."
final_answer = MLLM(prompt4)
All models use do_sample=False, temperature=0, and a maximum context of 2048 tokens (Kao et al., 5 Nov 2025).
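As an illustration only, the MLLM callable used above could be backed by a Hugging Face vision-language checkpoint with greedy decoding. The checkpoint name and exact prompt formatting below are placeholders and vary by model, so treat this as a schematic rather than the paper's setup.

import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "org/multimodal-checkpoint"  # placeholder: any chat-capable MLLM with an HF processor
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, torch_dtype=torch.float16, device_map="auto")

def MLLM(prompt, images=None):
    # Greedy (deterministic) decoding, matching do_sample=False / temperature=0 above.
    inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=256)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]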

5. Empirical Performance and Benchmarking

QG-CoC has been evaluated across multi-image benchmarks (MUIR, MMIU, MuirBench) and single-image generalization tasks (MMMU, MMBench, ScienceQA). The principal metric is answer accuracy (% correct; not BLEU/CIDEr). Main quantitative findings include (Kao et al., 5 Nov 2025):

Model          Method        MUIR    MMIU    ScienceQA    MMMU    MMBench
LLaVA-OV       w/o prompt    41.2    44.6    94.5         45.4    85.1
LLaVA-OV       QG-CoC        53.3    50.9    94.5         48.9    87.6
Qwen-2.5-VL    w/o prompt    62.1    50.3    90.2         58.2    88.2
Qwen-2.5-VL    QG-CoC        65.3    56.9    91.9         64.8    89.4
GPT-4o         w/o prompt    70.8    63.3    89.5         63.1    86.0
GPT-4o         QG-CoC        74.9    65.8    90.3         66.7    88.9

Ablations demonstrate that incremental gains are attributable to both the decomposition and targeted captioning phases, with cumulative improvement in challenging multi-image settings (e.g., up to +12 points on MUIR for LLaVA-OV) (Kao et al., 5 Nov 2025). Error analysis of 120 cases showed errors distributed across misunderstanding of the decomposed sub-tasks (33.3%), perception failures (31.7%), and reasoning mistakes after caption extraction (35.0%).

6. Comparative Analysis and Limitations

QG-CoC contrasts with, and empirically outperforms, naïve per-image captioning, single-image CoT, and prior "chain-of-captions" variants that do not enforce question-guided focus or proper chaining. Notable limitations include:

  • Scaling bottlenecks: The approach requires chained prompts for every (sub-question, image) pair and can strain the context window for large n or m (see the call-count sketch after this list).
  • Model dependency: Reliance on the MLLM's captioning and reasoning abilities; subpar MLLMs can diminish the method's advantage (Kao et al., 5 Nov 2025).
  • Explicit knowledge integration: The QG-CoC pipeline remains "model-agnostic" and does not integrate external reasoning tools or explicit spatial/geometric modules—future research directions proposed include hybridizing with tool augmentation and mixed-modal streams.
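To make the scaling bottleneck concrete, the number of MLLM calls per query grows linearly in both the number of sub-questions m and the number of images n. A back-of-the-envelope count, assuming one call per prompt in the pipeline of Section 4:

def num_mllm_calls(m, n):
    # 1 decomposition call + m*n captioning calls + m sub-answer calls + 1 synthesis call
    return 1 + m * n + m + 1

print(num_mllm_calls(m=4, n=6))   # 30 calls for 4 sub-questions over 6 images
print(num_mllm_calls(m=8, n=12))  # 106 calls; chained prompts quickly strain latency and context budgets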

7. Applications and Illustrative Example

QG-CoC is designed primarily for multi-image reasoning tasks such as comparison, temporal/spatial synthesis, and detailed scene understanding. A representative example involves tabular image comparison for entity matching (Kao et al., 5 Nov 2025): when asked for the affiliations of authors shown in three images, naïve captioning yielded vague summaries, whereas QG-CoC decomposed the question by row and entity, producing pointed captions (e.g., "Row 1: Author 'Xu' is from 'Stanford'; Author 'Lee' is from 'MIT'."), directly enabling correct matching in the final answer. For single-image tasks, QG-CoC retains or slightly improves performance over baseline CoT or captioning methods (Kao et al., 5 Nov 2025, Uehara et al., 2024).


QG-CoC establishes a structured, question-driven pipeline for fine-grained multimodal reasoning, empirically validated by accuracy gains on multi-image tasks across both open-source and proprietary MLLMs. It directly addresses perception-reasoning integration deficits in current models and serves as a robust foundation for future multimodal research spanning images, text, and other modalities (Kao et al., 5 Nov 2025, Uehara et al., 2024).
