Question-Guided Chain-of-Captions (QG-CoC)
- The paper introduces a novel pipeline that decomposes complex queries into targeted sub-questions, generating localized captions to improve multi-image reasoning.
- The methodology integrates question decomposition, sequential caption generation, and answer synthesis using MLLMs to maintain chain continuity.
- Empirical evaluations demonstrate significant accuracy gains on benchmarks like MUIR and MMIU, validating the effectiveness of structured caption chaining.
Question-Guided Chain-of-Captions (QG-CoC) is a class of methods for multimodal reasoning that orchestrates a structured, question-driven chain of localized captions to facilitate fine-grained perception and inference, particularly in multi-image input regimes for Multimodal LLMs (MLLMs). QG-CoC frameworks systematically decompose complex queries into targeted sub-questions, use these as guides to elicit focused visual captions from each input (image or region), and then integrate these captions and answers in a multi-stage reasoning process. This approach unifies and extends principles from question-guided captioning, chain-of-thought (CoT) prompting, and visual question answering—offering marked improvements over traditional captioning and single-image CoT strategies, especially for tasks involving multi-image synthesis, comparison, and detailed visual reasoning (Kao et al., 5 Nov 2025, Uehara et al., 2024).
1. Motivation and Conceptual Foundations
The challenge addressed by QG-CoC is twofold: the need for (1) fine-grained, task-relevant perception across multiple, disparate images and (2) structured, explicit integration of visual clues for multi-step reasoning. State-of-the-art MLLMs such as LLaVA, Qwen-VL, GPT-4o, and Gemini-1.5, while highly performant at single-image perception and language understanding, exhibit deficiencies when required to (a) extract localized details (counting, object identity, spatial features) across more than one image, and (b) integrate these details into holistic multi-image logic chains for complex queries. Existing methods that apply CoT or naive captioning per image often collapse critical information, whether through over-conciseness, over-generalization, or a failure to propagate relevant cues between images and sub-tasks. QG-CoC was developed to address these limitations by enforcing a decomposition-caption-integration pipeline, structurally aligning each step with a sub-aspect of the original query (Kao et al., 5 Nov 2025).
2. Formal Methodology and Pipeline
Given a set of images $\{I_1, \dots, I_n\}$ and a user question $Q$, QG-CoC implements the following pipeline:
- Question Decomposition
  - The input question is decomposed by prompting the MLLM: $Q \rightarrow \{q_1, \dots, q_m\}$.
  - Sub-questions $q_j$ correspond to sub-aspects (object, relation, attribute, action) necessary for answering $Q$.
- Question-Guided Caption Generation (Chain Construction)
  - For each sub-question $q_j$ and for each image $I_k$, a localized caption $c_{j,k}$ is generated.
  - $c_{j,1}$ is conditioned on $(q_j, I_1)$; for $k > 1$, $c_{j,k}$ is additionally conditioned on the prior captions for that sub-question, i.e. $c_{j,k} = \mathrm{MLLM}(q_j, I_k, c_{j,1}, \dots, c_{j,k-1})$, enforcing chain continuity.
  - The process is repeated for all sub-questions.
- Sub-Question Answering and Integration
  - For each $q_j$, the MLLM is prompted with the sub-question and its caption chain, $a_j = \mathrm{MLLM}(q_j, c_{j,1}, \dots, c_{j,n})$, yielding answers $a_1, \dots, a_m$.
  - Final answer synthesis invokes $\hat{a} = \mathrm{MLLM}(Q, \{(q_j, a_j)\}_{j=1}^{m})$.

This pipeline is executed strictly zero-shot (no parameter updates), with sampling parameters temperature $=0$, a context window of 2,048 tokens, and no specialized tokens or finetuning (Kao et al., 5 Nov 2025).
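One consequence of this structure is that the number of MLLM calls grows with both the number of sub-questions $m$ and the number of images $n$. Counting one decomposition call, one caption call per (sub-question, image) pair, one answer call per sub-question, and one synthesis call gives

$$N_{\text{calls}} = 1 + m \cdot n + m + 1 = mn + m + 2,$$

which is the source of the scaling bottleneck discussed in Section 6.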
3. Core Algorithms and Model Architecture
The canonical QG-CoC method is conceptualized as an orchestration of text prompts—no specialized model architecture is required. However, related implementations (such as (Uehara et al., 2024)) incorporate architectural innovations for single-image chains:
- Image Backbone: Pretrained vision encoder (e.g., CLIP ViT-Large) for global/regional features. In dual-input designs, both global image and masked RoI representations are encoded.
- Q-Former Adapter: A trainable Transformer adapter, receiving image embeddings and supplying "query tokens" that encourage region-aware grounding, mirroring BLIP-2.
- Text Decoder: LLM (e.g., LLaMA-2-chat-13B), receiving both Q-Former outputs and projected image features. Generates sequences including reasoning steps, uncertainty scalars, and, when needed, explicit question/answer pairs as part of the chain.
- Formal Generation Factorization: the chain of captions, generated questions, and answers is produced autoregressively,

$$p(c_{1:T}, q_{1:T}, a_{1:T} \mid I) = \prod_{t=1}^{T} p(c_t, q_t, a_t \mid I, c_{<t}, q_{<t}, a_{<t}),$$

where $c_t$ are captions ("reasoning steps"), $q_t$ are generated questions, and $a_t$ their answers.
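To make the architectural description above concrete, the following is a minimal schematic sketch in PyTorch. The module names, dimensions, and the use of a single cross-attention layer as a stand-in Q-Former are illustrative assumptions, not the released implementation of (Uehara et al., 2024); only the overall data flow (frozen vision encoder, trainable adapter, projection into the LLM embedding space) follows the description above.

```python
import torch
import torch.nn as nn

class QFormerAdapter(nn.Module):
    """Stand-in Q-Former: learnable query tokens cross-attend to image features (BLIP-2 style)."""
    def __init__(self, num_queries=32, dim=1024, num_heads=8):
        super().__init__()
        self.query_tokens = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, image_feats):                        # image_feats: (B, N_patches, dim)
        queries = self.query_tokens.expand(image_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(queries, image_feats, image_feats)
        return attended + self.ffn(attended)               # (B, num_queries, dim) region-aware query tokens

class ChainCaptionEncoder(nn.Module):
    """Encodes global image + masked-RoI features into a visual prefix for the text decoder."""
    def __init__(self, vision_encoder, vision_dim=1024, llm_dim=5120):
        super().__init__()
        self.vision_encoder = vision_encoder               # e.g., a frozen CLIP ViT-Large (hidden size 1024)
        self.qformer = QFormerAdapter(dim=vision_dim)      # trainable adapter
        self.proj = nn.Linear(vision_dim, llm_dim)         # project into the LLM embedding space (5120 for LLaMA-2-13B)

    def forward(self, image, roi_mask):
        global_feats = self.vision_encoder(image)          # (B, N, vision_dim) global stream
        roi_feats = self.vision_encoder(image * roi_mask)  # masked region-of-interest stream (mask assumed broadcastable)
        fused = torch.cat([global_feats, roi_feats], dim=1)
        prefix = self.proj(self.qformer(fused))            # (B, num_queries, llm_dim)
        return prefix  # prepended to text embeddings; the LLM then decodes the caption/question/answer chain
```

Whether the two streams share one frozen encoder, as assumed here, is an implementation detail not fixed by the description above.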
4. Prompt Engineering and Inference Procedure
All prompt templates leverage zero-shot instructions with deterministic decoding. The inference process (for multi-image QG-CoC) is as follows:
```python
# Zero-shot QG-CoC inference; MLLM(...) denotes a single deterministic call to the multimodal model,
# and the <Image k> placeholders are resolved by the multimodal interface.
prompt1 = f"You are given a question: '{Q}'. Decompose it into a numbered list of clear sub-questions..."
sub_questions = MLLM(prompt1)                       # [q_1, ..., q_m]

captions = {}                                       # captions[(j, k)] = c_{j,k}
for j, q_j in enumerate(sub_questions, start=1):
    for k in range(1, n + 1):
        prompt2 = f"Here is Image {k}: <Image {k}>. Sub-question: '{q_j}'. Provide a detailed caption..."
        captions[(j, k)] = MLLM(prompt2)

answers = []
for j, q_j in enumerate(sub_questions, start=1):
    caption_list = "; ".join(captions[(j, k)] for k in range(1, n + 1))
    prompt3 = f"Sub-question: '{q_j}'. Captions: {caption_list}. Provide a concise answer."
    answers.append(MLLM(prompt3))                   # a_j

qa_pairs = "; ".join(f"{q}: {a}" for q, a in zip(sub_questions, answers))
prompt4 = f"Sub-questions and answers: {qa_pairs}. Based on these, answer the original question: '{Q}'."
final_answer = MLLM(prompt4)
```
Decoding uses `do_sample=False`, `temperature=0`, and a maximum context of 2,048 tokens (Kao et al., 5 Nov 2025).
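The `MLLM(...)` calls above can be backed by any instruction-tuned vision-language model. As one possible realization (the backbone placeholder, processor usage, and output trimming here are assumptions for illustration, not the paper's code), a thin Hugging Face-style wrapper with greedy decoding might look like:

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_NAME = "your-multimodal-checkpoint"  # illustrative placeholder; any chat-tuned MLLM could be substituted
processor = AutoProcessor.from_pretrained(MODEL_NAME)
model = AutoModelForVision2Seq.from_pretrained(MODEL_NAME, torch_dtype=torch.float16, device_map="auto")

def MLLM(prompt, images=None, max_new_tokens=256):
    """One deterministic (greedy) call, mirroring the do_sample=False / temperature=0 setting above."""
    inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    text = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
    return text  # decoder-only backbones may echo the prompt; trim it before downstream use
```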
5. Empirical Performance and Benchmarking
QG-CoC has been evaluated across multi-image benchmarks (MUIR, MMIU, MuirBench) and single-image generalization tasks (MMMU, MMBench, ScienceQA). The principal metric is answer accuracy (% correct; not BLEU/CIDEr). Main quantitative findings include (Kao et al., 5 Nov 2025):
| Model | Method | MUIR | MMIU | ScienceQA | MMMU | MMBench |
|---|---|---|---|---|---|---|
| LLaVA-OV | w/o prompt | 41.2 | 44.6 | 94.5 | 45.4 | 85.1 |
| LLaVA-OV | QG-CoC | 53.3 | 50.9 | 94.5 | 48.9 | 87.6 |
| Qwen-2.5-VL | w/o prompt | 62.1 | 50.3 | 90.2 | 58.2 | 88.2 |
| Qwen-2.5-VL | QG-CoC | 65.3 | 56.9 | 91.9 | 64.8 | 89.4 |
| GPT-4o | w/o prompt | 70.8 | 63.3 | 89.5 | 63.1 | 86.0 |
| GPT-4o | QG-CoC | 74.9 | 65.8 | 90.3 | 66.7 | 88.9 |
Ablations demonstrate that the incremental gains are attributable to both the decomposition and the targeted captioning phases, with cumulative improvement on challenging multi-image settings (e.g., up to +12 points on MUIR for LLaVA-OV) (Kao et al., 5 Nov 2025). Error analysis on 120 cases revealed that errors split roughly evenly between misunderstanding the decomposed sub-tasks (33.3%), perception failures (31.7%), and reasoning mistakes after caption extraction (35.0%).
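The per-method gains can be read off directly from the table; a tiny script over those published LLaVA-OV numbers makes the "+12 points on MUIR" figure explicit:

```python
# Accuracy values copied from the benchmark table above (Kao et al., 5 Nov 2025), LLaVA-OV rows.
baseline = {"MUIR": 41.2, "MMIU": 44.6, "ScienceQA": 94.5, "MMMU": 45.4, "MMBench": 85.1}
qg_coc   = {"MUIR": 53.3, "MMIU": 50.9, "ScienceQA": 94.5, "MMMU": 48.9, "MMBench": 87.6}

for bench in baseline:
    print(f"LLaVA-OV {bench}: +{qg_coc[bench] - baseline[bench]:.1f}")
# MUIR: +12.1, MMIU: +6.3, ScienceQA: +0.0, MMMU: +3.5, MMBench: +2.5
```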
6. Comparative Analysis and Limitations
QG-CoC contrasts with, and empirically outperforms, naïve per-image captioning, single-image CoT, and prior "chain-of-captions" variants that do not enforce question-guided focus or proper chaining. Notable limitations include:
- Scaling bottlenecks: The approach requires a chained prompt per (sub-question, image) pair and can strain the context window for large $m$ (sub-questions) or $n$ (images); an illustrative token-budget estimate follows this list.
- Model dependency: Reliance on the MLLM's captioning and reasoning abilities; subpar MLLMs can diminish the method's advantage (Kao et al., 5 Nov 2025).
- Explicit knowledge integration: The QG-CoC pipeline remains "model-agnostic" and does not integrate external reasoning tools or explicit spatial/geometric modules—future research directions proposed include hybridizing with tool augmentation and mixed-modal streams.
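To see why the context window matters, consider an illustrative configuration (assumed for the estimate, not reported in the paper) of $m = 4$ sub-questions, $n = 6$ images, and roughly $\bar{\ell} = 60$ tokens per caption:

$$\underbrace{m \cdot n}_{\text{captions}} \cdot \bar{\ell} \;=\; 4 \times 6 \times 60 \;=\; 1440 \ \text{tokens},$$

which already approaches the 2,048-token context used in the reported setup before sub-question answers, the original question, and the instructions themselves are added.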
7. Applications and Illustrative Example
QG-CoC is designed primarily for multi-image reasoning benchmarks, such as comparison, temporal/spatial synthesis, and detailed scene understanding tasks. A representative example involves tabular image comparison for entity matching (Kao et al., 5 Nov 2025): When asked for affiliations of authors shown in three images, naïve captioning yielded vague summaries, whereas QG-CoC decomposed the question by row and entity, producing pointed captions (e.g., "Row 1: Author 'Xu' is from 'Stanford'; Author 'Lee' is from 'MIT'."), directly enabling correct matching in the final answer. For single-image tasks, QG-CoC retains or slightly improves performance over baseline CoT or captioning methods (Kao et al., 5 Nov 2025, Uehara et al., 2024).
QG-CoC establishes a structured, question-driven pipeline for fine-grained multimodal reasoning, empirically validated by accuracy gains on multi-image tasks across both open-source and proprietary MLLMs. It directly addresses perception-reasoning integration deficits in current models and serves as a robust foundation for future multimodal research spanning images, text, and other modalities (Kao et al., 5 Nov 2025, Uehara et al., 2024).