Question-Guided Chain-of-Captions (QG-CoC)
- The paper introduces a novel pipeline that decomposes complex queries into targeted sub-questions, generating localized captions to improve multi-image reasoning.
- The methodology integrates question decomposition, sequential caption generation, and answer synthesis using MLLMs to maintain chain continuity.
- Empirical evaluations demonstrate significant accuracy gains on benchmarks like MUIR and MMIU, validating the effectiveness of structured caption chaining.
Question-Guided Chain-of-Captions (QG-CoC) is a class of methods for multimodal reasoning that orchestrates a structured, question-driven chain of localized captions to facilitate fine-grained perception and inference, particularly in multi-image input regimes for Multimodal LLMs (MLLMs). QG-CoC frameworks systematically decompose complex queries into targeted sub-questions, use these as guides to elicit focused visual captions from each input (image or region), and then integrate these captions and answers in a multi-stage reasoning process. This approach unifies and extends principles from question-guided captioning, chain-of-thought (CoT) prompting, and visual question answering—offering marked improvements over traditional captioning and single-image CoT strategies, especially for tasks involving multi-image synthesis, comparison, and detailed visual reasoning (Kao et al., 5 Nov 2025, Uehara et al., 2024).
1. Motivation and Conceptual Foundations
The challenge addressed by QG-CoC is twofold: the need for (1) fine-grained, task-relevant perception across multiple, disparate images and (2) structured, explicit integration of visual clues for multi-step reasoning. State-of-the-art MLLMs such as LLaVA, Qwen-VL, GPT-4o, and Gemini-1.5, while highly performant at single-image perception and language understanding, exhibit deficiencies when required to (a) extract localized details (counting, object identity, spatial features) across more than one image, and (b) integrate these details into holistic multi-image logic chains for complex queries. Existing methods that apply CoT or naive captioning per image often collapse critical information, whether through over-conciseness, over-generalization, or a failure to propagate relevant cues between images and sub-tasks. QG-CoC was developed to address these limitations by enforcing a decomposition-caption-integration pipeline, structurally aligning each step with a sub-aspect of the original query (Kao et al., 5 Nov 2025).
2. Formal Methodology and Pipeline
Given a set of images $\{I_1, \dots, I_n\}$ and a user question $Q$, QG-CoC implements the following pipeline:
- Question Decomposition
  - The input question is decomposed by prompting the MLLM: $Q \rightarrow \{q_1, \dots, q_m\}$.
  - Sub-questions $q_j$ correspond to sub-aspects (object, relation, attribute, action) necessary for answering $Q$.
- Question-Guided Caption Generation (Chain Construction)
  - For each sub-question $q_j$ and for each image $I_k$, a localized caption $c_{j,k}$ is generated.
  - $c_{j,1}$ is conditioned on $(q_j, I_1)$; for $k > 1$, $c_{j,k}$ is additionally conditioned on the prior captions for that sub-question, i.e. $c_{j,k} = \mathrm{MLLM}(q_j, I_k, c_{j,1}, \dots, c_{j,k-1})$, enforcing chain continuity.
  - The process is repeated for all sub-questions.
- Sub-Question Answering and Integration
  - For each $q_j$, the MLLM is prompted with the sub-question and its caption chain, $a_j = \mathrm{MLLM}(q_j, c_{j,1}, \dots, c_{j,n})$, yielding answers $a_1, \dots, a_m$.
  - Final answer synthesis invokes $\hat{a} = \mathrm{MLLM}(Q, \{(q_j, a_j)\}_{j=1}^{m})$.

This pipeline is executed strictly zero-shot (no parameter updates), with sampling parameters temperature $=0$, a context window of 2,048 tokens, and no specialized tokens or finetuning (Kao et al., 5 Nov 2025).
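One consequence of this structure is that the number of MLLM calls grows with both the number of sub-questions $m$ and the number of images $n$. Counting one decomposition call, one caption call per (sub-question, image) pair, one answer call per sub-question, and one synthesis call gives

$$N_{\text{calls}} = 1 + m \cdot n + m + 1 = mn + m + 2,$$

which is the source of the scaling bottleneck discussed in Section 6.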
3. Core Algorithms and Model Architecture
The canonical QG-CoC method is conceptualized as an orchestration of text prompts—no specialized model architecture is required. However, related implementations (such as (Uehara et al., 2024)) incorporate architectural innovations for single-image chains:
- Image Backbone: Pretrained vision encoder (e.g., CLIP ViT-Large) for global/regional features. In dual-input designs, both global image and masked RoI representations are encoded.
- Q-Former Adapter: A trainable Transformer adapter, receiving image embeddings and supplying "query tokens" that encourage region-aware grounding, mirroring BLIP-2.
- Text Decoder: LLM (e.g., LLaMA-2-chat-13B), receiving both Q-Former outputs and projected image features. Generates sequences including reasoning steps, uncertainty scalars, and, when needed, explicit question/answer pairs as part of the chain.
- Formal Generation Factorization: the chain of captions, generated questions, and answers is produced autoregressively,

$$p(c_{1:T}, q_{1:T}, a_{1:T} \mid I) = \prod_{t=1}^{T} p(c_t, q_t, a_t \mid I, c_{<t}, q_{<t}, a_{<t}),$$

where $c_t$ are captions ("reasoning steps"), $q_t$ are generated questions, and $a_t$ their answers.
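To make the architectural description above concrete, the following is a minimal schematic sketch in PyTorch. The module names, dimensions, and the use of a single cross-attention layer as a stand-in Q-Former are illustrative assumptions, not the released implementation of (Uehara et al., 2024); only the overall data flow (frozen vision encoder, trainable adapter, projection into the LLM embedding space) follows the description above.

```python
import torch
import torch.nn as nn

class QFormerAdapter(nn.Module):
    """Stand-in Q-Former: learnable query tokens cross-attend to image features (BLIP-2 style)."""
    def __init__(self, num_queries=32, dim=1024, num_heads=8):
        super().__init__()
        self.query_tokens = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, image_feats):                        # image_feats: (B, N_patches, dim)
        queries = self.query_tokens.expand(image_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(queries, image_feats, image_feats)
        return attended + self.ffn(attended)               # (B, num_queries, dim) region-aware query tokens

class ChainCaptionEncoder(nn.Module):
    """Encodes global image + masked-RoI features into a visual prefix for the text decoder."""
    def __init__(self, vision_encoder, vision_dim=1024, llm_dim=5120):
        super().__init__()
        self.vision_encoder = vision_encoder               # e.g., a frozen CLIP ViT-Large (hidden size 1024)
        self.qformer = QFormerAdapter(dim=vision_dim)      # trainable adapter
        self.proj = nn.Linear(vision_dim, llm_dim)         # project into the LLM embedding space (5120 for LLaMA-2-13B)

    def forward(self, image, roi_mask):
        global_feats = self.vision_encoder(image)          # (B, N, vision_dim) global stream
        roi_feats = self.vision_encoder(image * roi_mask)  # masked region-of-interest stream (mask assumed broadcastable)
        fused = torch.cat([global_feats, roi_feats], dim=1)
        prefix = self.proj(self.qformer(fused))            # (B, num_queries, llm_dim)
        return prefix  # prepended to text embeddings; the LLM then decodes the caption/question/answer chain
```

Whether the two streams share one frozen encoder, as assumed here, is an implementation detail not fixed by the description above.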
4. Prompt Engineering and Inference Procedure
All prompt templates leverage zero-shot instructions with deterministic decoding. The inference process (for multi-image QG-CoC) is as follows:
```python
# Zero-shot QG-CoC inference; MLLM(...) denotes a single deterministic call to the multimodal model,
# and the <Image k> placeholders are resolved by the multimodal interface.
prompt1 = f"You are given a question: '{Q}'. Decompose it into a numbered list of clear sub-questions..."
sub_questions = MLLM(prompt1)                       # [q_1, ..., q_m]

captions = {}                                       # captions[(j, k)] = c_{j,k}
for j, q_j in enumerate(sub_questions, start=1):
    for k in range(1, n + 1):
        prompt2 = f"Here is Image {k}: <Image {k}>. Sub-question: '{q_j}'. Provide a detailed caption..."
        captions[(j, k)] = MLLM(prompt2)

answers = []
for j, q_j in enumerate(sub_questions, start=1):
    caption_list = "; ".join(captions[(j, k)] for k in range(1, n + 1))
    prompt3 = f"Sub-question: '{q_j}'. Captions: {caption_list}. Provide a concise answer."
    answers.append(MLLM(prompt3))                   # a_j

qa_pairs = "; ".join(f"{q}: {a}" for q, a in zip(sub_questions, answers))
prompt4 = f"Sub-questions and answers: {qa_pairs}. Based on these, answer the original question: '{Q}'."
final_answer = MLLM(prompt4)
```
Decoding uses `do_sample=False`, `temperature=0`, and a maximum context of 2,048 tokens (Kao et al., 5 Nov 2025).
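The `MLLM(...)` calls above can be backed by any instruction-tuned vision-language model. As one possible realization (the backbone placeholder, processor usage, and output trimming here are assumptions for illustration, not the paper's code), a thin Hugging Face-style wrapper with greedy decoding might look like:

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_NAME = "your-multimodal-checkpoint"  # illustrative placeholder; any chat-tuned MLLM could be substituted
processor = AutoProcessor.from_pretrained(MODEL_NAME)
model = AutoModelForVision2Seq.from_pretrained(MODEL_NAME, torch_dtype=torch.float16, device_map="auto")

def MLLM(prompt, images=None, max_new_tokens=256):
    """One deterministic (greedy) call, mirroring the do_sample=False / temperature=0 setting above."""
    inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    text = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
    return text  # decoder-only backbones may echo the prompt; trim it before downstream use
```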
5. Empirical Performance and Benchmarking
QG-CoC has been evaluated across multi-image benchmarks (MUIR, MMIU, MuirBench) and single-image generalization tasks (MMMU, MMBench, ScienceQA). The principal metric is answer accuracy (% correct; not BLEU/CIDEr). Main quantitative findings include (Kao et al., 5 Nov 2025):
| Model | Method | MUIR | MMIU | ScienceQA | MMMU | MMBench |
|---|---|---|---|---|---|---|
| LLaVA-OV | w/o prompt | 41.2 | 44.6 | 94.5 | 45.4 | 85.1 |
| LLaVA-OV | QG-CoC | 53.3 | 50.9 | 94.5 | 48.9 | 87.6 |
| Qwen-2.5-VL | w/o prompt | 62.1 | 50.3 | 90.2 | 58.2 | 88.2 |
| Qwen-2.5-VL | QG-CoC | 65.3 | 56.9 | 91.9 | 64.8 | 89.4 |
| GPT-4o | w/o prompt | 70.8 | 63.3 | 89.5 | 63.1 | 86.0 |
| GPT-4o | QG-CoC | 74.9 | 65.8 | 90.3 | 66.7 | 88.9 |
Ablations demonstrate that the incremental gains are attributable to both the decomposition and the targeted captioning phases, with cumulative improvement on challenging multi-image settings (e.g., up to +12 points on MUIR for LLaVA-OV) (Kao et al., 5 Nov 2025). Error analysis on 120 cases revealed that errors split roughly evenly between misunderstanding the decomposed sub-tasks (33.3%), perception failures (31.7%), and reasoning mistakes after caption extraction (35.0%).
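The per-method gains can be read off directly from the table; a tiny script over those published LLaVA-OV numbers makes the "+12 points on MUIR" figure explicit:

```python
# Accuracy values copied from the benchmark table above (Kao et al., 5 Nov 2025), LLaVA-OV rows.
baseline = {"MUIR": 41.2, "MMIU": 44.6, "ScienceQA": 94.5, "MMMU": 45.4, "MMBench": 85.1}
qg_coc   = {"MUIR": 53.3, "MMIU": 50.9, "ScienceQA": 94.5, "MMMU": 48.9, "MMBench": 87.6}

for bench in baseline:
    print(f"LLaVA-OV {bench}: +{qg_coc[bench] - baseline[bench]:.1f}")
# MUIR: +12.1, MMIU: +6.3, ScienceQA: +0.0, MMMU: +3.5, MMBench: +2.5
```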
6. Comparative Analysis and Limitations
QG-CoC contrasts with, and empirically outperforms, naïve per-image captioning, single-image CoT, and prior "chain-of-captions" variants that do not enforce question-guided focus or proper chaining. Notable limitations include:
- Scaling bottlenecks: The approach requires a chained prompt per (sub-question, image) pair and can strain the context window for large $m$ (sub-questions) or $n$ (images); an illustrative token-budget estimate follows this list.
- Model dependency: Reliance on the MLLM's captioning and reasoning abilities; subpar MLLMs can diminish the method's advantage (Kao et al., 5 Nov 2025).
- Explicit knowledge integration: The QG-CoC pipeline remains "model-agnostic" and does not integrate external reasoning tools or explicit spatial/geometric modules—future research directions proposed include hybridizing with tool augmentation and mixed-modal streams.
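To see why the context window matters, consider an illustrative configuration (assumed for the estimate, not reported in the paper) of $m = 4$ sub-questions, $n = 6$ images, and roughly $\bar{\ell} = 60$ tokens per caption:

$$\underbrace{m \cdot n}_{\text{captions}} \cdot \bar{\ell} \;=\; 4 \times 6 \times 60 \;=\; 1440 \ \text{tokens},$$

which already approaches the 2,048-token context used in the reported setup before sub-question answers, the original question, and the instructions themselves are added.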
7. Applications and Illustrative Example
QG-CoC is designed primarily for multi-image reasoning benchmarks, such as comparison, temporal/spatial synthesis, and detailed scene understanding tasks. A representative example involves tabular image comparison for entity matching (Kao et al., 5 Nov 2025): When asked for affiliations of authors shown in three images, naïve captioning yielded vague summaries, whereas QG-CoC decomposed the question by row and entity, producing pointed captions (e.g., "Row 1: Author 'Xu' is from 'Stanford'; Author 'Lee' is from 'MIT'."), directly enabling correct matching in the final answer. For single-image tasks, QG-CoC retains or slightly improves performance over baseline CoT or captioning methods (Kao et al., 5 Nov 2025, Uehara et al., 2024).
QG-CoC establishes a structured, question-driven pipeline for fine-grained multimodal reasoning, empirically validated by accuracy gains on multi-image tasks across both open-source and proprietary MLLMs. It directly addresses perception-reasoning integration deficits in current models and serves as a robust foundation for future multimodal research spanning images, text, and other modalities (Kao et al., 5 Nov 2025, Uehara et al., 2024).