
QCaption: Question-Aware Captioning

Updated 17 January 2026
  • QCaption is a framework that employs multimodal models to generate and assess captions tailored to specific user questions.
  • It integrates techniques such as geometry-informed visual encoders, question-guided modules, and late-fusion pipelines to enhance caption utility.
  • QA-driven evaluation metrics are used to gauge caption effectiveness, aligning performance with downstream question answering tasks.

A QCaption system refers to a class of techniques and frameworks that generate, process, or exploit question-aware or question-controlled captions, most commonly in multimodal (image or video) contexts. QCaption approaches combine advances in vision-language modeling, question answering (QA), controllable generation, and evaluation methodology to produce captions that are adaptive to questions (question-controlled captioning), facilitate downstream QA (caption-for-QA), or enable caption evaluation via QA-derived metrics. QCaption has also been used as a pipeline name for a modular fusion-based video captioning and Q&A system. This article surveys core QCaption architectures, control signal methodologies, evaluation protocols, algorithmic designs, and their impact on captioning research and practice.

1. Definitions and Paradigms

QCaption denotes several closely related lines of research:

  • Question-controlled captioning: The system generates a caption that not only describes an image or video but also directly answers a posed question, often incorporating information extracted from both visual objects and scene text (Hu et al., 2021).
  • QA-driven caption evaluation: Caption quality is measured by generating targeted questions from the caption and evaluating whether either a reference caption or the original image/video can answer those questions correctly (Lee et al., 2021).
  • Utility-oriented caption quality measurement: QCaption may refer to frameworks that estimate how effective captions are for supporting downstream QA tasks, directly quantifying their utility as substitutes for images or videos in agentic or retrieval settings (Yang et al., 26 Nov 2025, Zhang et al., 29 May 2025).
  • Modular video captioning and Q&A fusion pipelines: "QCaption" has also been used as the name of a late-fusion pipeline for video captioning and QA, wherein off-the-shelf frame-level captioners and LLMs are combined to produce comprehensive video descriptions or answers (Wang et al., 10 Jan 2026).

2. Question-Controlled Captioning Architectures

The principal instantiation of question-controlled captioning is the Geometry and Question Aware Model (GQAM) (Hu et al., 2021). Given an image $I$, a coarse initial caption $C^{\rm ini}$ (usually object-centric), and a set of user questions $\mathcal{Q} = \{Q_1, \ldots, Q_K\}$, the model produces a text-aware caption $Y$ by conditioning on all three inputs:

p_\theta(Y \mid I, C^{\rm ini}, \mathcal{Q}) = \prod_{t=1}^{L} p_\theta(y_t \mid y_{<t}, I, C^{\rm ini}, \mathcal{Q})

  • Geometry-informed Visual Encoder: Fuses object and scene text regions using spatial bias and visual affinity.
  • Question-guided Encoder: Attends from each question token to relevant visual features, constructing question-aware representations.
  • Multimodal Decoder: Merges geometry-informed features, question tokens, and initial caption, with pointer-generator to output text or copy OCR tokens.

Training is performed via cross-entropy with ground-truth captions. Empirically, this approach—especially the inclusion of the question-guided module—is essential for high performance; removing the question input leads to a collapse in caption quality (CIDEr drop of ~120) (Hu et al., 2021).
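Under this factorization, training reduces to token-level cross-entropy against the ground-truth caption. A minimal sketch of the objective (toy numbers, not the paper's implementation):

```python
import math

def caption_log_likelihood(token_log_probs):
    """Sequence log-probability under the autoregressive factorization:
    log p(Y | I, C_ini, Q) = sum_t log p(y_t | y_<t, I, C_ini, Q)."""
    return sum(token_log_probs)

def cross_entropy_loss(token_log_probs):
    """Training objective: negative log-likelihood of the ground-truth
    caption, averaged over its L tokens."""
    return -caption_log_likelihood(token_log_probs) / len(token_log_probs)

# Toy example: a 4-token caption whose per-token conditionals are all 0.5.
loss = cross_entropy_loss([math.log(0.5)] * 4)
print(round(loss, 4))  # -> 0.6931 (= ln 2)
```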

Two large-scale datasets, ControlTextCaps and ControlVizWiz, support this paradigm. They are constructed by systematically stripping scene text from existing captions, generating object-only initial captions, and synthesizing question–answer pairs focused on scene text.

3. QA-Driven Caption Evaluation Metrics

Question-answering is used as an explicit mechanism for evaluating how well a caption conveys visual content, notably in the QACE (“Asking Questions to Evaluate an Image Caption”) metric (Lee et al., 2021). The procedure is:

  1. Question generation: Automatically extract contentful noun phrases from the caption and, for each, generate a targeted QA pair using a fine-tuned T5 model.
  2. QA answering: For each generated question, the system extracts answers from both the evaluated caption and a context (reference caption or image).
    • In QACE-Ref: the context is a gold reference caption; answers are produced by a textual QA model.
    • In QACE-Img: the context is the source image; answers are produced by an abstractive VQA model (“Visual-T5”).
  3. Answer comparison and scoring: For each QA pair, partial scores are computed:
    • SQuAD-style F1 score (word overlap),
    • BERTScore (contextual embedding similarity),
    • Answerability (the QA model's confidence that the question is answerable). The three scores are averaged per QA pair, then across all questions:

\textrm{QACE}(x, ctx) = \frac{1}{M} \sum_{i=1}^{M} \frac{F1_i + \mathrm{BERTScore}_i + \mathrm{Ansbl}_i}{3}

QACE-Img is a multi-modal, reference-less, and explainable metric, while QACE-Ref provides a direct comparison to reference wording. In experiments, QACE-Img achieved strong correlation with human judgments and outperformed previous reference-less metrics (Lee et al., 2021).
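A skeletal implementation of the QACE aggregation is shown below. The SQuAD-style F1 is computed directly, while the BERTScore and answerability terms are assumed to come from external models; the field names are illustrative:

```python
import collections

def squad_f1(pred, gold):
    """SQuAD-style token-overlap F1 between two answer strings."""
    p, g = pred.lower().split(), gold.lower().split()
    common = collections.Counter(p) & collections.Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def qace_score(qa_pairs):
    """qa_pairs: per-question dicts holding the candidate caption's answer,
    the context's answer, and precomputed BERTScore / answerability terms.
    Each question scores (F1 + BERTScore + Ansbl) / 3; scores are then
    averaged across all M questions."""
    per_question = [
        (squad_f1(qa["cand_answer"], qa["ctx_answer"])
         + qa["bertscore"] + qa["answerability"]) / 3
        for qa in qa_pairs
    ]
    return sum(per_question) / len(per_question)

pairs = [
    {"cand_answer": "a red bus", "ctx_answer": "red bus",
     "bertscore": 0.9, "answerability": 1.0},
]
print(round(qace_score(pairs), 3))  # -> 0.9
```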

4. Caption Utility and Downstream QA

Utility-centric benchmarks such as CaptionQA directly measure how well a caption supports downstream multimodal applications (Yang et al., 26 Nov 2025). The utility is quantified by the success rate of a text-only LLM in answering domain-specific, fine-grained multiple-choice questions using only the caption, compared to the image:

U(\mathbf{C}) = \frac{1}{|\mathcal{Q}|} \sum_{\mathbf{Q}\in\mathcal{Q}} s(\mathbf{Q},\mathbf{C})

where $s(\mathbf{Q},\mathbf{C}) = 1$ if the LLM answers correctly, $0$ if it answers incorrectly, and $\frac{1}{K} + 0.05$ for “Cannot answer,” with $K$ the number of answer choices.

Empirical results indicate a substantial utility gap between captions and images, especially for open-source models. For instance, caption utility on downstream QA tasks can drop by up to 32% compared to image utility for state-of-the-art MLLMs. This demonstrates that traditional metrics (e.g., BLEU, CIDEr) do not necessarily capture the effectiveness of captions as surrogates for image content (Yang et al., 26 Nov 2025).
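The utility score, including the chance-plus-margin credit for abstention, can be sketched as follows (the outcome labels are illustrative):

```python
def caption_utility(responses, k_options=4):
    """responses: per-question outcomes of a text-only LLM reading only the
    caption, each one of 'correct', 'incorrect', or 'cannot_answer'.
    k_options: number of multiple-choice options K. An abstention is
    scored at chance level plus a small margin, 1/K + 0.05."""
    def score(r):
        if r == "correct":
            return 1.0
        if r == "cannot_answer":
            return 1.0 / k_options + 0.05
        return 0.0
    return sum(score(r) for r in responses) / len(responses)

# Toy run: 2 correct, 1 incorrect, 1 abstention with K = 4 choices.
print(round(caption_utility(
    ["correct", "correct", "incorrect", "cannot_answer"]), 3))  # -> 0.575
```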

5. Modular Pipeline QCaption Systems

The QCaption pipeline architecture for video captioning and Q&A exemplifies a modular, late-fusion approach (Wang et al., 10 Jan 2026):

Pipeline stages:

  1. Key-frame extraction: Selects $N = 8$ frames per video using methods such as Katna (color-space difference), regular-interval, or random sampling.
  2. Visual captioning: For each key frame, generate a detailed image caption with a large multimodal model ($f_\mathrm{LMM}$, e.g., LLaVA-v1.5).
  3. Text aggregation: Concatenate the frame captions and aggregate them with an LLM ($f_\mathrm{LLM}$, e.g., Vicuna-v1.5) to generate a final video caption or answer.

Formally:

X_i = f_\mathrm{LMM}(I_i, X_T) \qquad (i = 1, \ldots, N)

C = X_1 \mathbin{\|} X_2 \mathbin{\|} \cdots \mathbin{\|} X_N

X_o = f_\mathrm{LLM}(C, X_C)

where $X_T$ and $X_C$ are the text prompts given to the frame-level captioner and the aggregating LLM, respectively.

This late-fusion strategy achieves significant gains over early-fusion video captioners—up to 44.2% improvement in CIDEr for captioning, and 48.9% in QA accuracy on standard video datasets. Ablations show that removing the final LLM aggregator causes a 7–10 percentage point drop in QA accuracy and up to 19% relative loss in captioning (Wang et al., 10 Jan 2026).
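The three pipeline stages can be sketched end to end. The captioner and aggregator below are toy stand-ins for the LMM and LLM (the real pipeline uses, e.g., LLaVA and Vicuna), and key-frame selection is reduced to simple truncation:

```python
def qcaption_pipeline(frames, prompt, caption_frame, aggregate, n_frames=8):
    """Late-fusion video captioning/QA sketch.

    caption_frame: stand-in for the frame-level captioner f_LMM.
    aggregate:     stand-in for the text aggregator f_LLM.
    Key-frame selection is reduced to truncation here; the real pipeline
    uses, e.g., Katna's color-space-difference selector."""
    keyframes = frames[:n_frames]
    frame_captions = [caption_frame(f, prompt) for f in keyframes]   # X_i
    concatenated = " ".join(frame_captions)                          # C
    return aggregate(concatenated, prompt)                           # X_o

# Toy stand-ins so the pipeline runs end to end.
frames = [f"frame{i}" for i in range(10)]
out = qcaption_pipeline(
    frames,
    "Describe the video.",
    caption_frame=lambda f, p: f"caption of {f}",
    aggregate=lambda c, p: f"summary: {c}",
)
print(out.startswith("summary: caption of frame0"))  # -> True
```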

6. Control Signals and Quality Estimation

Beyond explicit question conditioning, QCaption research encompasses finer-grained control of generation and evaluation:

  • Control signal of sentence quality: QCaption (Zhu et al., 2022) proposes assigning each caption a quality embedding based on its CIDEr score relative to other captions for the same image. This quality level is fed as an input to the captioning network, enabling the model to preferentially generate high-quality captions at inference.

x_i = e_q + e_{y_i} + e_{p_i}

where $e_q$ is the learned quality embedding.

  • Quality estimation without references: Models trained on large-scale human evaluations can predict, for any $(I, c)$ pair, the probability a human would judge $c$ a “good” caption for $I$, without reference captions (Levinboim et al., 2019). Seeded by over 600k human binary ratings, such models facilitate inference-time filtering to serve more helpful captions, with demonstrated $3.4\times$ gains in recall of useful captions at fixed precision.
  • Utility-motivated learning and evaluation: Methods such as CapWAP frame caption generation as maximizing expected downstream QA performance with respect to a distribution of user information needs (expressed as question–answer pairs), using reinforcement learning with QA-based reward (Fisch et al., 2020).
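The quality-conditioned input of (Zhu et al., 2022) above can be sketched as follows; the CIDEr-to-bucket mapping and the embedding vectors are illustrative stand-ins, not the paper's exact scheme:

```python
import numpy as np

def quality_level(cider, all_ciders, n_levels=5):
    """Map a caption's CIDEr score to a discrete quality level by its rank
    among the captions of the same image (illustrative bucketing)."""
    rank = sorted(all_ciders).index(cider)
    return min(rank * n_levels // len(all_ciders), n_levels - 1)

def input_embedding(e_token, e_pos, e_quality):
    """x_i = e_q + e_{y_i} + e_{p_i}: the quality embedding is added to the
    usual token and position embeddings at every input position."""
    return e_token + e_pos + e_quality

d = 8                   # embedding dimension (toy)
e_q = np.ones(d)        # stand-in for the learned quality embedding e_q
e_y = np.zeros(d)       # stand-in token embedding e_{y_i}
e_p = np.full(d, 0.5)   # stand-in position embedding e_{p_i}

level = quality_level(0.9, [0.1, 0.5, 0.9, 1.2])
x = input_embedding(e_y, e_p, e_q)
print(level, x[0])  # level 2 of 5; each component of x is 1.5
```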

7. Empirical Impact and Open Research Issues

QCaption systems and metrics drive progress along several technical axes:

  • Controllability and diversity: Question-guided and quality-controlled captioners produce more informative, question-relevant, and diverse descriptions, especially in contexts with scene text and user-specific interests (Hu et al., 2021).
  • Evaluation signal: QA-derived metrics (QACE, CaptionQA) offer fine-grained, explainable evaluations aligned with actual utility, revealing gaps left by reference-based measures (Lee et al., 2021, Yang et al., 26 Nov 2025).
  • Modularity and extensibility: Fusion pipelines enable flexible deployment and systematic ablation. Caption-for-QA and QA-for-caption models can share core architectures and data.
  • Limitations and future directions: QCaption performance remains bounded by the compositional and spatial reasoning skills of the underlying MLLMs, especially in multi-turn, multi-perspective, or temporal settings. Integrating richer intermediate representations (graphs, sketches), extending to other modalities, or jointly fine-tuning LMM+LLM modules are active areas for improvement (Wang et al., 10 Jan 2026, Kao et al., 5 Nov 2025).

QCaption, in its various forms, provides a foundational methodology for aligning caption generation and evaluation with user information needs and concrete downstream tasks, complementing and extending the traditional paradigm of reference-based captioning.
