
POVQA: Preference-Optimized Video QA

Updated 5 October 2025
  • POVQA is a video Q&A framework that compresses high-frame-rate video into pooled image tokens, reducing token count by over 23×.
  • It integrates supervised fine-tuning and Direct Preference Optimization to align large vision-language model outputs with human-like rationales.
  • Evaluations on ReasonVQA and TVQA show significant improvements in F1, BLEU, and ROUGE metrics along with robust zero-shot performance.

POVQA, or Preference-Optimized Video Question Answering, is a paradigm and system for efficiently performing Video Question Answering (VQA) with Large Vision-Language Models (LVLMs) by drastically compressing video representations and aligning model outputs with human preferences and rationales. Introduced in (Dahal et al., 1 Oct 2025), the method leverages temporally pooled image tokens to address data and computational bottlenecks in long-context video reasoning with LVLMs, combined with supervised fine-tuning on rationales and answers and a further preference-optimization stage.

1. Temporal Pooling for Data-Efficient Video Representation

POVQA employs a temporal pooling strategy that compacts high-frame-rate video into a minimal set of per-second summary images. For a video with frame rate $f$ (e.g., 24–60 fps), the frames within each one-second window $W_s = \{ I_\tau \mid \tau \in [(s-1)f + 1,\, sf] \}$ are merged into a single pooled image $\tilde{I}_s$. Several pooling functions are supported:

  • Weighted Average (WA): $\tilde{I}_s = \sum_{\tau \in W_s} w_s(\tau)\, I_\tau$ with uniform weights $w_s(\tau) = 1/|W_s|$.
  • Weighted Average Exponential (WAE): $w_s(\tau) = \exp[\lambda(\tau - sf)] / \sum_{\kappa \in W_s} \exp[\lambda(\kappa - sf)]$ (exponential recency bias).
  • Weighted Average Ramp (WAR): $w_s(\tau) = (\tau - (s-1)f) / \sum_{\kappa \in W_s} (\kappa - (s-1)f)$ (linear recency bias).
  • Blend-Blur with Last Frame (BBLF): $\tilde{I}_s = \alpha I_{\text{last}} + (1-\alpha)\, G_s(\bar{I}_s)$ (final frame blended with a blurred average).
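The four operators above can be sketched in a few lines of NumPy. This is a hypothetical illustration, not the paper's implementation: the recency rate `lam`, the blend factor `alpha`, and the box blur standing in for the (unspecified here) blur $G_s$ are all assumptions.

```python
import numpy as np

def pool_window(frames, mode="WA", lam=0.5, alpha=0.5):
    """Pool one 1-second window of frames (T, H, W, C) into a single image.

    Sketch of POVQA-style pooling; `lam`, `alpha`, and the box blur used
    for G_s are illustrative assumptions, not values from the paper.
    """
    T = frames.shape[0]
    tau = np.arange(1, T + 1, dtype=np.float64)   # window-local frame indices
    if mode == "WA":                              # uniform weights 1/|W_s|
        w = np.full(T, 1.0 / T)
    elif mode == "WAE":                           # exponential recency bias
        e = np.exp(lam * (tau - T))
        w = e / e.sum()
    elif mode == "WAR":                           # linear ramp toward the last frame
        w = tau / tau.sum()
    elif mode == "BBLF":                          # last frame blended with blurred mean
        mean = frames.mean(axis=0)
        # box blur as a stand-in for the blur G_s
        blurred = (mean + np.roll(mean, 1, axis=0) + np.roll(mean, -1, axis=0)
                   + np.roll(mean, 1, axis=1) + np.roll(mean, -1, axis=1)) / 5.0
        return alpha * frames[-1] + (1 - alpha) * blurred
    else:
        raise ValueError(mode)
    return np.tensordot(w, frames, axes=(0, 0))   # sum_tau w_s(tau) * I_tau
```

With a 24 fps input, each call consumes 24 frames and emits one pooled image, so a clip of $S$ seconds yields exactly $S$ summary images.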

This compresses a 5-minute, 24 fps video from 7200 frames (up to 512 image tokens per frame, roughly 369,368 tokens in total) into just 300 pooled images (roughly 16,088 tokens), an approximately $23\times$ reduction, while preserving the key motion and appearance features needed for VQA.
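The compression arithmetic from the figures above checks out directly:

```python
# Token-count arithmetic for per-second pooling of a 5-minute, 24 fps clip,
# using the approximate token counts quoted in the text.
frames_raw = 5 * 60 * 24          # 7200 frames before pooling
images_pooled = 5 * 60            # 300 pooled images, one per second
tokens_raw = 369_368              # approx. tokens for the raw frame sequence
tokens_pooled = 16_088            # approx. tokens after pooling

ratio = tokens_raw / tokens_pooled
print(frames_raw, images_pooled, round(ratio, 1))   # 7200 300 23.0
```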

2. Model Alignment via Supervised Fine-Tuning and Preference Optimization

The LVLM (Qwen2.5-VL 7B) is aligned to the compressed video context using a two-stage regimen:

In the first stage (supervised fine-tuning, SFT), the model is trained to produce a “Reasoning:” rationale followed by a “Final Answer:” using supervised data. The loss is the token-level negative log-likelihood:

\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x, y)} \sum_i \log \pi_\theta(y_i \mid x, y_{<i})

QLoRA (Quantized Low Rank Adaptation) is used for efficient parameter updates.
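The SFT objective can be sketched with NumPy on toy inputs; the logits and target sequence below are illustrative, not drawn from the model:

```python
import numpy as np

def sft_nll(logits, targets):
    """Token-level NLL: -sum_i log pi(y_i | x, y_<i).

    `logits` has shape (seq_len, vocab); row i is the model's unnormalized
    distribution over token i given the prompt and previous target tokens.
    """
    # numerically stable log-softmax per position
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # pick out the log-probability of each target token and sum
    return -log_probs[np.arange(len(targets)), targets].sum()
```

In practice this loss is summed over the rationale and answer tokens of each supervised example, with gradients flowing only through the QLoRA adapter weights.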

In the second stage, the model is further optimized with Direct Preference Optimization (DPO) on labeled preference triplets $(x, y^+, y^-)$ (preferred/dispreferred rationale-answer sequences) using a logistic loss:

\Delta(x, y^+, y^-) = \left[ \log \pi_\theta(y^+ \mid x) - \log \pi_\theta(y^- \mid x) \right] - \left[ \log \pi_{\text{ref}}(y^+ \mid x) - \log \pi_{\text{ref}}(y^- \mid x) \right]

\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x, y^+, y^-)} \left[ \log \sigma\big(\beta\, \Delta(x, y^+, y^-)\big) \right]

where $\pi_{\text{ref}}$ is the frozen SFT model.
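Given summed sequence log-probabilities from the policy and the frozen reference, the DPO objective reduces to a logistic loss on the margin $\Delta$. A minimal sketch (the value of $\beta$ here is a placeholder, not the paper's setting):

```python
import numpy as np

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """DPO logistic loss for one (x, y+, y-) preference pair.

    Each argument is a summed sequence log-probability log pi(y|x);
    the `ref_*` values come from the frozen SFT reference model.
    """
    # policy margin minus reference margin
    delta = (logp_pos - logp_neg) - (ref_logp_pos - ref_logp_neg)
    # -log sigma(beta * delta), written stably as log(1 + exp(-beta * delta))
    return np.log1p(np.exp(-beta * delta))
```

The loss shrinks as the policy widens the preferred/dispreferred margin relative to the reference; at $\Delta = 0$ it equals $\log 2$.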

This protocol aligns LVLM outputs not only for correct answers but also for human-like rationales, ensuring outputs are concise, interpretable, and faithful to evidence.

3. ReasonVQA Dataset and Evaluation Metrics

POVQA was evaluated on ReasonVQA, a hand-curated dataset containing 239 triplets of question, answer, and human rationale from 12 movies spanning diverse genres. Each data point includes temporally anchored subtitle context interleaved with visual tokens. Performance was measured using:

  • F1 Score (precision/recall for QA): Improved from 0.212 (pooled baseline) to 0.543 (POVQA-SFT+DPO).
  • BLEU-4 (n-gram fluency): Increased from 0.031 to 0.291.
  • ROUGE-L (longest subsequence similarity): Rose from 0.196 to 0.528.
  • Embedding Cosine Similarity (semantic alignment with the human rationale): also increased, indicating higher rationale quality.

The gains persisted across pooling operator choices and transferred to zero-shot settings.

4. Robustness to Pooling Operator Choice

The method’s effectiveness does not depend on the pooling function chosen to summarize frames. Evaluations using BBLF, WA, WAE, and WAR at both training and inference showed consistent improvements in F1, BLEU, ROUGE, and embedding-based rationale quality. This establishes that temporal pooling itself yields a robust summary for LVLM reasoning; appearance-preserving pooling (such as BBLF) performs particularly well, but the model generalizes across all operators.

5. Cross-Domain Generalization and Zero-Shot Results

Zero-shot evaluations on the TVQA dataset (not seen during training) demonstrate strong generalizability:

  • POVQA fine-tuned on ReasonVQA: 64.7% accuracy.
  • Pooling-only POVQA variant: 69.7% accuracy.

This indicates that temporal pooling provides effective domain adaptation for video QA, with rationales and answers generated by the model remaining coherent and contextually grounded even for unseen data.

6. Implications and Technical Summary

POVQA provides a framework for solving the token bottleneck in LVLM-based video reasoning. The technique enables:

  • Substantial reduction in data footprint ($23\times$ fewer tokens) while maintaining episode-level context coverage.
  • Efficient alignment with human cognitive processes using rationale supervision and preference optimization.
  • Robust summarization, independent of pooling details, supporting deployment in diverse VQA scenarios with potentially long-form clips.
  • Transferability to large-scale datasets and zero-shot generalization, rendering it practical for real-world and resource-constrained applications.

The key equations governing the system include the temporal pooling:

\tilde{I}_s = \sum_{\tau \in W_s} w_s(\tau)\, I_\tau

and the blend–blur operation:

\tilde{I}_s = \alpha I_{\text{last}} + (1 - \alpha)\, G_s(\bar{I}_s)

with the SFT and DPO losses supporting preference-aligned model tuning.

POVQA represents a consolidated approach that allows LVLMs to effectively address long-context VQA tasks using sparse, semantically rich inputs, yielding substantive gains in both answer accuracy and rationale quality in human-annotated and zero-shot settings (Dahal et al., 1 Oct 2025).

References
  1. Dahal et al., “POVQA: Preference-Optimized Video QA,” 1 October 2025.
