- The paper demonstrates that LLMs, pretrained on extensive data, excel as temporal and causal reasoners in VideoQA tasks.
- The paper introduces Flipped-VQA, a framework that combines the standard VQA objective with two flipped objectives, VAQ (question generation) and QAV (video prediction), to reinforce video and question understanding.
- The resulting LLaMA-VQA model achieves state-of-the-art results on five VideoQA benchmarks with parameter-efficient fine-tuning while mitigating linguistic bias.
LLMs as Temporal and Causal Reasoners in VideoQA
The paper "LLMs are Temporal and Causal Reasoners for Video Question Answering" explores the utilization of LLMs for Video Question Answering (VideoQA) by leveraging their innate reasoning abilities. The paper introduces a framework known as Flipped-VQA, which effectively capitalizes on LLMs' strong priors for temporal and causal reasoning, thus enhancing the performance of these models on VideoQA tasks.
Summary of Contributions
The paper presents several key contributions:
- Investigation of LLMs' Reasoning Abilities: The paper shows that LLMs, by virtue of pretraining on extensive corpora, are effective temporal and causal reasoners, and that this ability carries over from text-only tasks to the multimodal setting of VideoQA. Empirical results show that larger LLMs handle complex causal and temporal questions better than smaller ones.
- Introduction of the Flipped-VQA Framework: A novel training scheme, Flipped-VQA, combines three objectives: standard Visual Question Answering (VQA), which predicts the answer from the video and question; VAQ, which predicts the question from the video and answer; and QAV, which predicts the video from the question and answer. Together these objectives push the model to capture the intricate relationships among video, question, and answer, leveraging the LLM's knowledge to generate questions and video tokens rather than only answers (a minimal sketch of the combined objective appears after this list).
- Evaluation and Performance: The proposed LLaMA-VQA model, trained with Flipped-VQA, is benchmarked against both LLM-based and non-LLM-based models on five notable VideoQA datasets. It achieves state-of-the-art results, excelling in particular on causal and temporal question types. Because the LLM backbone remains frozen, fine-tuning updates only a small fraction of the total parameters, reflecting the framework's efficiency (a parameter-counting illustration follows the sketch below).
- Mitigation of Linguistic Bias: A crucial aspect of the paper is the demonstration that Flipped-VQA mitigates linguistic bias. By flipping the roles of the input-output pairs, the framework reduces erroneous predictions that stem from over-reliance on linguistic shortcuts, i.e., linguistic hallucinations in which the model answers from the question text while ignoring the video. This effect is validated through an analysis of attention patterns and of the alignment between visual and textual embeddings.
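
To make the three objectives concrete, the sketch below shows how a single frozen decoder can be reused for all three orderings of (video, question, answer). This is a minimal illustration under stated assumptions, not the authors' implementation: the class name, the backbone interface, and the toy causal backbone are hypothetical, and the regression loss used for the QAV branch is only a stand-in for the paper's actual video-prediction target.

```python
# Minimal sketch of the Flipped-VQA combined objective (illustrative only,
# not the authors' code). Assumptions: a frozen decoder-only backbone that
# maps embeddings (B, T, H) -> hidden states (B, T, H); a trainable linear
# projection turning per-frame visual features into "video tokens"; and an
# MSE regression as a stand-in target for the QAV (video-prediction) branch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FlippedVQASketch(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_dim: int, visual_dim: int, vocab_size: int):
        super().__init__()
        self.backbone = backbone                               # frozen decoder (stand-in for the LLM)
        self.embed = nn.Embedding(vocab_size, hidden_dim)      # text token embeddings
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)   # trainable projection of frame features
        self.lm_head = nn.Linear(hidden_dim, vocab_size)       # predicts answer / question tokens
        self.visual_head = nn.Linear(hidden_dim, visual_dim)   # predicts frame features (QAV stand-in)
        for p in self.backbone.parameters():                   # the backbone stays frozen
            p.requires_grad = False

    def _hidden(self, segments):
        """Concatenate embedded segments and run the causal backbone."""
        return self.backbone(torch.cat(segments, dim=1))

    def forward(self, frames, question_ids, answer_ids):
        v = self.visual_proj(frames)          # (B, Tv, H) video tokens
        q = self.embed(question_ids)          # (B, Tq, H)
        a = self.embed(answer_ids)            # (B, Ta, H)
        tv, tq, ta = v.size(1), q.size(1), a.size(1)

        # VQA: (video, question) -> answer, next-token prediction over the answer span.
        h = self._hidden([v, q, a])
        l_vqa = F.cross_entropy(self.lm_head(h[:, tv + tq - 1:-1]).flatten(0, 1),
                                answer_ids.flatten())

        # VAQ: (video, answer) -> question, the same head predicts question tokens.
        h = self._hidden([v, a, q])
        l_vaq = F.cross_entropy(self.lm_head(h[:, tv + ta - 1:-1]).flatten(0, 1),
                                question_ids.flatten())

        # QAV: (question, answer) -> video; regressing onto the input frame
        # features is used here purely as an illustrative stand-in.
        h = self._hidden([q, a, v])
        l_qav = F.mse_loss(self.visual_head(h[:, tq + ta - 1:-1]), frames)

        return l_vqa + l_vaq + l_qav


# Toy usage with a small causal Transformer standing in for the frozen LLM.
class ToyCausalBackbone(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        return self.enc(x, mask=mask)


model = FlippedVQASketch(ToyCausalBackbone(64), hidden_dim=64, visual_dim=32, vocab_size=1000)
loss = model(torch.randn(2, 10, 32),              # 10 frame-level visual features
             torch.randint(0, 1000, (2, 8)),      # question token ids
             torch.randint(0, 1000, (2, 4)))      # answer token ids
loss.backward()                                   # gradients flow only into the small added layers
```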
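
The efficiency point above can be made concrete by counting trainable versus total parameters once the backbone is frozen. The helper below is a generic illustration; the exact fraction depends on the chosen backbone and is not a figure from the paper.

```python
# Generic helper to quantify parameter-efficient fine-tuning: with the
# backbone frozen, only the newly added projection and head layers report
# requires_grad=True, so the trainable fraction stays small.
import torch.nn as nn


def trainable_fraction(model: nn.Module) -> float:
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / max(total, 1)


# e.g. with the FlippedVQASketch instance from the previous block:
# print(f"{100 * trainable_fraction(model):.2f}% of parameters are trainable")
```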
Implications and Future Directions
The findings underscore the potential of LLMs in multimodal applications beyond text-based tasks, including video-based reasoning. Their demonstrated ability to act as robust temporal and causal reasoners points to promising directions for interactive AI systems that understand and predict human actions and interactions more accurately.
Despite the promising results, the paper acknowledges the computational expense imposed by the sheer size of LLMs. Future work might therefore streamline these models' inference or seek a better balance between model size and computational efficiency.
Further research could also explore extending the Flipped-VQA framework to include additional modalities or more diverse types of reasoning tasks, potentially increasing the adaptability and practical applicability of LLMs in more complex real-world scenarios.
Overall, this paper contributes significant advancements in understanding and leveraging LLMs for temporal and causal reasoning tasks in VideoQA, setting the stage for future explorations in the domain of AI-aided multimedia comprehension.