Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
The paper "Zero-Shot Video Question Answering via Frozen Bidirectional LLMs" addresses the complex task of Video Question Answering (VideoQA) by conceptualizing a framework that avoids reliance on manually annotated visual question-answer pairs. The manual annotation is often cost-prohibitive, and prior methods in this space have adopted zero-shot settings to bypass this requirement. The paper introduces a method that leverages frozen bidirectional LLMs (BiLM) in tandem with visual inputs for performing zero-shot VideoQA.
The core contribution of the work is the novel approach to adapting frozen BiLMs, traditionally used for processing text-only data, to accommodate multi-modal inputs, specifically videos paired with captions manually scraped from the web. The project's ambition is rooted in maximizing efficacy while minimizing computational overhead compared to prior autoregressive models.
The methodology is structured around three key components: (1) integrating visual inputs with frozen BiLMs via minimal trainable modules, (2) training these modules on web-scraped video-caption data, and (3) performing inference via masked language modeling. This approach, termed FrozenBiLM, reportedly outperforms existing zero-shot methods across benchmarks such as LSMDC-FiB, iVQA, and MSRVTT-QA, among others.
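To make the architecture concrete, the following is a minimal, illustrative PyTorch sketch rather than the authors' released code. It assumes a HuggingFace masked language model (bert-base-uncased stands in for the paper's much larger DeBERTa backbone), uses a single linear projection as the lightweight visual module, and expects video features from some frozen visual encoder; the small adapter layers described in the paper are omitted for brevity.

```python
# Sketch of a FrozenBiLM-style model: a frozen bidirectional masked LM plus a
# small trainable projection mapping video features into the token embedding
# space. Model name and feature dimensions are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModelForMaskedLM


class FrozenBiLMSketch(nn.Module):
    def __init__(self, lm_name="bert-base-uncased", video_feat_dim=768):
        super().__init__()
        self.lm = AutoModelForMaskedLM.from_pretrained(lm_name)
        # Freeze every parameter of the text-only-pretrained BiLM.
        for p in self.lm.parameters():
            p.requires_grad = False
        hidden = self.lm.config.hidden_size
        # Lightweight trainable module: project per-frame video features
        # (e.g. from a frozen image encoder) into the LM's embedding space.
        self.visual_proj = nn.Linear(video_feat_dim, hidden)

    def forward(self, video_feats, input_ids, attention_mask):
        # video_feats: (batch, num_frames, video_feat_dim)
        text_embeds = self.lm.get_input_embeddings()(input_ids)
        video_embeds = self.visual_proj(video_feats)
        # Prepend projected video tokens to the text embeddings.
        inputs_embeds = torch.cat([video_embeds, text_embeds], dim=1)
        video_mask = torch.ones(
            video_feats.shape[:2],
            device=attention_mask.device,
            dtype=attention_mask.dtype,
        )
        full_mask = torch.cat([video_mask, attention_mask], dim=1)
        out = self.lm(inputs_embeds=inputs_embeds, attention_mask=full_mask)
        # Return masked-LM logits for the text positions only.
        return out.logits[:, video_feats.shape[1]:, :]
```

Under this setup, only the projection (and, in the paper, the adapters) would be trained with a masked language modeling loss on web-scraped video-caption pairs, while the backbone stays frozen.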
From an implementation perspective, the paper emphasizes preserving the text-only-pretrained BiLM by freezing its parameters while appending lightweight visual modules. It exploits the models' intrinsic bidirectionality to handle masked text processing and zero-shot inference effectively. This use of BiLMs for multi-modal reasoning contrasts with existing autoregressive approaches, delivering stronger performance at a fraction of the computational expense.
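The zero-shot inference step can be sketched as follows, continuing the FrozenBiLMSketch class above: the question is phrased with a [MASK] token and each candidate answer is scored by the frozen masked-LM head at that position. The prompt template, the single-token treatment of candidates, and the helper name answer_question are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of zero-shot VideoQA inference via masked language modeling:
# rank candidate answers by their log-probability at the [MASK] position.
import torch
from transformers import AutoTokenizer


@torch.no_grad()
def answer_question(model, tokenizer, video_feats, question, candidates):
    # Hypothetical prompt template with a single masked answer slot.
    prompt = f"Question: {question} Answer: {tokenizer.mask_token}."
    enc = tokenizer(prompt, return_tensors="pt")
    logits = model(video_feats, enc["input_ids"], enc["attention_mask"])
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    log_probs = logits[0, mask_pos].log_softmax(dim=-1)
    # Score each candidate by its first subword token (a simplification).
    scores = {}
    for cand in candidates:
        cand_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(cand))[0]
        scores[cand] = log_probs[cand_id].item()
    return max(scores, key=scores.get)


# Example usage (video_feats would come from a frozen visual encoder):
# tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# model = FrozenBiLMSketch()
# video_feats = torch.randn(1, 8, 768)
# print(answer_question(model, tokenizer, video_feats,
#                       "what is the man holding?", ["guitar", "phone", "ball"]))
```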
In terms of practical outcomes, FrozenBiLM is not only effective in zero-shot scenarios but also competitive in few-shot and fully-supervised settings. The empirical analyses show substantial accuracy improvements across diverse VideoQA benchmarks, and the ablation studies underscore the importance of incorporating visual inputs and of adapter training, clarifying the contribution of each architectural choice.
Theoretical implications point toward a shift in designing language models for multi-modal tasks, suggesting that bidirectionality in language modeling may be better suited for integrating heterogeneous data streams than autoregressive generation. From a practical standpoint, the approach also reduces resource requirements, which is increasingly important given growing concerns about the computational and environmental costs of AI.
In terms of future directions, the framework opens opportunities for scaling BiLMs and training on larger, more diverse multi-modal datasets, such as broad collections of YouTube videos. A current limitation is the framework's restricted suitability for open-ended generative tasks such as video captioning, yet this groundwork suggests a promising shift from traditional autoregressive models toward more efficient bidirectional methodologies in AI research.
In conclusion, the paper positions FrozenBiLM as an efficient paradigm for zero-shot VideoQA that combines bidirectional language models with minimal additional training, advocating a new trajectory in language-vision integration research.