NExT-QA Benchmark

Updated 2 September 2025
  • NExT-QA is a VideoQA benchmark that emphasizes deep temporal and causal reasoning, going beyond basic scene description.
  • It uses both multi-choice and open-ended question formats to systematically evaluate models on causal chains and action sequences.
  • Empirical results reveal a significant gap between model and human performance, highlighting the need for stronger structured (graph-based) reasoning and more effective feature-integration methods.

NExT-QA is a video question answering (VideoQA) benchmark specifically devised to drive progress from mere scene description toward explaining and reasoning about temporal actions in naturalistic videos. It introduces rigorous requirements for causal and temporal reasoning, expanding the evaluative landscape beyond traditional descriptive VQA tasks. The benchmark’s dual-task formulation (multi-choice and open-ended question answering) and its comprehensive annotation process uniquely position it as a testbed for developing and assessing models capable of deep video understanding.

1. Motivations and Benchmark Design

NExT-QA was established to address a fundamental gap in VideoQA: while previous benchmarks primarily emphasized surface description of scenes or objects, they did not systematically evaluate a model’s ability to perform causal and temporal action reasoning. The intent is to challenge models to not only identify “what” occurs but to also explain “why” and “how”—for instance, elucidating the causal chain resulting in an event or placing actions in correct temporal sequence.

The dataset’s videos are sourced from VidOR and YFCC-100M, ensuring coverage of real-world daily activities with complex object interactions. This distinguishes NExT-QA from earlier datasets which often consisted of short, isolated clips or focused predominantly on descriptive querying.

2. Task Formulation and Question Typology

NExT-QA is defined by two main QA formats:

  • Multi-choice QA: For each question, five candidate answers are provided (one correct, four plausible but incorrect distractors). Distractors are constructed to be semantically close to the ground truth, reducing the risk that the correct answer can be identified through superficial lexical similarity alone.
  • Open-ended QA: Answers must be generated as free-form text, typically at the phrase level, with no list of candidate responses available.

Questions are stratified into three categories:

  • Causal: why or how actions occur; requires reasoning over a visible cause-effect chain (e.g., “Why is the toddler crying?”)
  • Temporal: order, co-occurrence, or sequence of actions (e.g., “What did the boy do before…?”)
  • Descriptive: traditional scene or object queries (e.g., “How many bowls are on the table?”)

Causal questions constitute approximately 48% of the set, temporal 29%, and descriptive the remainder. The strong emphasis on causal and temporal reasoning manifests in both task construction and annotation guidelines.
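
To make the task formulation concrete, the following is a minimal sketch of how a multi-choice item could be represented in code. The field names and example candidate answers are illustrative assumptions, not the official annotation schema; consult the benchmark repository for the exact format.

```python
# Minimal, illustrative representation of a NExT-QA-style multi-choice item.
# Field names and the example candidate answers are assumptions for clarity;
# the official repository defines the actual annotation format.
from dataclasses import dataclass

@dataclass
class MultiChoiceItem:
    video_id: str
    question: str
    choices: list[str]    # five candidates: one correct, four hard distractors
    answer_index: int     # index of the ground-truth answer within `choices`
    question_type: str    # "causal", "temporal", or "descriptive"

item = MultiChoiceItem(
    video_id="example_video_0001",
    question="Why is the toddler crying?",
    choices=["fell off the chair", "wants the toy", "is hungry",
             "saw a stranger", "is sleepy"],
    answer_index=1,
    question_type="causal",
)
assert len(item.choices) == 5
```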

3. Evaluation of Baseline and Established Models

A comprehensive suite of baseline tests reveals the challenge posed by NExT-QA:

  • Heuristic-based baselines (e.g., always select the “longest” or “most popular” answer) yield only marginal improvements over random chance (∼20% accuracy), indicating that answer distribution bias cannot be exploited for success.
  • Retrieval-based methods like SimAA and SimQA (utilizing Sentence-BERT for similarity scoring) also perform poorly; semantic similarity does not capture the requisite logic or temporal reasoning.
  • BlindQA studies (answering without video input) using pre-trained language models such as a fine-tuned BERT (BERT-FT) exhibit a considerable deficit (∼43% on causal questions versus ∼88% for humans), underscoring the necessity of genuine video understanding.
  • Established VideoQA models (EVQA, STVQA, CoMem, HME, HCRN, and HGA) adapted to NExT-QA perform significantly better on descriptive queries but are markedly inferior on causal and temporal subsets. For example, HGA (graph-based reasoning) achieves only ∼48–50% accuracy on these subsets, dramatically lower than human baselines (87–90%); the sketch after this list shows how such per-category accuracies are computed.
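
As a rough illustration of how the per-category multi-choice results above are obtained, the snippet below computes plain accuracy grouped by question type. The item field names are hypothetical; the official evaluation code in the NExT-QA repository is authoritative.

```python
# Hedged sketch: multi-choice accuracy broken down by question type.
# Item fields ("qid", "question_type", "answer_index") are assumed names;
# use the official NExT-QA evaluation script for reported numbers.
from collections import defaultdict

def per_type_accuracy(predictions: dict[str, int], items: list[dict]) -> dict[str, float]:
    """Map each question type to the fraction of correctly answered questions."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for item in items:
        qtype = item["question_type"]  # "causal", "temporal", or "descriptive"
        total[qtype] += 1
        if predictions.get(item["qid"]) == item["answer_index"]:
            correct[qtype] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}

# Toy usage with two hypothetical items:
items = [
    {"qid": "q1", "question_type": "causal", "answer_index": 2},
    {"qid": "q2", "question_type": "temporal", "answer_index": 0},
]
print(per_type_accuracy({"q1": 2, "q2": 3}, items))  # {'causal': 1.0, 'temporal': 0.0}
```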

In the open-ended QA regime, model performance degrades further. The WUPS metric is used for these evaluations, defined as

WUPS(P, R) = \min\left\{ \prod_{p \in P} \max_{r \in R} \mathrm{WUP}(p, r),\; \prod_{r \in R} \max_{p \in P} \mathrm{WUP}(r, p) \right\} \times 100

where P and R denote the prediction and reference answer word sets, and WUP is the Wu-Palmer similarity.
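
A minimal sketch of this metric, using NLTK's WordNet interface for Wu-Palmer similarity, is shown below. Tokenization, out-of-vocabulary handling, and thresholded variants (e.g., WUPS@0.9, used in some VQA work) are omitted; the benchmark's released evaluation code should be used for official numbers.

```python
# Sketch of the WUPS formula above using NLTK WordNet Wu-Palmer similarity.
# Requires: pip install nltk  and  nltk.download("wordnet") beforehand.
# Out-of-vocabulary words and thresholding are not handled here.
from nltk.corpus import wordnet as wn

def wup(word_a: str, word_b: str) -> float:
    """Best Wu-Palmer similarity over all WordNet synset pairs of two words."""
    best = 0.0
    for syn_a in wn.synsets(word_a):
        for syn_b in wn.synsets(word_b):
            sim = syn_a.wup_similarity(syn_b)
            if sim is not None and sim > best:
                best = sim
    return best

def wups(prediction: set[str], reference: set[str]) -> float:
    """WUPS(P, R) = min{prod_p max_r WUP(p, r), prod_r max_p WUP(r, p)} * 100."""
    if not prediction or not reference:
        return 0.0
    prod_pred = 1.0
    for p in prediction:
        prod_pred *= max(wup(p, r) for r in reference)
    prod_ref = 1.0
    for r in reference:
        prod_ref *= max(wup(r, p) for p in prediction)
    return min(prod_pred, prod_ref) * 100.0

print(wups({"play", "ball"}, {"play", "football"}))
```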

4. Insights and Implications for Model Development

The observed performance gap—approximately 10% (multi-choice) and up to 30% (open-ended) between models and human annotators on causal and temporal questions—demonstrates the inability of current architectures to engage in deeper reasoning. While superficial scene understanding is well within reach of state-of-the-art VideoQA systems, causal and temporal action reasoning remain largely unsolved.

Detailed analysis suggests:

  • Graph-based approaches (e.g., HGA) hold promise by explicitly encoding inter-object and temporal relationships, suggesting that models leveraging explicit structure may better address complex reasoning.
  • Appearance-motion feature fusion: Simple concatenation yields suboptimal outcomes; more sophisticated integration strategies are required (see the sketch following this list).
  • Vision-language representation: Adapting or fine-tuning BERT-based language encoders improves results over off-the-shelf variants, yet holistic approaches, potentially via joint transformer-style models (e.g., VideoBERT, ViLBERT), are expected to further close the reasoning gap.
  • Open-ended generation: Success in open-ended QA remains strongly limited by the models’ ability to ground generated answers in video evidence, not merely textual correlation.
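
To illustrate the fusion point in the list above, the sketch below contrasts plain concatenation of appearance and motion features with a simple learned gating mechanism. This is a generic illustration under assumed feature dimensions, not the architecture of any model evaluated on NExT-QA.

```python
# Illustrative only: naive concatenation vs. a simple gated fusion of
# appearance and motion features. Dimensions and module design are assumptions.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Mix appearance and motion features with a content-dependent sigmoid gate."""
    def __init__(self, dim: int = 2048):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj_app = nn.Linear(dim, dim)
        self.proj_mot = nn.Linear(dim, dim)

    def forward(self, appearance: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        gate = self.gate(torch.cat([appearance, motion], dim=-1))
        return gate * self.proj_app(appearance) + (1 - gate) * self.proj_mot(motion)

appearance = torch.randn(8, 16, 2048)  # (batch, clips, feature dim)
motion = torch.randn(8, 16, 2048)

concatenated = torch.cat([appearance, motion], dim=-1)   # simple baseline fusion
fused = GatedFusion(2048)(appearance, motion)            # learned, adaptive mixing
print(concatenated.shape, fused.shape)  # (8, 16, 4096) and (8, 16, 2048)
```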

5. Dataset Structure and Annotation Methodology

NExT-QA comprises 5,440 videos and approximately 52,000 manually curated question-answer pairs, split as follows:

  • Train: 3,870 videos
  • Validation: 570 videos
  • Test: 1,000 videos

Videos average 44 seconds, depicting rich, contextually interrelated actions. Annotation is executed in three phases to ensure question–answer pairs are precisely grounded in the video. For multi-choice QA, distractors are selected based on cosine similarity from Sentence-BERT embeddings and further filtered for semantic plausibility without overlap with the correct answer.
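
The distractor-mining step described above can be sketched with the sentence-transformers library as follows. The checkpoint name and candidate texts are placeholders; the actual pipeline also includes manual filtering for semantic plausibility.

```python
# Sketch of ranking distractor candidates by Sentence-BERT cosine similarity
# to the correct answer. Checkpoint and candidate texts are illustrative only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

correct_answer = "the toddler fell down"
candidate_pool = [
    "the toddler dropped the toy",
    "the boy is riding a bicycle",
    "the toddler is being fed",
    "a dog runs across the lawn",
    "the girl waves at the camera",
]

answer_emb = model.encode(correct_answer, convert_to_tensor=True)
pool_emb = model.encode(candidate_pool, convert_to_tensor=True)
scores = util.cos_sim(answer_emb, pool_emb)[0]

# The most similar (but non-identical) candidates become hard distractors,
# subject to manual filtering so they do not overlap with the correct answer.
ranked = sorted(zip(candidate_pool, scores.tolist()), key=lambda pair: -pair[1])
for text, score in ranked[:4]:
    print(f"{score:.3f}  {text}")
```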

6. Accessibility and Research Utilization

All resources stemming from NExT-QA—including videos, question-answer annotations, distractor sets, evaluation code, and baseline implementations—are publicly accessible at https://github.com/doc-doc/NExT-QA.git. The availability of comprehensive evaluation protocols and the WUPS metric for open-ended tasks further facilitates rigorous comparative studies.

7. Influence and Future Directions

NExT-QA’s design and analysis identify several promising research directions:

  • Improved structured reasoning via explicit graphs for object-action-temporal relations
  • More effective integration of appearance and motion cues
  • End-to-end pre-training of vision-language models tailored for causal and temporal inference
  • Refinement of natural language generation methods for free-form, evidence-grounded answers

This suggests that future VideoQA models must transcend shallow matching to achieve human-level performance on complex, temporally-structured understanding tasks. The pronounced gap between state-of-the-art methods and human annotators underscores NExT-QA’s utility as a benchmark for measuring and driving advances in high-order video reasoning.
