HERBench: Multi-Evidence VideoQA Benchmark

Updated 20 December 2025
  • HERBench is a benchmark designed for VideoQA that requires compositional reasoning over at least three non-redundant, temporally distinct evidential cues.
  • It enforces a minimum evidential requirement to eliminate shortcuts from single-frame or text-only solutions, ensuring genuine cross-time cue integration.
  • The MRFS metric quantifies evidential demand, revealing current models’ fusion deficits and guiding improvements in multi-frame aggregation.

HERBench (High Evidential Requirement Benchmark) is a Video Question Answering (VideoQA) evaluation suite designed to rigorously assess and advance models’ capacity for multi-evidence integration across temporally separated video segments. Unlike earlier benchmarks that often admit single-frame or shallow shortcuts, HERBench is structured so that each question necessitates compositional reasoning over at least three non-redundant, temporally distinct evidential cues, compelling genuine cross-time aggregation and robust video understanding (Ben-Ami et al., 16 Dec 2025).

1. Motivation and Conceptual Basis

Current VideoQA benchmarks frequently suffer from evidential under-specification, permitting correct answers through language priors, isolated salient frames, or static scene features. This leaves crucial capabilities—such as binding an entity's appearance to its later actions, evaluating temporal order, or verifying event consistency—insufficiently tested. HERBench was developed in response to the recognition that authentic video understanding, particularly in naturalistic domains, demands compositional aggregation of visually and temporally distributed cues. It explicitly enforces an evidential requirement (ER) of $k \ge 3$ distinct, non-overlapping cues per question, thereby preventing solution via single-snapshot reasoning (Ben-Ami et al., 16 Dec 2025).

2. Benchmark Construction and Structure

HERBench comprises 26,806 five-way balanced multiple-choice questions sourced from 336 videos (average duration 395 s, range 60–2100 s) across egocentric and third-person datasets including HD-EPIC, WildTrack, PersonPath22, and YouTube trailers. Each question targets one of twelve compositional tasks, grouped into four families according to their reasoning demands. All question types require compositional integration of evidence but probe different facets of cross-temporal and multi-entity reasoning.

Table: Reasoning Families and Representative Tasks

| Reasoning Family | Representative Tasks | Key Requirement |
|---|---|---|
| Temporal Reasoning | TSO, MPDR, ASII | Chronological order, durations |
| Referring – Tracking | AGBI, AGAR, AGLT | Identity binding, trajectory |
| Global Consistency – Verification | FAM, SVA, FOM | Non-occurrence, scene consistency |
| Multi-Entity Aggregation – Numeracy | MEGL, AC, RLPC | Counting, set membership |

Task Examples

  • Temporal Shot Ordering (TSO): Reconstruct chronological order from four shuffled shot descriptions, requiring mapping to at least three distinct video segments.
  • Multi-Person Duration Reasoning (MPDR): Compare intervals of appearance for described people, necessitating interval-based aggregation.
  • Appearance-Grounded Behavior Interactions (AGBI): Track a person via appearance and link with behavioral interactions across multiple glimpses.
  • False Action Memory (FAM): Identify an action that never occurs by verifying the presence of candidates in three or more moments.
  • Action Counting (AC): Count disjoint instances of a specific action-object pair throughout the timeline, precluding single-frame solutions.
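
Putting this structure together, the sketch below shows a hypothetical record for a single HERBench item, inferred from the construction described above; all field names are assumptions rather than the released format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class HERBenchQuestion:
    """Hypothetical record for one HERBench item (field names assumed)."""
    video_id: str                        # one of the 336 source videos
    task: str                            # e.g. "TSO", "MPDR", "AGBI", "FAM", "AC"
    family: str                          # one of the four reasoning families
    question: str
    options: List[str]                   # five balanced answer choices
    answer_index: int                    # index of the correct option (0-4)
    evidence: List[Tuple[float, float]]  # >= 3 non-overlapping (start_s, end_s) cues
```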

3. Minimum Required Frame-Set (MRFS) and Evidential Demand

HERBench introduces a formal metric, the Minimum Required Frame-Set (MRFS), to quantify the evidential demand inherent in each question. For a model $f$, frame-ranking heuristic $r$, frame budget $x$, and correctness indicator $E(\hat y, y)$,

$$\mathrm{MRFS}_x(q; f, r) = \min\{\, k \in \{1, \dots, x\} : E(f(q, F_k), y) = 1 \,\}$$

where $F_k = \{\pi_1, \dots, \pi_k\}$ is the set of top-$k$ frames ranked by $r(v, q)$ over video $v$ for question $q$, and the condition $E(f(q, \varnothing), y) = 0$ excludes questions solvable from text alone.

With $f$ fixed to Qwen2.5-VL, $r$ to the AKS adaptive sampler, and $x = 16$, HERBench attains a mean MRFS of $5.49$, versus NExT-QA ($2.61$), MVBench ($3.52$), and LongVideoBench ($4.07$), confirming that its questions demand significantly greater compositional frame fusion (Ben-Ami et al., 16 Dec 2025).
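
As a concrete reading of this definition, the following minimal sketch computes MRFS for one question, assuming a hypothetical `model(question, frames) -> answer` callable and a frame list already sorted by the ranking heuristic $r(v, q)$; neither interface is from the paper.

```python
from typing import Callable, List, Optional

def mrfs(question: str,
         answer: str,
         ranked_frames: List[object],                # sorted by r(v, q), best first
         model: Callable[[str, List[object]], str],  # assumed interface
         budget: int = 16) -> Optional[int]:
    """Smallest k <= budget such that the model answers correctly from
    the top-k ranked frames; None if no such k exists within the budget."""
    # E(f(q, {}), y) = 0: questions solvable without any frames are excluded.
    if model(question, []) == answer:
        return None
    for k in range(1, budget + 1):
        if model(question, ranked_frames[:k]) == answer:
            return k
    return None
```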

4. Evaluation Protocols, Metrics, and Failure Analysis

Thirteen state-of-the-art Video-LLMs—including GPT-4.1, Gemini-2.5-Flash, Qwen3-VL, Ovis-2.5, InternVL3.5, and LLaVA-OneVision—are evaluated under a five-way multiple-choice protocol (random baseline $20\%$), with 16 frames per video sampled uniformly. The primary metric is top-1 accuracy. Model error is disentangled into two sources:

  • Retrieval Deficit: failure of frame-selection modules (Uniform, Vanilla-BLIP, BOLT-ITS, AKS, Oracle Frames) to surface all relevant evidence.
  • Fusion Deficit: a fusion bottleneck that persists even when all essential frames are provided, measured via oracle-frame setups and statistical per-frame 'importance shares.'

Observed performance spans $31.4\%$–$42.1\%$ (mean $38.2\%$), only marginally above random chance, exposing pervasive multi-evidence fusion failures (Ben-Ami et al., 16 Dec 2025).
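
For concreteness, a minimal sketch of this scoring protocol, assuming a hypothetical `model.choose` interface and the item schema sketched earlier; uniform sampling picks evenly spaced frame indices.

```python
def sample_uniform(frames, n=16):
    """Pick n frames at evenly spaced indices across the video."""
    if len(frames) <= n:
        return list(frames)
    step = (len(frames) - 1) / (n - 1)
    return [frames[round(i * step)] for i in range(n)]

def top1_accuracy(model, dataset):
    """Five-way multiple-choice top-1 accuracy (random baseline is 20%)."""
    correct = 0
    for item in dataset:
        frames = sample_uniform(item.video_frames, 16)
        pred = model.choose(item.question, item.options, frames)  # 0-4 (assumed API)
        correct += int(pred == item.answer_index)
    return correct / len(dataset)
```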

Detailed Findings

  • Models perform best on tasks that can be solved by tracking a single entity (e.g., AGBI and AGAR reach $70$–$80\%$), despite the enforced compositional structure.
  • True multi-evidence aggregation tasks (e.g., AC, MEGL, TSO, SVA) yield near-chance accuracy ($\le 30\%$).
  • Even with Oracle Frames, accuracy remains below $50\%$ in most cases, revealing that the fusion deficit is fundamental rather than an artifact of frame selection.
  • Fusion diagnostics show that correct answers involve balanced per-frame logit importance (top-1 share $\sim 0.5$), whereas incorrect answers over-concentrate on one frame (top-1 share $\sim 0.8$), indicating oversimplified reasoning that disregards complementary evidence; a sketch of this diagnostic follows the list.
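
A minimal sketch of the top-1 share diagnostic mentioned above, assuming per-frame importance scores (e.g., from logit attribution) have already been computed; the attribution method itself is not shown.

```python
def top1_share(frame_importances):
    """Fraction of total absolute importance carried by the single most
    influential frame: ~0.5 suggests balanced multi-frame fusion,
    ~0.8 suggests the answer leans on one frame."""
    total = sum(abs(s) for s in frame_importances)
    return max(abs(s) for s in frame_importances) / total if total else 0.0

top1_share([0.40, 0.35, 0.25])  # balanced fusion: 0.40
top1_share([0.80, 0.15, 0.05])  # over-concentrated: 0.80
```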

5. Comparison with Existing Benchmarks

By enforcing $ER \ge 3$ and quantifying evidential demand via MRFS, HERBench surpasses prior benchmarks—NExT-QA, MVBench, LongVideoBench—in both compositional reasoning requirement and empirical multi-frame integration stress. Earlier datasets were susceptible to solution via single object, scene, or text-based priors, while HERBench precludes such shortcuts by its structural and evaluative constraints (Ben-Ami et al., 16 Dec 2025).

6. Implications for Model Development and Future Research

HERBench establishes multi-evidence retrieval and fusion as dual frontiers for robust Video-LLM advancement. This implies that future models will need:

  • Joint training regimes that tightly couple frame selection and reasoning modules.
  • Cross-frame attention architectures capable of weighing and compositing three or more temporally distributed cues (a minimal sketch follows this list).
  • Explicit integration of temporal logic primitives and fusion diagnostics to bridge the observed bottlenecks.
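
As one illustration of the second point, a minimal cross-frame attention pooling sketch in PyTorch; this is an assumed architecture fragment, not a component of HERBench or any evaluated model.

```python
import torch
import torch.nn as nn

class CrossFrameAttentionPool(nn.Module):
    """A question embedding attends over frame embeddings, so the fused
    representation can weigh several temporally distributed cues instead
    of collapsing onto a single frame."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, question_emb: torch.Tensor, frame_embs: torch.Tensor):
        # question_emb: (B, 1, D) query; frame_embs: (B, T, D) keys/values.
        fused, weights = self.attn(question_emb, frame_embs, frame_embs)
        # weights has shape (B, 1, T): per-frame importance shares, directly
        # comparable to the top-1 share diagnostic above.
        return fused.squeeze(1), weights
```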

A plausible implication is that specialized benchmarks like HERBench will serve as both yardsticks and diagnostic toolkits, guiding the trajectory of VideoQA to prioritize deep compositional aggregation over shallow, shortcut-driven performance (Ben-Ami et al., 16 Dec 2025).

7. Relationship to Human-Centric Multimodal Understanding

HERBench is distinct from benchmarks such as HERM-Bench, which focus on human-centric perception and cognition in image–text contexts. While both emphasize the importance of compositional reasoning, HERBench’s unique contribution is its quantitative enforcement and measurement of multi-evidence integration across time, structurally eliminating solution via single-cue or text-only priors (Li et al., 9 Oct 2024). The methodological advances in HERBench reveal analogous deficits—unbalanced evidence weighting, fusion bottlenecks—that also manifest in human-centric settings, highlighting broader challenges in multimodal LLM design.


HERBench constitutes a rigorous, structurally enforced solution to the limitations of prior VideoQA benchmarks, by making cross-time evidence integration both unavoidable and quantifiable. Its wide adoption is likely to accelerate Video-LLM development towards genuinely compositional video understanding and principled evaluation of multi-evidence aggregation capabilities (Ben-Ami et al., 16 Dec 2025).
