
JointAVBench: Multimodal Reasoning Benchmark

Updated 21 December 2025
  • JointAVBench is a comprehensive benchmark designed to evaluate joint audio-visual reasoning in advanced multimodal large language models using movie-scale video content.
  • It enforces strict multi-modal dependency by requiring integration of speech, sound events, music, and vocal traits across varied temporal spans and cognitive dimensions.
  • Empirical results reveal that while Omni-LLMs outperform unimodal models, they struggle with nuanced temporal abstraction and abstract reasoning, highlighting directions for future research.

JointAVBench is a comprehensive benchmark specifically developed to evaluate the joint audio-visual reasoning capabilities of advanced multimodal LLMs (Omni-LLMs) over movie-scale video content. It is distinguished by enforcing strict multi-modal dependency for all evaluation questions, systematically spanning five cognitive reasoning dimensions, four classes of audio information, and three levels of temporal scene span. JointAVBench introduces a data synthesis pipeline—coupling vision-LLMs, audio-LLMs, and a general-purpose LLM—to produce and filter high-quality multiple-choice questions requiring true joint audio-visual understanding, and establishes an empirical foundation for assessing the strengths and limitations of state-of-the-art models in this domain (Chao et al., 14 Dec 2025).

1. Strict Formulation and Evaluation Objectives

JointAVBench is designed to address deficiencies in previous audio-visual benchmarks, aiming to assess whether models can perform true cross-modal reasoning beyond independent modality comprehension. Each question is constructed to guarantee strict joint dependency: neither the visual nor the audio stream alone suffices to correctly answer, enforcing that models leverage multi-modal fusion. The benchmark systematically varies:

  • Audio signal type (speech, sound events, music, vocal traits);
  • Scene span (single-scene, cross-scene, and full-scene, each probing different memory and abstraction depths);
  • Cognitive demand (temporal localization, spatial reasoning, emotion and affect detection, plot/narrative inference, long-range memory).

This structure enables analysis of both the modality fusion abilities and temporal reasoning capabilities of Omni-LLMs, providing fine-grained insights into model strengths and bottlenecks (Chao et al., 14 Dec 2025).
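
A minimal sketch of how such a strict-dependency criterion could be checked programmatically is shown below. The callables `answer_with_vision_only`, `answer_with_audio_only`, and `answer_jointly` are hypothetical model wrappers, not identifiers from the benchmark's released code; the sketch only illustrates the filtering logic implied by the text.

```python
def is_strictly_joint(question, clip, answer_key,
                      answer_with_vision_only,
                      answer_with_audio_only,
                      answer_jointly):
    """Keep a QA pair only if neither modality alone suffices but fusion does.

    The three callables are assumed stand-ins that return a predicted
    option letter for the given question and clip.
    """
    vision_correct = answer_with_vision_only(question, clip) == answer_key
    audio_correct = answer_with_audio_only(question, clip) == answer_key
    joint_correct = answer_jointly(question, clip) == answer_key

    # Strict multi-modal dependency: a unimodal pass disqualifies the sample,
    # while the joint pass must still recover the correct answer.
    return joint_correct and not vision_correct and not audio_correct
```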

2. Taxonomy: Cognitive Dimensions and Audio Signal Types

Questions in JointAVBench are taxonomized by two orthogonal axes: cognitive skills and audio type.

Cognitive Dimensions:

  • Temporal: e.g., Speech-based Timepoint Localization, requiring the model to detect “when” a salient event occurs via both dialogue and visual cues.
  • Spatial: e.g., Sounding Object Grounding, requiring “where” an audio-visual event originated.
  • Emotional: e.g., Speaker Emotion Recognition, assessing identification of affective state utilizing voice and facial/body cues.
  • Plot: e.g., Plot Development Prediction or Character Relationship Inference, requiring high-level narrative reasoning.
  • Long-form: e.g., Cross-scene Association, forcing memory and integration across multiple, temporally distant video segments.

Audio Information Types:

  • Speech (SPE): explicit dialogue and narrative content.
  • Vocal Traits (VOT): paralinguistic cues—emotion, pitch, language accent.
  • Sound Events (SEV): discrete, non-speech sounds integrated with visuals (e.g., footsteps, breaking glass).
  • Music (MUS): soundtrack or mood-setting background.

This design ensures coverage of both high-level and low-level audio-visual reasoning tasks.
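
The two-axis taxonomy can be captured in a compact schema. The enum values and field names below are illustrative choices for clarity, not identifiers from the benchmark's release.

```python
from dataclasses import dataclass
from enum import Enum


class CognitiveDimension(Enum):
    TEMPORAL = "temporal"
    SPATIAL = "spatial"
    EMOTIONAL = "emotional"
    PLOT = "plot"
    LONG_FORM = "long_form"


class AudioType(Enum):
    SPEECH = "SPE"
    VOCAL_TRAITS = "VOT"
    SOUND_EVENTS = "SEV"
    MUSIC = "MUS"


@dataclass
class JointAVQuestion:
    """One multiple-choice item tagged along both taxonomy axes."""
    question: str
    options: list[str]
    answer_index: int
    dimension: CognitiveDimension
    audio_type: AudioType
    scene_span: str  # "single-scene", "cross-scene", or "full-scene"
```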

3. Scene Span and Multi-Temporal Reasoning

Three scene span categories are integral to JointAVBench:

  • Single-scene: Questions limited to a contiguous clip (<1 min), probing localized reasoning.
  • Cross-scene: Spanning several adjacent scenes (∼1–10 min), requiring association of events and maintaining context/memory.
  • Full-scene: Spanning the entire movie or long clip (>10 min), challenging long-range abstraction and memory.

Empirical results indicate significant drops in model accuracy when transitioning from single- to cross-scene tasks, highlighting limitations in current temporal abstraction and memory integration (Chao et al., 14 Dec 2025).
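
A simple way to bucket a question's temporal extent into the three span categories uses the approximate durations quoted above (under 1 minute, roughly 1–10 minutes, over 10 minutes). The thresholds here are assumptions drawn from that description, not the benchmark's exact boundaries.

```python
def classify_scene_span(start_s: float, end_s: float) -> str:
    """Map the temporal extent of the evidence needed for a question
    to one of the three JointAVBench span categories.

    Duration cutoffs are assumptions based on the approximate ranges
    described in the text.
    """
    duration_min = (end_s - start_s) / 60.0
    if duration_min < 1.0:
        return "single-scene"
    if duration_min <= 10.0:
        return "cross-scene"
    return "full-scene"
```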

4. Automated QA Generation Pipeline

To enable large-scale, high-quality annotation without costly manual effort, JointAVBench employs a dedicated semi-automated pipeline for question-answer (QA) generation and filtering:

  1. Scene Segmentation: PySceneDetect splits each video into logical scenes; similar adjacent scenes are merged to preserve context continuity.
  2. Omni-modal Captioning: For each scene:
    • VisionLLM (Qwen2.5-VL) generates dense visual captions.
    • AudioLLM_Omni (Qwen2.5-Omni) produces:
      • Speech transcriptions via Whisper.
      • Paralinguistic, sound event, and music captions.
    • Captions from all available modalities are aggregated.
  3. QA Pair Synthesis: For each taxonomy-defined QA type, GeneralLLM (Qwen2.5) synthesizes candidate QA pairs using structured prompts and selected captions.
  4. Quality Control: Generated QA pairs undergo:
    • Automatic general and task-specific quality checks for format and strict modality-dependency.
    • Human final filtering to retain top-quality samples.
  5. Distractor Generation: GeneralLLM creates plausible distractors for each verified QA, assembling the final MCQ format.

A total of 2,853 high-quality MCQ samples result, each requiring fusion of audio and visual information for correct resolution (Chao et al., 14 Dec 2025).
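
The pipeline can be summarized as a loop over detected scenes. The sketch below uses PySceneDetect's public `detect` API for step 1, while `caption_visual`, `caption_audio`, `synthesize_qa`, `passes_quality_checks`, and `generate_distractors` are hypothetical wrappers around the Qwen2.5-VL, Qwen2.5-Omni/Whisper, and Qwen2.5 stages described above; it is a sketch of the workflow, not the authors' implementation.

```python
from scenedetect import ContentDetector, detect  # PySceneDetect >= 0.6


def build_benchmark(video_path, caption_visual, caption_audio,
                    synthesize_qa, passes_quality_checks, generate_distractors):
    """Hedged sketch of the JointAVBench QA-generation pipeline.

    All callables are hypothetical stand-ins for the vision-LLM, audio-LLM,
    and general-LLM stages; only the scene-segmentation call is a real API.
    """
    # 1. Scene segmentation into (start, end) timecode pairs.
    scenes = detect(video_path, ContentDetector())

    mcq_samples = []
    for start, end in scenes:
        # 2. Omni-modal captioning: dense visual captions plus speech,
        #    paralinguistic, sound-event, and music descriptions.
        captions = {
            "visual": caption_visual(video_path, start, end),
            "audio": caption_audio(video_path, start, end),
        }

        # 3. QA synthesis for each taxonomy-defined question type.
        for qa in synthesize_qa(captions):
            # 4. Automatic quality control (human filtering follows offline).
            if not passes_quality_checks(qa, captions):
                continue
            # 5. Distractor generation to assemble the final MCQ.
            qa["options"] = generate_distractors(qa, captions)
            mcq_samples.append(qa)

    return mcq_samples
```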

5. Mathematical Definitions and Evaluation Metrics

Two key evaluation metrics are defined in the benchmark:

  • Audio-Visual Correlation Ratio (AVCR):

$$\mathrm{AVCR} = \frac{\#\text{ of QA requiring both audio and vision}}{\text{total } \#\text{ of QA}}$$

For JointAVBench, AVCR is 100%, reflecting strict dependency by construction.

  • Accuracy: Standard MCQ accuracy,

$$\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\left(\hat{y}_i = y_i\right)$$

where $N$ is the number of questions, $\hat{y}_i$ is the model's predicted answer for question $i$, and $y_i$ is the ground-truth answer.

These formalizations underpin the empirical analysis and enable cross-model comparisons on precisely aligned tasks.
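
These metrics reduce to a few lines of code. The sketch below assumes predictions and labels are given as lists of option indices and that each question carries a boolean flag indicating whether both modalities are required; the data layout is an assumption, not the benchmark's released format.

```python
def accuracy(predictions, labels):
    """Standard MCQ accuracy: fraction of questions answered correctly."""
    assert len(predictions) == len(labels)
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def avcr(requires_both_flags):
    """Audio-Visual Correlation Ratio: share of QA items that need both
    audio and vision. By construction this is 1.0 (100%) for JointAVBench."""
    return sum(requires_both_flags) / len(requires_both_flags)
```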

6. Empirical Findings and Baseline Model Performance

JointAVBench was used to systematically benchmark Omni-LLMs (notably Gemini 2.5-Pro and Qwen3-Omni), Video-LLMs, and Audio-LLMs. Major quantitative findings include:

  • The best-performing Omni-LLM obtains a mean accuracy of 62.6%, significantly outperforming the 30–50% accuracy range of unimodal baselines.
  • Performance is modality-dependent:
    • Strongest on sound-event and music tasks (visually grounded, less abstract integration).
    • Weakest on speech content and vocal trait questions (requiring nuanced cross-modal fusion).
  • Accuracy by scene span:
    • Single-scene: 60–75%.
    • Cross-scene: drops by 15–20 points to roughly 40–50%, reflecting temporal abstraction difficulty.
    • Full-scene: partial recovery (50–60%) on global, narrative-focused queries.
  • By cognitive dimension:
    • Omni-LLMs excel at plot and long-form memory, lag on emotion and certain spatial grounding tasks.
  • In all cases, joint (audio+visual) input outperforms either stream alone for nearly all task categories, substantiating the necessity and practical benefit of true multimodal reasoning.

This empirical landscape reveals that current Omni-LLMs possess only partial audio-visual integration capabilities, struggling most with abstract, temporally extended, and paralinguistic audio cues (Chao et al., 14 Dec 2025).

7. Comparative Context and Future Directions

Compared to prior efforts such as VABench (Hua et al., 10 Dec 2025), which focuses on the generation and alignment of synchronous audio-video content, JointAVBench directly targets reasoning and understanding over long-form, real-world video. The JWB-DH-V1 benchmark (Di et al., 28 Jul 2025) concentrates on avatar motion and region-specific audio-visual quality; JointAVBench's distinctive contribution is the systematic assessment of reasoning that fundamentally requires multi-modal fusion.

A plausible implication for future research is the development of models with richer temporal memory and more robust audio-visual abstraction, especially in the domains of cross-scene integration and higher-order abstract audio cue understanding. Increasing the scale and diversity of scene spans, elaborating QA taxonomies, and further automating high-quality QA curation constitute promising directions to close the significant headroom revealed by current benchmark results.

