MMAR Benchmark: Multimodal Reasoning
- The MMAR benchmark is a standardized suite that assesses deep cross-modal reasoning in audio-language and vision-language systems using real-world data.
- It employs a multi-layered reasoning taxonomy and expert-curated Chain-of-Thought rationales to evaluate performance across diverse, challenging tasks.
- Baseline comparisons reveal significant robustness challenges, and interventions such as MATA improve accuracy by addressing modality imbalance.
The MMAR benchmark is a standardized evaluation suite for assessing the deep reasoning capabilities of multi-modal models—especially audio-language and vision-language systems—across a spectrum of challenging real-world scenarios. Two distinct lines of MMAR benchmarks exist: one focuses on audio, speech, and music reasoning and the other on joint vision-language modeling and probabilistic generation. Both lines share an emphasis on diverse, real-world multimodal data requiring cross-modal understanding, compositional inference, and robust reasoning (Ma et al., 19 May 2025, Wang et al., 23 Sep 2025, Yang et al., 2024, López et al., 6 Oct 2025).
1. Benchmark Structure and Dataset Composition
Audio MMAR:
The primary MMAR benchmark for audio reasoning consists of 1,000 audio–question–answer triplets drawn from uncontrolled, “in-the-wild” internet videos. Each audio clip belongs to one of three single-modality categories (Sound, Music, Speech) or four mixed-modality categories (combinations of the three modalities, such as Sound+Music+Speech) (Wang et al., 23 Sep 2025, Ma et al., 19 May 2025). The duration of clips averages about 20 seconds, with a hard limit of 30 seconds (Ma et al., 19 May 2025). A five-stage expert and LLM-curated pipeline was used to ensure high fidelity, introducing a hierarchical taxonomy with four reasoning layers:
- Signal (acoustic feature identification)
- Perception (source/scene abstraction, spatial or temporal dynamics)
- Semantic (intent, event, and content analysis)
- Cultural (knowledge, social inference, domain expertise)
Every item is accompanied by a Chain-of-Thought (CoT) rationale outlining the logical or mathematical steps required for correct inference. The dataset is intentionally curated to avoid overlap with common training corpora such as AudioSet and to include novel, multi-hop reasoning items, including many requiring graduate-level perceptual and/or domain expertise (Ma et al., 19 May 2025).
Vision MMAR:
A separate MMAR line focuses on continuous probabilistic modeling over image and text modalities, using a suite of 18 visual question-answering tasks ranging from diagram understanding (AI2D) to mathematical reasoning (MathVista) and multi-modal scene analysis (MMBenchEN, MMBenchCN, etc.) (Yang et al., 2024). Here, each evaluation involves an image or visual input and a text query, with metrics unified to accuracy or task-specific scores.
2. Reasoning Task Formulation and Input/Output Protocol
Audio MMAR:
Each audio clip is paired with a question and a set of multiple-choice answers (usually four). Inputs to the system consist of:
- Raw audio waveform or pre-extracted features, tokenized into a sequence of audio tokens
- Natural-language question, tokenized into a sequence of text tokens
Both token streams are concatenated (with modality-type embeddings) and passed to a Transformer-based decoder for auto-regressive answer generation (Wang et al., 23 Sep 2025). The model’s task is to select (or generate) the correct answer in a format consistent with the provided options.
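The input packing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `PackedInput`, `pack_inputs`, `AUDIO`, and `TEXT` are hypothetical, audio tokens are shown as integer ids for simplicity (real systems may use continuous features), and placing audio before text is an assumption.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical modality-type ids; a real model would map these to learned embeddings.
AUDIO, TEXT = 0, 1


@dataclass
class PackedInput:
    token_ids: List[int]      # concatenated audio + text tokens
    modality_ids: List[int]   # one modality-type id per position


def pack_inputs(audio_tokens: List[int], text_tokens: List[int]) -> PackedInput:
    """Concatenate audio tokens and question tokens, tagging each position by modality."""
    return PackedInput(
        token_ids=list(audio_tokens) + list(text_tokens),
        modality_ids=[AUDIO] * len(audio_tokens) + [TEXT] * len(text_tokens),
    )


packed = pack_inputs([101, 102, 103], [7, 8])
```

The modality ids make it possible to tell audio-token positions from text-token positions later in the pipeline, for example when an intervention targets audio keys only.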
Vision MMAR:
The vision-language protocol is similarly formulated: a joint input consisting of an image (as continuous-valued tokens) and a text query is passed to an auto-regressive or diffusion-augmented model, which produces either a direct answer or a generated text sequence (Yang et al., 2024).
3. Evaluation Metrics and Robustness Protocols
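Accuracy is the principal metric, defined per category and averaged across all categories or sub-tasks. For a reasoning layer $\ell$, the formula is
$$\mathrm{Acc}_\ell = \frac{1}{N_\ell} \sum_{i=1}^{N_\ell} \mathbb{1}\left[\hat{y}_i = y_i\right]$$
where $N_\ell$ is the number of questions in layer $\ell$, $\hat{y}_i$ and $y_i$ are the predicted and ground-truth answers for question $i$, and $\mathbb{1}[\cdot]$ is the indicator function (Ma et al., 19 May 2025, Wang et al., 23 Sep 2025).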
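As a rough sketch of the accuracy computation, the snippet below groups predictions by layer (or category) and applies the indicator-function average. The function names and the record format are illustrative assumptions. Both a macro average (mean of group accuracies) and a micro average (over all questions) are shown, because the text does not say which the benchmark reports.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each record is (group, predicted_answer, gold_answer); the format is hypothetical.
Record = Tuple[str, str, str]


def per_group_accuracy(records: List[Record]) -> Dict[str, float]:
    """Accuracy per group: correct predictions divided by questions in that group."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for group, predicted, gold in records:
        total[group] += 1
        correct[group] += int(predicted == gold)  # indicator function
    return {group: correct[group] / total[group] for group in total}


def macro_average(group_accuracy: Dict[str, float]) -> float:
    """Unweighted mean of the per-group accuracies."""
    return sum(group_accuracy.values()) / len(group_accuracy)


def micro_average(records: List[Record]) -> float:
    """Accuracy over all questions, ignoring group boundaries."""
    return sum(int(p == g) for _, p, g in records) / len(records)


records = [("Signal", "A", "A"), ("Signal", "B", "C"), ("Semantic", "D", "D")]
accuracy = per_group_accuracy(records)
```

The two averages differ whenever groups have unequal sizes, so a report should state which one it uses.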
Robustness Analysis in MMAR (audio) includes systematic evaluation under multiple perturbations of the input text:
- Choice Ordering (24 permutations): examines if answer changes under option shuffling.
- Question Rephrasing, Ground-truth Rephrasing, Distractor Rephrasing: introduces alternate phrasings to test input-invariance.
- Mix-of-perturbations: applies all perturbations with independent probability 0.5 per axis (López et al., 6 Oct 2025).
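The perturbation axes above can be sketched in a few lines. This is an illustration under stated assumptions, not the protocol's reference code: the item layout, `use_question_paraphrase`, and `reverse_choices` are hypothetical stand-ins, and real rephrasings would come from a curated or generated paraphrase set.

```python
import itertools
import random
from typing import Callable, Dict, List, Tuple

Item = Dict[str, object]


def choice_orderings(choices: List[str]) -> List[List[str]]:
    """All orderings of the answer options (4 options give 4! = 24)."""
    return [list(p) for p in itertools.permutations(choices)]


def mix_of_perturbations(
    item: Item,
    perturbations: Dict[str, Callable[[Item], Item]],
    rng: random.Random,
    p: float = 0.5,
) -> Tuple[Item, List[str]]:
    """Apply each perturbation independently with probability p."""
    applied = []
    for name, perturb in perturbations.items():
        if rng.random() < p:
            item = perturb(item)
            applied.append(name)
    return item, applied


# Hypothetical stand-in perturbations for illustration only.
def use_question_paraphrase(item: Item) -> Item:
    return {**item, "question": item["question_paraphrase"]}


def reverse_choices(item: Item) -> Item:
    return {**item, "choices": list(reversed(item["choices"]))}


example = {
    "question": "Which instrument enters first?",
    "question_paraphrase": "What is the first instrument heard?",
    "choices": ["Piano", "Violin", "Flute", "Drums"],
    "answer": "Violin",
}
orderings = choice_orderings(example["choices"])
perturbations = {"question": use_question_paraphrase, "order": reverse_choices}
perturbed, applied = mix_of_perturbations(example, perturbations, random.Random(0))
```

Storing the gold answer as option text rather than as a letter keeps it valid after the options are reordered.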
Supplementary metrics:
- Consistency Rate (CR): fraction of times a model’s answer remains invariant across perturbations.
- Correctness Rate (CoR): mean accuracy over all perturbations.
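One plausible reading of these two metrics is sketched below. The exact definitions are assumptions: an item counts as consistent only if the model gives the same answer on every perturbed variant, and the correctness rate averages correctness over all item-variant pairs. Answers are compared by content rather than by option letter, so that option shuffling does not distort the result.

```python
from typing import Dict, List


def consistency_rate(predictions: Dict[str, List[str]]) -> float:
    """Fraction of items whose predicted answer is identical across all variants."""
    consistent = sum(1 for answers in predictions.values() if len(set(answers)) == 1)
    return consistent / len(predictions)


def correctness_rate(predictions: Dict[str, List[str]], gold: Dict[str, str]) -> float:
    """Mean accuracy over every (item, perturbed variant) pair."""
    total = sum(len(answers) for answers in predictions.values())
    correct = sum(
        int(answer == gold[item_id])
        for item_id, answers in predictions.items()
        for answer in answers
    )
    return correct / total


# Hypothetical predictions for two items, three perturbed variants each.
predictions = {"q1": ["cat", "cat", "cat"], "q2": ["dog", "bird", "dog"]}
gold = {"q1": "cat", "q2": "dog"}
cr = consistency_rate(predictions)
cor = correctness_rate(predictions, gold)
```

Here `q2` is mostly correct but not consistent, which shows why the two metrics are reported separately.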
These protocols reveal that even state-of-the-art models remain brittle: e.g., for Qwen2.5-Omni-7B, average accuracy drops from 59.0% (default) to 51.5% (distractor-rephrasing) and correctness rate in a mix-of-perturbations can fall below 30% (López et al., 6 Oct 2025). This highlights the importance of reporting robustness metrics in addition to standard accuracy.
4. Baseline Results and Model Comparisons
Audio MMAR:
Accuracy (%) by audio category for selected models without and with the MATA intervention (see next section). S = Sound, M = Music, Sp = Speech; Avg. is the overall average:
| Model | Sound | Music | Speech | S–M | S–Sp | M–Sp | S–M–Sp | Avg. |
|---|---|---|---|---|---|---|---|---|
| Gemini 2.0 Flash | 61.2 | 51.0 | 72.1 | 81.8 | 72.5 | 65.9 | 70.8 | 65.6 |
| Qwen-2.5-Omni-7B | 58.8 | 40.8 | 59.9 | 54.6 | 61.9 | 67.1 | 58.3 | 56.7 |
| Qwen-2.5-Omni-7B + MATA | 55.8 | 53.4 | 65.0 | 45.5 | 68.4 | 63.4 | 54.2 | 61.2 |
| Ke-Omni-R-7B (RL-tuned) | 65.5 | 54.9 | 64.6 | 63.6 | 71.1 | 64.6 | 62.5 | 64.1 |
| Ke-Omni-R-7B + MATA | 66.7 | 53.9 | 70.1 | 72.7 | 73.9 | 69.5 | 62.5 | 66.8 |
MATA intervention yields average improvements of +4.5 percentage points (Qwen-2.5-Omni-7B) and +2.7 points (Ke-Omni-R-7B), with the latter surpassing Gemini 2.0 Flash (Wang et al., 23 Sep 2025). Open-source LALMs otherwise perform in the 40–57% range; closed-source GPT-4o Audio is reported at 63.5%, and Gemini 2.0 Flash at 65.6% (Ma et al., 19 May 2025).
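The quoted gains follow directly from the Avg. column of the table, as this short check illustrates. The dictionary layout is only a convenience for the example; the numbers are copied from the table.

```python
# Average accuracy (%) from the table above: (without MATA, with MATA).
averages = {
    "Qwen-2.5-Omni-7B": (56.7, 61.2),
    "Ke-Omni-R-7B": (64.1, 66.8),
}
gemini_flash_avg = 65.6

# Gain in percentage points, rounded to one decimal to absorb float noise.
gains = {
    model: round(with_mata - without, 1)
    for model, (without, with_mata) in averages.items()
}
ke_omni_beats_gemini = averages["Ke-Omni-R-7B"][1] > gemini_flash_avg
```

These are absolute differences in percentage points, not relative improvements.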
Vision MMAR:
MMAR-7B (vision) reaches a mean score (AVE@18Und.) of 46.52 across the 18 visual understanding tasks, a reported +40.7% improvement over the best prior 7B-sized baselines (Yang et al., 2024). Per-task changes range from +125.1% (MMBenchEN) to –43.6% (DocVQA). Scaling MMAR from 0.5B to 7B parameters raises the mean score from 34.56 to 46.52, a relative increase of 34.6%.
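For reference, the scaling figure is a relative change in the mean score, which can be checked from the two quoted values:

```latex
\frac{46.52 - 34.56}{34.56} = \frac{11.96}{34.56} \approx 0.346 \quad (\text{i.e. } +34.6\%)
```

In absolute terms this is a gain of 11.96 points on the mean score.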
5. Key Technical Innovations: MATA Intervention and Model Architectures
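MATA Intervention (“Pay More Attention To Audio”): A training-free method designed to counteract audio-text attention imbalance in Transformer-based LALMs (Wang et al., 23 Sep 2025). Let $Q$, $K$, and $V$ be the query, key, and value matrices for the concatenated audio+text tokens. The standard (pre-softmax) attention score is $S = QK^\top / \sqrt{d}$, where $d$ is the key dimension. MATA rescales the scores of the last query position over audio-token keys by a factor $\alpha$, for which the authors report an optimal setting:
$$\tilde{S}_{n,j} = \begin{cases} \alpha \, S_{n,j}, & j \in \mathcal{A} \\ S_{n,j}, & \text{otherwise} \end{cases}$$
where $n$ indexes the last query position and $\mathcal{A}$ is the set of audio-token key positions.
No additional parameters or retraining are required; the intervention is applied immediately before the softmax in intermediate layers (typically layers 10–20). Empirical ablations confirm the largest gains in this layer range and at the optimal $\alpha$; computational overhead is negligible (Wang et al., 23 Sep 2025).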
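The rescaling step can be illustrated with a small pure-Python sketch that follows the description above: pre-softmax scores of the last query position are multiplied by a scaling factor (called `alpha` here) at audio-token key positions only. The function names are hypothetical, `alpha=2.0` is an arbitrary illustrative value rather than the setting reported by the authors, and the toy scores are chosen positive because a plain multiplicative factor would lower negative scores.

```python
import math
from typing import Iterable, List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rescale_audio_scores(
    last_query_scores: Sequence[float],
    audio_positions: Iterable[int],
    alpha: float,
) -> List[float]:
    """Multiply pre-softmax scores at audio-token key positions by alpha."""
    audio = set(audio_positions)
    return [alpha * s if j in audio else s for j, s in enumerate(last_query_scores)]


def audio_attention_mass(scores: Sequence[float], audio_positions: Iterable[int]) -> float:
    """Total post-softmax attention weight placed on audio-token keys."""
    weights = softmax(scores)
    return sum(weights[j] for j in set(audio_positions))


# Toy example: keys 0-1 are audio tokens, keys 2-3 are text tokens.
scores = [1.0, 1.0, 1.0, 1.0]
audio_positions = [0, 1]
rescaled = rescale_audio_scores(scores, audio_positions, alpha=2.0)
mass_before = audio_attention_mass(scores, audio_positions)
mass_after = audio_attention_mass(rescaled, audio_positions)
```

In this toy case the share of attention on audio keys rises after rescaling, with no change to any model weights. In a real model the same operation would be applied per attention head inside the selected layers.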
Vision MMAR Modeling:
A joint auto-regressive probabilistic framework over continuous image and discrete text tokens (Yang et al., 2024). Diffusion-based heads reparameterize the image modeling objective for greater information fidelity (“v-prediction”), and a two-stage training regime balances image understanding and generation, with both vanilla and instruction-tuned variants benchmarked.
6. Limitations, Error Analysis, and Recommendations
Performance Gaps and Failure Modes:
- Low-Level Reasoning: The Signal Layer (fine acoustic discrimination) yields lowest accuracy, especially on music-related items, exposing model limitations in detailed waveform analysis (Ma et al., 19 May 2025).
- Cross-Modal Integration: Mixed-modality items present further difficulty; models over-focus on speech/text tokens if not appropriately regularized—hence the need for MATA (Wang et al., 23 Sep 2025).
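- Fragility: MCQA accuracy is highly sensitive to formatting/rephrasing of questions and answer options; for Qwen2.5-Omni-7B, distractor rephrasing alone lowers average accuracy from 59.0% to 51.5%, and the mix-of-perturbations setting can push the correctness rate below 30%, roughly half the default accuracy (López et al., 6 Oct 2025).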
- Knowledge Gaps and Reasoning Errors: Perceptual misestimation, faulty causal chaining, and missing external domain knowledge are frequent (e.g., failings in Morse code, dialects, or instrument identification).
- Closed vs. Open-source Gap: Closed-source models (e.g., Gemini 2.0 Flash at 65.6%, GPT-4o Audio at 63.5%) lead most open-source LALMs, which score in the 40–57% range (Ma et al., 19 May 2025); among the results above, only the RL-tuned Ke-Omni-R-7B, particularly with MATA (66.8%), matches or exceeds them on average accuracy (Wang et al., 23 Sep 2025).
Recommendations:
- Expand coverage, especially of underrepresented modalities (e.g., polyphonic music, complex audio scenes).
- Develop stronger, pre-trained audio and vision encoders that reliably preserve fine temporal/spectral or spatial details.
- Incorporate explicit knowledge resources (audio/vision knowledge graphs).
- Establish robust evaluation reporting (mean, standard deviation, consistency/correctness rates) especially under text perturbations for fair comparison (López et al., 6 Oct 2025).
- Explore hybrid approaches that combine the advantages of caption-based pipelines and end-to-end multimodal reasoning.
7. Significance and Outlook
MMAR is now established as a principal benchmark for holistic multi-modal reasoning in audio, vision, and text, superseding simplistic classification or tagging tasks with real-world, hierarchical inference challenges (Ma et al., 19 May 2025, Wang et al., 23 Sep 2025, Yang et al., 2024). Its task diversity, rigorous curation (including CoT rationales), and robustness protocols ensure ongoing relevance for the evaluation of large multimodal models, reasoning architectures, and model interventions. The continued gap between open- and closed-source model performance, especially in robustness metrics, underscores the benchmark’s importance as both a diagnostic and developmental tool for the next generation of audio- and vision-centric AI.