OmniBench: Tri-Modal Reasoning in OLMs
- OmniBench is a large-scale evaluation suite designed to measure tri-modal reasoning in omni-language models by integrating visual, acoustic, and textual cues.
- It employs diagnostic tasks and carefully structured QA pairs that require simultaneous evidence integration from all modalities to prevent reliance on unimodal shortcuts.
- Evaluation metrics such as modality disparity, directional imbalance, and Path Balance Score reveal challenges in achieving robust, compositional cross-modal reasoning.
OmniBench and the Evaluation of Tri-Modal Reasoning in Omni-LLMs
OmniBench refers to a family of large-scale evaluation suites specifically targeting the tri-modal (visual, acoustic, and textual) reasoning capabilities of omni-LLMs (OLMs)—models architected to ingest, process, and reason across vision, audio, and language streams simultaneously. These benchmarks have emerged in response to the limitations of earlier multimodal benchmarks, which often focused only on bi-modal settings (text-image, text-audio) and failed to probe the genuinely cross-modal, modality-invariant inference required in real-world, tri-modal tasks. Recent work has clarified both the challenges and the diagnostic methodologies required to evaluate the cross-modal fusion, reasoning, and agency of contemporary OLMs (Wang et al., 16 Oct 2025, Chen et al., 2024, Chen et al., 21 Oct 2025, Kim et al., 22 Aug 2025, Bu et al., 10 Jun 2025, Li et al., 2024, Bie et al., 6 Aug 2025).
1. Formal Definition and Benchmarking Objectives
An omni-LLM (OLM) is formally a system built atop LLM backbones that accepts as input visual ($x_v$), acoustic ($x_a$), and textual ($x_t$) modalities and produces an output conditioned on a fused latent representation:

$y = f_{\text{LLM}}\left(E_v(x_v) \oplus E_a(x_a) \oplus E_t(x_t)\right)$

where $E_v$, $E_a$, $E_t$ are modality-specific embeddings and $\oplus$ denotes sequence concatenation (Li et al., 2024). The central benchmarking problem is to evaluate the ability of such OLMs not only to recognize unimodal content but also to perform consistent, modality-invariant, and compositional reasoning when tasks require simultaneous integration of all three channels.
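The fused-input definition above can be sketched in a few lines of NumPy: each modality is projected into a shared token space and the token sequences are concatenated before entering the backbone. All dimensions and encoder weights below are illustrative stand-ins, not any real model's parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # shared embedding dimension

def embed(tokens: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Project raw modality features (n_tokens, feat_dim) into the shared space."""
    return tokens @ proj

# Hypothetical raw features per modality (n_tokens, feature_dim).
x_v = rng.normal(size=(4, 32))   # visual patches
x_a = rng.normal(size=(6, 24))   # acoustic frames
x_t = rng.normal(size=(5, 48))   # text tokens

# Random stand-ins for the modality-specific embedding matrices E_v, E_a, E_t.
E_v = rng.normal(size=(32, d))
E_a = rng.normal(size=(24, d))
E_t = rng.normal(size=(48, d))

# Sequence concatenation: the fused representation the LLM backbone conditions on.
fused = np.concatenate([embed(x_v, E_v), embed(x_a, E_a), embed(x_t, E_t)], axis=0)
print(fused.shape)  # (15, 16): 4 + 6 + 5 tokens in a shared 16-dim space
```

The point of the sketch is only the shape arithmetic: the backbone sees one token sequence whose segments originate in different modalities.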
OmniBench-style benchmarks are designed to explicitly enforce and measure the necessity of holistic tri-modal reasoning, often by constructing diagnostic QA pairs where the correct answer cannot be reliably derived from any strict subset of the available modalities (Li et al., 2024, Wang et al., 16 Oct 2025).
2. Benchmark Structure, Task Taxonomy, and Dataset Design
Modern tri-modal benchmarks share a core design pattern: each task instance is designed such that (i) the context and query span at least two, and often all three, modalities; (ii) models are required to integrate evidence—rather than shortcutting via unimodal biases; and (iii) performance is analyzed not only via aggregate accuracy, but also along axes like cross-modal consistency, modality disparity, and directional imbalance (Wang et al., 16 Oct 2025, Chen et al., 21 Oct 2025).
Representative Benchmarks and Task Coverage
| Benchmark | Modalities | Size | Task Types/Distribution |
|---|---|---|---|
| OmniBench (Li et al., 2024) | V, A, T | 1,142 | Entity/File, Causal Inference, Abstract Concept (8 fine tasks) |
| XModBench (Wang et al., 16 Oct 2025) | V, A, T (6 directions) | 60,828 | Perception, Spatial, Temporal, Linguistic, External Knowledge (5 families/17 subtasks) |
| MMAO-Bench (OmniBench) (Chen et al., 21 Oct 2025) | Image, Video, Audio | 1,880 | 44 tasks: Perception, Reasoning (MC and multi-step open) |
| OmnixR (Chen et al., 2024) | Text, Image, Audio, Video | 1,400 (synth) / 100 (real) | STEM reasoning, scientific/lecture questions |
| CMR-SPB (Kim et al., 22 Aug 2025) | Text, Image, Speech | 2,390 | Multi-hop reasoning with balanced path permutations |
Tasks typically cover:
- Perception: fine-grained event/action/entity recognition (with cross-channel alignment).
- Spatial and Temporal Reasoning: localization, arrangement, motion, ordering, counting.
- Linguistic Understanding & External Knowledge: transcription, translation, knowledge linkage.
- Complex/Long-Chain Reasoning: multi-hop logic, compositionality, chain-of-thought.
Benchmarks enforce modality dependence via:
- Ablation test gating: excluding instances answerable from any strict subset.
- Permutation of context and candidate modalities: ensuring all six context–candidate directions for comprehensive analysis (Wang et al., 16 Oct 2025).
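The six context–candidate directions follow directly from ordering the three modalities two at a time; a balanced benchmark instantiates every task in all six. A minimal enumeration:

```python
from itertools import permutations

# The three modalities yield six ordered context -> candidate directions.
MODALITIES = ("text", "vision", "audio")

directions = list(permutations(MODALITIES, 2))
for ctx, cand in directions:
    print(f"context={ctx:6s} -> candidates={cand}")
print(len(directions))  # 6
```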
3. Evaluation Metrics and Diagnostic Dimensions
OmniBench-type evaluations move beyond mean accuracy to dissect performance via several tailored metrics (Wang et al., 16 Oct 2025, Chen et al., 21 Oct 2025, Kim et al., 22 Aug 2025):
- Accuracy per Modality Composition: per-direction accuracy $\text{Acc}(m_c \to m_q)$ for each ordered context–query modality pair $(m_c, m_q)$.
- Task Competence: average accuracy over all six directional modality pairs for each task, $\text{Comp} = \frac{1}{6} \sum_{(m_c, m_q)} \text{Acc}(m_c \to m_q)$.
- Modality Disparity: the spread between the best- and worst-performing modality configurations, $\Delta_{\text{mod}} = \max_{(m_c, m_q)} \text{Acc}(m_c \to m_q) - \min_{(m_c, m_q)} \text{Acc}(m_c \to m_q)$.
- Directional Imbalance: the gap between a direction and its reverse, $\Delta_{\text{dir}}(m_1, m_2) = \left| \text{Acc}(m_1 \to m_2) - \text{Acc}(m_2 \to m_1) \right|$.
- Multi-hop Path Balance: macro-average accuracy and variance over all possible reasoning paths, with the Path Balance Score (PBS) penalizing the macro-average by the dispersion across paths, $\text{PBS} = \overline{\text{Acc}}_{\text{path}} - \sigma_{\text{path}}$, ensuring that models do not exploit overrepresented reasoning permutations (Kim et al., 22 Aug 2025).
- Ablation Deltas: drop in accuracy when a modality is masked, serving as an index of truly joint representation.
Multi-step open-ended questions are scored via aggregated sub-task points, penalizing failures in intermediate reasoning stages (Chen et al., 21 Oct 2025).
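These diagnostics are cheap to compute once per-direction accuracies are tabulated. The sketch below uses toy numbers and one plausible PBS form (macro-average minus standard deviation across paths); the exact published formulas may differ.

```python
import statistics

# Toy per-direction accuracies for one task: keys are (context, candidate).
acc = {
    ("text", "vision"): 0.82, ("vision", "text"): 0.78,
    ("text", "audio"):  0.61, ("audio", "text"):  0.45,
    ("vision", "audio"): 0.58, ("audio", "vision"): 0.52,
}

# Task competence: mean over all six directions.
competence = statistics.mean(acc.values())

# Modality disparity: spread between the best and worst direction.
disparity = max(acc.values()) - min(acc.values())

# Directional imbalance: gap between each direction and its reverse
# (one entry per unordered pair).
imbalance = {
    (a, b): abs(acc[(a, b)] - acc[(b, a)])
    for (a, b) in acc if a < b
}

# Path Balance Score (assumed form): macro-average accuracy penalized
# by the standard deviation across reasoning paths.
path_acc = [0.70, 0.64, 0.41]  # e.g. S-I-T, T-I-S, I-T-S paths
pbs = statistics.mean(path_acc) - statistics.pstdev(path_acc)

print(round(competence, 3), round(disparity, 3), round(pbs, 3))
```

High competence with low disparity, low imbalance, and high PBS is the profile of genuinely modality-invariant reasoning; the toy numbers above instead show the audio-direction deficit typical of current OLMs.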
4. Empirical Findings and Reasoning Challenges
Recent empirical evaluations of SOTA OLMs using OmniBench-type benchmarks reveal persistent, systemic challenges (Wang et al., 16 Oct 2025, Chen et al., 2024, Chen et al., 21 Oct 2025, Kim et al., 22 Aug 2025, Li et al., 2024):
- Modality Disparity: Accuracy plummets (up to 49 points) when context or query is presented in audio vs. text; vision-induced deficits are smaller but significant (~15 points) (Wang et al., 16 Oct 2025).
- Directional Imbalance: Systematic performance differences for reversed-direction tasks (e.g., $A \to T$ vs. $T \to A$), implying that models do not truly operate on modality-invariant abstractions (Wang et al., 16 Oct 2025).
- Spatial/Temporal Weaknesses: Even top-tier models (Gemini 2.5 Pro) underperform (≤60%) in spatial and temporal tri-modal reasoning (Wang et al., 16 Oct 2025, Chen et al., 21 Oct 2025).
- Short-Board and Synergy Effects: The compositional law observed in MMAO-Bench demonstrates that joint tri-modal accuracy scales approximately as a power law in the product of per-modality accuracies, $\text{Acc}_{\text{omni}} \approx \left( \text{Acc}_V \cdot \text{Acc}_A \cdot \text{Acc}_T \right)^{\alpha}$, where a weak modality bottlenecks overall performance (short-board), while high-performing unimodal modules produce emergent synergy ($\alpha < 1$) in strong models (Chen et al., 21 Oct 2025).
- Multi-Hop/Path Sensitivity: Accuracy varies sharply with reasoning path and permutation order; I-T-S paths are notably harder than S-I-T or T-I-S, exposing nonuniform fusion behavior (Kim et al., 22 Aug 2025).
- Fusion Brittleness and Modality Conflict: Benchmarks such as OmniPlay explicitly demonstrate that crude fusion mechanisms make models vulnerable to conflicting signals across modalities, sometimes leading to the "less is more" paradox—removal of a modality can improve performance if fusion is not robust (Bie et al., 6 Aug 2025).
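The short-board effect can be made concrete under an assumed power-law form of the compositional law. The exponent and accuracies below are hypothetical, chosen only to show how a single weak modality caps joint performance:

```python
# Assumed compositional law: Acc_omni ~ (Acc_V * Acc_A * Acc_T) ** alpha.
def omni_accuracy(acc_v: float, acc_a: float, acc_t: float, alpha: float = 0.5) -> float:
    return (acc_v * acc_a * acc_t) ** alpha

balanced = omni_accuracy(0.8, 0.8, 0.8)        # all modalities comparable
short_board = omni_accuracy(0.95, 0.95, 0.4)   # one weak modality

print(round(balanced, 3), round(short_board, 3))
```

Despite two near-ceiling unimodal modules, the weak third channel drags the joint score below the fully balanced configuration, which is the short-board behavior the bullet describes.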
5. Architectural and Methodological Trends
Tri-modal OLMs typically implement fusion using extensions of early fusion (embedding concatenation), late fusion (per-modality predictions), or cross-attention blocks parameterized to interleave representations (Li et al., 2024, Bie et al., 6 Aug 2025). Empirical studies indicate current architectures are insufficient for robust, modality-invariant reasoning, especially as fusion layers often fail to resolve ambiguity or propagate uncertainty across channel boundaries.
Instruction-tuning with multi-modal data and chain-of-thought rationales, modality-balanced pre-training, and the explicit use of contrastive objectives or adversarial "modality dropout" are proposed to encourage true cross-modal alignment (Wang et al., 16 Oct 2025, Li et al., 2024). Synthetic augmentation pipelines, such as those in OmnixR, are also used for scalable tri-modal supervision (Chen et al., 2024).
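Of the regularizers mentioned above, modality dropout is the simplest to illustrate: during training, one modality's inputs are occasionally masked so the model cannot over-rely on a single channel. A minimal sketch (token values and names are placeholders, not any paper's implementation):

```python
import random

def modality_dropout(batch: dict, p: float = 0.3, rng=random) -> dict:
    """With probability p, return a copy of the batch with one randomly
    chosen modality's tokens zeroed out; otherwise return the batch as-is."""
    out = dict(batch)
    if rng.random() < p:
        dropped = rng.choice(list(out))
        out[dropped] = [0.0] * len(out[dropped])
    return out

random.seed(1)
batch = {"vision": [0.2, 0.5], "audio": [0.1, 0.9], "text": [0.7, 0.3]}
augmented = modality_dropout(batch, p=1.0)  # force a drop for illustration
print(augmented)
```

Paired with a standard training objective, this forces the model to answer from the surviving channels, encouraging redundant, cross-modal representations.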
OmniBench for virtual agents extends these concepts by grounding reasoning in interactive, graph-structured tasks (Vision–Language–Action graphs), evaluating the agent's ability to traverse non-linear task graphs while fusing visual state, textual instructions, and discrete actions (Bu et al., 10 Jun 2025).
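A non-linear task graph of this kind can be represented as subtasks with prerequisite edges; an agent trajectory is valid only if every subtask appears after all of its prerequisites. The subtask names below are hypothetical, purely to illustrate the structure:

```python
# Subtask -> list of prerequisite subtasks (a small DAG).
prereqs = {
    "open_app": [],
    "read_instruction": [],
    "locate_button": ["open_app"],
    "click_button": ["locate_button", "read_instruction"],
}

def trajectory_valid(trajectory: list) -> bool:
    """Check that each step's prerequisites were completed earlier."""
    seen = set()
    for step in trajectory:
        if any(p not in seen for p in prereqs[step]):
            return False
        seen.add(step)
    return True

print(trajectory_valid(["open_app", "read_instruction", "locate_button", "click_button"]))  # True
print(trajectory_valid(["read_instruction", "open_app", "locate_button", "click_button"]))  # True
print(trajectory_valid(["click_button", "open_app"]))  # False
```

Because multiple orderings satisfy the same graph, evaluation can score graph-consistency rather than matching a single gold action sequence, which is what makes the traversal "non-linear."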
6. Recommendations, Future Directions, and Open Problems
Systematic benchmarking reveals that OLMs, despite rapid progress, lag far behind humans in modality-invariant reasoning, robust multi-hop inference, and compositional generalization (Wang et al., 16 Oct 2025, Chen et al., 21 Oct 2025, Li et al., 2024). The following future directions and open problems are highlighted:
- Model Design: Development of per-layer, gated adapter blocks with learned fusion weights, and new cross-modal attention mechanisms sensitive to entity mapping and uncertainty (Li et al., 2024, Kim et al., 22 Aug 2025).
- Curriculum and Supervision: Curriculum-learning frameworks that incrementally integrate entity recognition, event alignment, and causal reasoning; expanded multi-step rationales and open-ended response supervision to force chain-of-thought processes (Chen et al., 21 Oct 2025).
- Balanced Benchmarking: Benchmarks must strictly enforce path balance, include sufficient synthetic and real tri-modal data, and analyze variance across all reasoning paths and modality pairings to avoid benchmarking artifacts (Kim et al., 22 Aug 2025, Wang et al., 16 Oct 2025).
- Agency and Interactivity: Inclusion of dynamic, long-horizon tasks with explicit synergy/conflict requirements to probe arbitration and planning capabilities under sensory uncertainty (Bie et al., 6 Aug 2025, Bu et al., 10 Jun 2025).
- Evaluation Metrics: Continued refinement of diagnostic metrics (e.g., Path Balance Score, directional imbalance) for fine-grained error localization and robust model ranking.
- Modality Expansion: Extension toward truly “omni-modal” benchmarks, integrating haptics, 3D spatial reasoning, and interactive dialogue, to match the sensory integration of biological intelligence (Chen et al., 2024, Li et al., 2024).
Research in tri-modal OLM benchmarking continues to surface both architectural weaknesses and diagnostic methodologies necessary for the evolution of robust, modality-agnostic artificial intelligence. The OmniBench suite in its various incarnations provides foundational resources and analytic frameworks for advancing the field.