
Omni-Bench: Unified AI Evaluation

Updated 4 June 2026
  • Omni-Bench is a collection of benchmarks that evaluate AI models across multiple modalities, highlighting limitations in integrated reasoning and performance.
  • It encompasses diverse domains including tri-modal reasoning, virtual agents, materials science, bioinformatics, and tabular modeling, each with unique evaluation paradigms.
  • The framework provides actionable diagnostic metrics and datasets to foster model development and expose challenges like modality collapse and error accumulation in complex tasks.

Omni-Bench refers to a set of independently developed large-scale benchmarks and methodologies unified only by their aim to advance holistic, multi-dimensional evaluation in artificial intelligence research. While the term "Omni-Bench" (and close variants) appears across multiple subdomains—including tri-modal language modeling, embodied agents, graph-based virtual agents, multimodal scientific reasoning, tabular modeling, world and vision modeling, and bioinformatics—each instantiation targets a distinct axis of broad, cross-cutting capability assessment. The sections below catalog the primary paradigms, technical designs, and scientific implications of key works titled or referenced as "Omni-Bench" across contemporary literature.

1. Tri-Modal Reasoning: OmniBench for Omni-LLMs

OmniBench is a benchmark designed to rigorously evaluate omni-LLMs (OLMs): models that jointly process image ($I$), audio ($A$), and language ($T$) inputs and return text outputs. OLMs must implement the function

$$f: \mathcal{I} \times \mathcal{A} \times \mathcal{T} \longrightarrow \mathcal{T}$$

Critically, each item in the benchmark requires truly integrated multimodal reasoning: neither image nor audio alone suffices to answer. The dataset comprises 1,142 multiple-choice QA pairs in categories spanning causal inference, temporal-spatial entity recognition, and abstract concept reasoning, sourced and QA'd via a multi-stage pipeline with both human and ablation-based validation to enforce true tri-modal dependence. Each sample is annotated with both image and audio rationales.

Evaluation is centered on accuracy; the random-guess baseline for four options is 25%. Among the open-source and proprietary OLMs evaluated, the best reaches 42.91% (Gemini-1.5-Pro), substantially below human performance (>90%). Open models display key failure modes: capacity scaling does not confer monotonic fusion improvements, most models collapse onto a single modality, and abstract reasoning remains unsolved (with sub-20% accuracy for Text/Symbol and Quantity tasks). To seed further progress, the OmniInstruct dataset supplies 93k+ tri-modal instruction-tuning samples, filtered to enforce integrated multimodal grounding. Proposed tri-modal integration strategies include supervised fine-tuning with modality dropout, modality-balanced sampling, and ablation-aware input regimes. Fundamental challenges remain: true joint fusion still eludes current architectures, and “modality collapse” and data diversity limitations persist (Li et al., 2024).
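The accuracy metric and the ablation-based check for tri-modal dependence can be illustrated with a minimal sketch. The item layout, the `drop` parameter, and the toy model below are assumptions made for illustration and are not taken from the OmniBench codebase; a model that truly needs both modalities should fall toward the random baseline when either one is removed.

```python
from typing import Callable, Optional, Sequence

# Hypothetical item layout: (image, audio, question, options, gold_index).
Item = tuple[Optional[str], Optional[str], str, Sequence[str], int]
Model = Callable[[Optional[str], Optional[str], str, Sequence[str]], int]


def accuracy(model: Model, items: Sequence[Item], drop: str = "") -> float:
    """Multiple-choice accuracy; `drop` removes one modality ("image" or "audio")."""
    correct = 0
    for image, audio, question, options, gold in items:
        if drop == "image":
            image = None
        elif drop == "audio":
            audio = None
        correct += int(model(image, audio, question, options) == gold)
    return correct / len(items)


def ablation_report(model: Model, items: Sequence[Item]) -> dict[str, float]:
    """Compare full-input accuracy with single-modality ablations."""
    n_options = len(items[0][3])
    return {
        "random_baseline": 1.0 / n_options,
        "full": accuracy(model, items),
        "no_image": accuracy(model, items, drop="image"),
        "no_audio": accuracy(model, items, drop="audio"),
    }


# Toy stand-in for an OLM: its answer depends on both inputs together.
def toy_model(image, audio, question, options):
    return (len(image or "") + len(audio or "")) % len(options)


toy_items: list[Item] = [
    ("ab", "c", "q1", ["A", "B", "C", "D"], 3),
    ("a", "bcd", "q2", ["A", "B", "C", "D"], 0),
]
report = ablation_report(toy_model, toy_items)
```

In this toy setup the model is correct only when both modalities are present, which is the behavior the benchmark's validation pipeline is described as enforcing for every item.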

2. Omni-Dimensionality: UmniBench and the Unified Evaluation of Multimodal Models

UmniBench introduces an omni-dimensional evaluation suite for unified multimodal models (UMMs), handling understanding, generation, and editing within a single pipeline. The evaluation loop comprises: (A) image generation from entity/spatial attribute prompts, (B) interaction/editing (causal, attribute-changing, counterintuitive edits), and (C) counterfactual scene replacement, with self-consistency checks at each stage.

The dataset spans 13 domains and >200 concepts, with each concept yielding three cases (original, interaction, counterfactual). Each case is probed using 9 prompt-derived questions that the UMM itself is tasked to answer about its own outputs (thus minimizing reliance on external rating models).

Scoring is based on total QA accuracy, per-stage scores, and correlation (PLCC/SRCC) with human ratings, which is found to be high ($\approx 0.8$). The framework also supports decoupled evaluation: single-ability models can be swapped in to isolate and diagnose specific failure sources (e.g., generation, editing, understanding). Benchmarked models exhibit a monotonic performance drop across sequential tasks (from $\approx 84\%$ in generation to $\approx 60\%$ in counterfactual reasoning), with counterfactual inference and complex domains (e.g., spatial) posing notable difficulty. The unified protocol exposes error accumulation and domain shift in long-horizon tasks, and points to extension to video, audio, and 3D inputs as critical next steps (Liu et al., 19 Dec 2025).
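For readers unfamiliar with the two correlation measures, the following is a dependency-free sketch of PLCC (Pearson linear correlation) and SRCC (Spearman rank correlation) between benchmark scores and human ratings. The function names are illustrative, and nothing here reproduces the UmniBench evaluation code.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """PLCC: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def average_ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """SRCC: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

PLCC rewards a linear relationship between automatic and human scores, while SRCC only requires that the two orderings of models agree, so reporting both separates calibration from ranking quality.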

3. Multidimensional Virtual Agent Evaluation: Task Graphs and Coverage Metrics

The OmniBench framework for virtual agent capabilities introduces a scalable, automated pipeline for synthesizing graph-structured multi-application tasks, addressing previous limitations in scenario diversity, manual annotation, and granularity of evaluation. Each task is a DAG, with five tunable complexity dimensions (dependency, instruction, knowledge, hierarchy, branch width) corresponding to compositional subtask arrangement and multi-app requirement.

Evaluation exploits OmniEval, comprising subtask-completion checks and two graph-based metrics:

  • Coverage Rate (CR): Weighted sum of completed nodes along the task-DAG, emphasizing depth.
  • Logical Consistency (LC): Fraction of agent execution steps that maximally “stick within” the same application, relative to all valid topological sorts.
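One possible reading of these two metrics is sketched below. The exact OmniEval formulas are not given here, so the sketch makes two explicit assumptions: CR weights each node by its depth in the DAG, and LC is the number of consecutive same-application steps in the agent's order divided by the best such count over all valid topological orders. The task, node names, and helper functions are hypothetical.

```python
from functools import lru_cache

# Hypothetical task: node -> list of prerequisite nodes, plus each node's app.
dag = {
    "open_mail": [],
    "read_invoice": ["open_mail"],
    "open_sheet": [],
    "enter_total": ["read_invoice", "open_sheet"],
}
app = {"open_mail": "mail", "read_invoice": "mail",
       "open_sheet": "sheet", "enter_total": "sheet"}


def depths(dag):
    """Depth = length of the longest prerequisite chain ending at the node."""
    @lru_cache(maxsize=None)
    def depth(node):
        return 1 + max((depth(p) for p in dag[node]), default=0)
    return {node: depth(node) for node in dag}


def coverage_rate(dag, completed):
    """Assumed CR: depth-weighted share of completed nodes."""
    d = depths(dag)
    return sum(d[node] for node in completed) / sum(d.values())


def topological_orders(dag):
    """Enumerate every valid execution order (only feasible for small DAGs)."""
    def extend(prefix, done):
        if len(prefix) == len(dag):
            yield list(prefix)
            return
        for node in dag:
            if node not in done and all(p in done for p in dag[node]):
                prefix.append(node)
                done.add(node)
                yield from extend(prefix, done)
                prefix.pop()
                done.remove(node)
    yield from extend([], set())


def same_app_transitions(order, app):
    """Count consecutive steps that stay in the same application."""
    return sum(app[a] == app[b] for a, b in zip(order, order[1:]))


def logical_consistency(dag, app, agent_order):
    """Assumed LC: agent's same-app transitions relative to the best valid order."""
    best = max(same_app_transitions(o, app) for o in topological_orders(dag))
    return same_app_transitions(agent_order, app) / best if best else 1.0
```

Under these assumptions, an agent that finishes the mail subtasks before touching the spreadsheet scores the maximum LC, while one that alternates between applications is penalized even though its order is still valid.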

Additionally, OmniBench tracks performance on ten fine-grained capabilities (e.g., Parallel Planning, Long-Range Planning, Cross-Domain Knowledge), constructed by varying complexity constraints. The dataset contains >36k compositional tasks across 20 scenarios. State-of-the-art LLM-based and GUI agents reach 19–39% CR (GPT-4o is highest at 38.7%), far below the human upper bound (80.1%). Graph-structured tasks are notably harder than chain-based (linear) ones (20.5% CR vs. 48.8% for GPT-4o). Task “intent” extraction and prompt reordering are critical for coherent planning, while performance decays with task complexity. This framework enables the first controlled, large-scale, compositional, and multi-metric assessment of generalist software agents (Bu et al., 10 Jun 2025).

4. Large-Scale Scientific Reasoning: OmniMatBench in Materials Science

OmniMatBench targets end-to-end reasoning in materials science across 19 subfields, synthesizing 3,171 expert-constructed problems (split into open-ended QA and structured calculation problems). The benchmark spans text, images, tables, and symbolic formulae, with outputs scored over multiple answer slots and requiring both scientific fluency and precise computation.

Evaluation uses macro-averaged F₁ for QA and per-slot accuracy for CAL; the composite overall score averages these. Closed-source MLLMs achieve up to 0.372 overall (Claude Opus 4.7), with open-source models lagging by >0.10 points. Model limitations include over-reliance on familiar formulas (“fixed heuristics”), uneven knowledge in specialized engineering protocols, and weak variable grounding from multimodal cues. Even with oracle formulas, models seldom exceed 40% calculation accuracy, highlighting persistent deficiencies in both reasoning and execution.
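The scoring scheme can be sketched as follows, under assumptions that go beyond the text: QA answers are treated as discrete labels so that a standard class-wise macro-F₁ applies, calculation answers are dictionaries of named numeric slots compared with a relative tolerance, and the overall score is the plain mean of the two. The slot names and tolerance are hypothetical.

```python
import math


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def slot_accuracy(gold_slots, pred_slots, rel_tol=1e-2):
    """Share of answer slots whose predicted value matches within rel_tol."""
    total = correct = 0
    for gold, pred in zip(gold_slots, pred_slots):
        for name, value in gold.items():
            total += 1
            guess = pred.get(name)  # a missing slot counts as wrong
            if guess is not None and math.isclose(guess, value, rel_tol=rel_tol):
                correct += 1
    return correct / total


def overall(qa_f1, cal_accuracy):
    """Composite score: mean of the QA and calculation scores."""
    return (qa_f1 + cal_accuracy) / 2
```

Scoring per slot rather than per problem gives partial credit when a model derives some quantities correctly but omits or miscomputes others, which matches the multi-slot output format described above.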

A key insight is the necessity of coupling domain ontologies, formula libraries, fine-tuned multimodal grounding, and robust output enforcement for practical AI support in scientific domains (Liu et al., 28 May 2026).

5. Embodied Intelligence: Cross-Skill and Cross-Embodiment Navigation Benchmarking

OmniNavBench is explicitly designed for embodied navigation, focusing on cross-skill sequencing (composite missions combining up to six navigation primitives) and cross-morphology generalization (identical tasks executed by wheeled, quadrupedal, and humanoid robots). The environment comprises >1,700 human-telemetered trajectories and 7,116 natural language instructions spanning 170 environments (synthetic and real scans).

Evaluation employs a suite of metrics encompassing not just end-to-end success and SPL, but sub-goal completion, social compliance (Social Intrusion Index), human-follow fidelity, and embodied QA accuracy. Baseline algorithms (<9% end-to-end SR on hardest morphology; SGC up to 44% but end-to-end near zero) reveal a substantial gulf between current models and real-world task requirements—primarily in compositional planning, social navigation, and termination skills. Morphology sensitivity and distribution shift between environments remain critical open problems (Sun et al., 10 May 2026).
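The gap between sub-goal completion and end-to-end success is easier to see with the metrics written out. The sketch below uses the standard definition of SPL (success weighted by the ratio of shortest-path length to the longer of agent and shortest path) and a simple per-episode average for SGC; the `Episode` record and its field names are assumptions, not the benchmark's data format.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    success: bool          # did the whole mission succeed?
    shortest_path: float   # geodesic length of the optimal route
    agent_path: float      # length of the route the agent actually took
    subgoals_done: int
    subgoals_total: int


def success_rate(episodes: Sequence[Episode]) -> float:
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by path length: failures score 0, detours are discounted."""
    return sum(
        e.success * e.shortest_path / max(e.agent_path, e.shortest_path)
        for e in episodes
    ) / len(episodes)


def subgoal_completion(episodes: Sequence[Episode]) -> float:
    """Mean fraction of sub-goals completed per episode."""
    return sum(e.subgoals_done / e.subgoals_total for e in episodes) / len(episodes)


episodes = [
    Episode(success=True, shortest_path=10.0, agent_path=20.0,
            subgoals_done=4, subgoals_total=4),
    Episode(success=False, shortest_path=8.0, agent_path=8.0,
            subgoals_done=2, subgoals_total=4),
]
```

Because a mission counts as successful only if every stage completes, an agent can accumulate substantial SGC while its SR and SPL stay near zero, which is the pattern reported for the baselines.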

6. OmniBench for Tabular Modeling: Large-Scale Empirical Comparison and Metafeature Analysis

OmniTabBench systematizes empirical evaluation for tabular data modeling, comprising 3,030 deduplicated and filtered datasets mapped to diverse application domains. Candidate models span gradient-boosted decision trees (GBDTs), various neural network (NN) architectures, and transformer-based foundation models (TFMs; notably TabPFN).

Key results demonstrate no universally dominant model class: TFM (TabPFN) is optimal for small, regular datasets; NNs dominate on large and highly categorical settings; GBDTs excel under high skewness/kurtosis. Coverage of >1,000 datasets is shown essential for stable rank ordering. Rather than aggregate metafeature “irregularity scores,” OmniTabBench quantifies model-performance as a function of individual metafeatures (n, p, skew, kurtosis, AA0). Actionable recommendations target model selection by regime, while highlighting persistent gaps in TFM scalability and NN hyperparameterization (Jiang et al., 8 Apr 2026).
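As a rough illustration of the per-dataset metafeatures involved, the sketch below computes n, p, skewness, and excess kurtosis for a purely numeric table using population moments. Aggregating across columns with a maximum is an assumption made here for brevity; the benchmark's own aggregation is not specified in this summary.

```python
def central_moments(column):
    """Second, third, and fourth central moments (population form)."""
    n = len(column)
    mean = sum(column) / n
    m2 = sum((v - mean) ** 2 for v in column) / n
    m3 = sum((v - mean) ** 3 for v in column) / n
    m4 = sum((v - mean) ** 4 for v in column) / n
    return m2, m3, m4


def skewness(column):
    m2, m3, _ = central_moments(column)
    return m3 / m2 ** 1.5 if m2 else 0.0


def excess_kurtosis(column):
    m2, _, m4 = central_moments(column)
    return m4 / m2 ** 2 - 3.0 if m2 else 0.0


def metafeatures(rows):
    """Summarize a numeric table given as a list of equal-length rows."""
    columns = list(zip(*rows))
    return {
        "n": len(rows),        # number of samples
        "p": len(columns),     # number of features
        # Assumption: report the most extreme column for each statistic.
        "skew": max(abs(skewness(c)) for c in columns),
        "kurtosis": max(excess_kurtosis(c) for c in columns),
    }
```

Keeping each metafeature as its own axis, rather than folding them into one score, is what allows performance to be plotted against a single property such as skewness while the others vary.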

7. Modular, Open Bioinformatics Benchmarking: Omnibenchmark System

Omnibenchmark (alpha) facilitates formalized, scalable benchmarking in bioinformatics via a YAML-based specification language for datasets, methods, metrics, and execution environments. Workflows are dynamically generated in Snakemake, and all steps—including environment setup (EasyBuild, Apptainer, conda), metric computation, storage, and results versioning—are orchestrated for point-and-click reproducibility and open (community) sharing via S3-compatible object storage.
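The general idea of generating a workflow from a declarative specification can be sketched in a few lines. The dictionary below is a hypothetical stand-in and does not follow the Omnibenchmark YAML schema; the dataset, method, and metric names are placeholders, and the output is a plain list of jobs rather than Snakemake rules.

```python
from itertools import product

# Hypothetical specification; NOT the Omnibenchmark YAML schema.
spec = {
    "datasets": ["d1", "d2"],
    "methods": ["m1", "m2"],
    "metrics": ["ari"],
}


def expand(spec):
    """Expand a spec into (output, inputs) jobs: one run per dataset/method
    pair, then one metric job per run."""
    jobs = []
    for dataset, method in product(spec["datasets"], spec["methods"]):
        run = f"{dataset}/{method}/result"
        jobs.append((run, [dataset]))
        for metric in spec["metrics"]:
            jobs.append((f"{run}.{metric}", [run]))
    return jobs


jobs = expand(spec)
```

Each job names the outputs it depends on, which is the information a workflow engine needs in order to schedule steps and rerun only what has changed.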

The system supports both solo and collaborative modes, abstracting platform idiosyncrasies (Linux, macOS), and integrating semantic versioning of benchmark “releases.” By design, it incentivizes open, FAIR-compliant practices, though limitations remain in format validation, ARM64 support, and live queuing integration (Mallona et al., 2024).


Summary Table: Notable "Omni-Bench" Instantiations

| Subdomain | Scope/Target | Evaluation Paradigm |
|---|---|---|
| Tri-modal LMs | (I, A, T) → T (joint reasoning) | QA accuracy, ablation, tri-modal tuning |
| Multimodal models | Gen/Understand/Edit, peri-image | QA over Gen→Edit→CF chain, self-marking |
| Virtual agents | OS/web/app, DAG-structured tasks | Subtask CR, LC, 10 capability axes |
| Materials science | QA + Calculation (text/image/table) | Slot-based F₁, Acc, expert rubric |
| Navigation agents | Multi-skill, multi-embodiment | SR, SPL, SGC, social/human QA |
| Tabular ML | Model family selection | Large-scale F₁/R², metafeature analysis |
| Bioinformatics | Workflow formalization & execution | YAML→Snakemake, container aware |

Omni-Bench in each manifestation aims to close a fundamental gap in multimodal, compositional, or generalist evaluation: enforcing end-to-end integration of diverse knowledge forms, reasoning abilities, and modalities, with domain-calibrated diagnostics to inform the development of next-generation AI models. No singular model or system has yet demonstrated human-comparable performance across these omni-dimensional benchmarks, illustrating the scope of remaining challenges.
