
PAI-Bench: Unified Physical AI Evaluation

Updated 8 December 2025
  • PAI-Bench is a unified, extensible benchmark that evaluates AI’s abilities in physical perception, video understanding, and simulation using both real and synthetic data.
  • It integrates multiple evaluation tracks, including unconditional and conditional video generation and specialized tasks, to rigorously test physical common sense and control fidelity.
  • The framework employs a modular approach with monolithic and ensemble task breakdowns, enabling detailed performance profiling and driving advancements in physically grounded AI.

Physical AI Bench (PAI-Bench) serves as a unified, extensible, and comprehensive evaluation protocol for assessing the capabilities of artificial intelligence systems in perceiving and predicting physical phenomena from real-world and synthetic data. Its primary goal is to systematically diagnose and advance the limits of contemporary multi-modal LLMs (MLLMs), video generative models (VGMs), and data-driven simulators with respect to physical common sense, video understanding, control, and dynamics simulation. PAI-Bench encompasses both monolithic end-to-end benchmarks using real video and ensembles of specialized tasks covering fundamental and diverse aspects of physical reasoning, control, and simulation (Zhou et al., 1 Dec 2025, Melnik et al., 2023, Otness et al., 2021).

1. Motivations and Foundations

The impetus behind PAI-Bench is to bridge critical gaps in existing evaluation schemes for physical AI. Physical autonomy requires two core competencies: (1) Perception—interpreting video or sensor streams to infer physically meaningful facts such as object positions and their interactions, and (2) Prediction—forecasting future states of the world under physical constraints. Existing benchmarks are largely fragmented, focusing on either abstract symbolic reasoning, isolated frame-level perception, or generative video quality. They lack a unified suite that (a) leverages real-world videos, (b) spans a range of settings including unconditional generation, conditional generation under control, and high-level video reasoning, and (c) employs metrics aligned with physical plausibility and domain reasoning.

PAI-Bench is designed to resolve these deficits by providing a single resource that rigorously probes visual fidelity, prediction under constraint, and the underlying reasoning abilities of models in physical domains (Zhou et al., 1 Dec 2025).

2. Structure and Task Breakdown

PAI-Bench exists in two major forms. The first is the monolithic PAI-Bench (2025), which comprises three integrated tracks built on 2,808 cases derived from real open-source and web video benchmarks:

  • PAI-Bench-G (Generative): Unconditional video generation from textual prompts spanning six domains (Autonomous Vehicles, Robotics, Industry, Human Activities, Physical Common Sense, Intrinsic Physics) using 1,044 video–prompt pairs and 5,636 corresponding QA pairs for domain-based evaluation.
  • PAI-Bench-C (Conditional): Conditional video generation, where models receive reference videos plus control signals (blur, edges, depth, segmentation) and must reconstruct or diversify videos while satisfying physical and semantic constraints, sampled across robotics, driving, and egocentric activity domains (600 videos).
  • PAI-Bench-U (Understanding): Video-understanding tasks with 1,164 annotated videos and 1,214 QA pairs, further split between physical common sense (space, time, physical world) and embodied reasoning (task verification, next-action prediction).

Another instantiation of PAI-Bench (2024) operates as an ensemble of 16 specialized benchmarks for physical reasoning (Melnik et al., 2023). Each task tests distinct facets, such as one-shot action (PHYRE), tool selection (Virtual Tools), passive conceptual separation (Physical Bongard Problems), video-based QA (CRAFT, CLEVRER), forward prediction (SPACE, CoPhy), stability detection (ShapeStacks), plausibility classification (IntPhys), language-based imitation (Language Table), and more, as detailed in the following table:

| Benchmark | Task Type | Core Challenge |
|---|---|---|
| PHYRE | Single Interaction | Sample-efficient 2D physics |
| Virtual Tools | Single Interaction | Tool use, action placement |
| Phy-Q | Continued Action | Multi-step Angry Birds-style |
| ShapeStacks | Binary Classification | 3D stability detection |
| CRAFT | Video QA | Counterfactual, descriptive QA |
| CLEVRER | Video QA | Causal, predictive questions |
| CoPhy | Counterfactual Forecast | 3D pose, intervention |

The ensemble approach promotes modularity and model profiling via skill-level vectors, with normalized performance across clusters defined by interaction, concept recognition, world modeling, and language (Melnik et al., 2023).
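
A minimal sketch of how such a skill-level vector and cluster-averaged scores might be assembled is given below; the per-benchmark scores and the cluster assignment in the snippet are illustrative placeholders, not the official values or grouping from (Melnik et al., 2023).

```python
import numpy as np

# Hypothetical, already-normalized per-benchmark scores in [0, 1].
# Both the scores and the cluster assignment below are placeholders.
scores = {
    "PHYRE": 0.62, "Virtual Tools": 0.55, "Phy-Q": 0.41,
    "ShapeStacks": 0.78, "CRAFT": 0.66, "CLEVRER": 0.71, "CoPhy": 0.48,
    # ... the remaining sub-benchmarks would complete the 16 entries
}

clusters = {
    "interaction": ["PHYRE", "Virtual Tools", "Phy-Q"],
    "concept_recognition": ["ShapeStacks"],
    "world_modeling": ["CoPhy"],
    "language": ["CRAFT", "CLEVRER"],
}

# Skill-level vector S (in R^16 once all sub-benchmarks are filled in).
S = np.array(list(scores.values()))

# Cluster-averaged scores S^(cluster): mean over the member benchmarks.
S_cluster = {
    name: float(np.mean([scores[b] for b in members]))
    for name, members in clusters.items()
}

# An overall capability score can be reported as a (weighted) mean over clusters.
overall = float(np.mean(list(S_cluster.values())))
print("Skill vector:", S)
print("Cluster scores:", S_cluster, "Overall:", overall)
```

The same aggregation also underlies the reporting protocol recommended in Section 7, where cluster-specific scores are combined into a single physical-reasoning capacity figure.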

3. Metrics and Evaluation Protocols

PAI-Bench employs task-aligned, physically grounded metrics that contrast sharply with conventional aesthetic or frame-level accuracy scores. Key evaluation axes include:

  • Visual Quality (PAI-Bench-G/C): Metrics such as subject and background consistency (DINO and CLIP features), motion smoothness, aesthetic (LAION) and imaging (MUSIQ) quality, and video–text alignment (ViCLIP).
  • Physical Plausibility (Domain Score): MLLM-based correctness on curated QA pairs representing spatial relations, causal chains, and object properties. The domain score is the average accuracy of an advanced MLLM (Qwen3-VL-235B) on these QA sets, partitioned by domain.
  • Control Fidelity (PAI-Bench-C): Control signal reproduction quantified by blur SSIM, edge F1, depth si-RMSE, and mask mIoU, plus aggregate alignment (see the sketch after this list).
  • Video Understanding (PAI-Bench-U): Per-category accuracy for common sense (space, time, physical world) and embodied reasoning sub-domains (e.g., BridgeData, AgiBot).
  • Diversity and Robustness: E.g., LPIPS for output diversity, bias checks via performance with zero/one/32 input frames.
  • Stability, Accuracy, Efficiency (Simulation): For simulation-oriented tasks, metrics include rollout stability (finite norm), trajectory MSE, normalized L2 error, energy drift, runtime ratio, and effective integrator step scaling (Otness et al., 2021).
  • Skill-Level Vectors (Ensemble): A normalized vector $S \in \mathbb{R}^{16}$ over all sub-benchmarks, and cluster-averaged scores $S^{(\mathrm{cluster})}$, for comprehensive capability profiling.
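
As a concrete illustration of the control-fidelity axis, the sketch below computes a scale-invariant depth RMSE and a segmentation-mask mIoU between a generated frame and its reference; the exact formula variants, per-video aggregation, and thresholds used by PAI-Bench-C may differ, and the data here are synthetic.

```python
import numpy as np

def si_rmse(pred_depth, gt_depth, eps=1e-8):
    """Scale-invariant RMSE in log-depth space (one common variant;
    the exact definition used by PAI-Bench-C may differ)."""
    d = np.log(pred_depth + eps) - np.log(gt_depth + eps)
    var = max(float(np.mean(d ** 2) - np.mean(d) ** 2), 0.0)
    return float(np.sqrt(var))

def mask_miou(pred_mask, gt_mask, num_classes):
    """Mean intersection-over-union over segmentation classes."""
    ious = []
    for c in range(num_classes):
        p, g = pred_mask == c, gt_mask == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both masks
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0

# Toy per-frame example (in practice scores would be averaged over a video).
rng = np.random.default_rng(0)
gt_depth = rng.uniform(1.0, 10.0, (64, 64))
pred_depth = 1.5 * gt_depth                       # pure rescaling -> si-RMSE ~ 0
gt_mask = rng.integers(0, 3, (64, 64))
print("si-RMSE:", si_rmse(pred_depth, gt_depth))
print("mIoU:", mask_miou(gt_mask, gt_mask, num_classes=3))  # identical masks -> 1.0
```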

4. Baseline Models and Observed Limitations

A spectrum of proprietary and open-source VGMs and MLLMs have been systematically evaluated:

  • Video Generation: 15 VGMs (e.g., DeepMind Veo3, Cosmos-Predict2.5-2B, Wan2.2-I2V-A14B) achieve Quality Scores in the 73.7–78.0 range (top open model 78.0), but Domain Scores for realistic physical reasoning lag by ∼10 points (top open 87.1 vs. ground-truth 89.8), implying that high visual fidelity does not entail physical plausibility (Zhou et al., 1 Dec 2025).
  • Conditional Generation: Cosmos-Transfer and Wan2.2-series models show high blur fidelity (SSIM = 0.91) and moderate edge alignment (F1 ≈ 0.45), depth, and mask scores, with multi-signal control yielding the best overall results. However, segmentation-mask control and edge consistency remain recurring failure modes.
  • Video Understanding: Top MLLMs score 61.8% for GPT-5 (minimal “thinking”), 69.8% for GPT-5 (medium), and 64.7% for Qwen3-VL-235B, all markedly below the human ceiling (93.2%). Time-based queries perform best, while spatial and embodied reasoning lag, especially on challenging tasks (e.g., BridgeData: ~35%).
  • Simulation Baselines: PAI-Bench for simulation (Otness et al., 2021) incorporates classical integrators (Forward Euler, Leapfrog, RK4, Backward Euler) and machine learning predictors (KNN, kernel regression, MLP, CNN, U-Net), benchmarked on ODE and PDE systems (1D/2D oscillators, Navier-Stokes), with separate stability and accuracy reporting.
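
To make the simulation-track metrics concrete, the following sketch integrates a 1D harmonic oscillator with a leapfrog scheme and reports rollout stability, trajectory MSE against the analytic solution, and relative energy drift. It is an illustrative analogue of the reporting described above, not the benchmark's reference implementation (Otness et al., 2021).

```python
import numpy as np

def leapfrog_oscillator(q0, p0, k=1.0, m=1.0, dt=0.05, steps=2000):
    """Leapfrog (kick-drift-kick) integration of a 1D harmonic oscillator."""
    q, p = q0, p0
    qs, ps = [q], [p]
    for _ in range(steps):
        p -= 0.5 * dt * k * q        # half kick
        q += dt * p / m              # drift
        p -= 0.5 * dt * k * q        # half kick
        qs.append(q)
        ps.append(p)
    return np.array(qs), np.array(ps)

q0, p0, k, m, dt, steps = 1.0, 0.0, 1.0, 1.0, 0.05, 2000
qs, ps = leapfrog_oscillator(q0, p0, k, m, dt, steps)

# Analytic reference trajectory q(t) = q0 * cos(omega * t).
t = dt * np.arange(steps + 1)
omega = np.sqrt(k / m)
q_ref = q0 * np.cos(omega * t)

traj_mse = float(np.mean((qs - q_ref) ** 2))          # trajectory MSE
energy = 0.5 * ps ** 2 / m + 0.5 * k * qs ** 2        # total energy per step
energy_drift = float(np.max(np.abs(energy - energy[0])) / energy[0])
stable = bool(np.isfinite(qs).all() and np.isfinite(ps).all())  # rollout stability

print(f"stable={stable}  trajectory MSE={traj_mse:.2e}  relative energy drift={energy_drift:.2e}")
```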

Failures are frequently attributable to violations of physical constraints (e.g., impossible collisions, discontinuous trajectories), next-action prediction under novel contexts, and the exploitation of language priors in lieu of genuine perception or prediction (Zhou et al., 1 Dec 2025, Melnik et al., 2023).

5. Insights and Comparison with Prior Work

PAI-Bench departs from prior single-focus and holistic benchmarks by providing both realistic, rich video tracks and a systematic decomposition into orthogonal components of physical reasoning. Compared to earlier physics simulation benchmarks (e.g., four-canonical-system PAI‐Bench (Otness et al., 2021)), PAI-Bench (2025) and the ensemble PAI-Bench (2024) emphasize:

  • Unified, real-world video datasets and complex, multi-signal conditional control.
  • Meticulous, ontology-driven QA annotation and domain-specific evaluation.
  • Modularity and diagnostic skill-level vectors, allowing for curriculum-based evaluation and model profiling across interaction, concept, world modeling, and language clusters.
  • Standardized access, reproducibility, and extensibility via open-source code and protocols.

The ensemble methodology aids in dissecting the “why” of model failures—something single, holistic tasks often obscure due to conflated challenges of perception, action, planning, and language (Melnik et al., 2023).

6. Directions for Advancement

Key recommendations for advancing physical AI using PAI-Bench include:

  • Data Curation: Acquisition of large, physically annotated datasets (forces, velocities, collisions) to internalize genuine physics in generative and reasoning models; augmentation of embodied corpora with hierarchically structured, multi-agent, and multi-modal interactions.
  • Modeling Innovation: Integration of hybrid world models fusing differentiable physics engines with generative architectures; introduction of explicit “visual thought” modules that interleave visual cognition and simulation; development of multi-modal chain-of-thought processes for deep visual–textual grounding.
  • Metric Enhancement: Deployment of stronger, next-generation semantic judges for video–text consistency; incorporation of trajectory- and collision-based physical metrics (e.g., velocity error $E_{\mathrm{vel}} = \frac{1}{T}\sum_{t=1}^T \|v_t - \hat v_t\|^2$; see the sketch after this list).
  • Interactive Extensions: Addition of online, real-time forecasting and closed-loop embodied tasks with physical feedback, moving toward real-world robotics and autonomous systems benchmarks.
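
A minimal sketch of the velocity-error metric from the list above follows, assuming velocities are obtained by finite differences of per-frame object positions; the position-extraction step and any normalization are left abstract and are assumptions of this illustration.

```python
import numpy as np

def velocity_error(pred_pos, gt_pos, dt=1.0):
    """E_vel = (1/T) * sum_t ||v_t - v_hat_t||^2, with velocities taken as
    finite differences of (T+1, D) position trajectories."""
    v_pred = np.diff(pred_pos, axis=0) / dt   # (T, D) predicted velocities
    v_gt = np.diff(gt_pos, axis=0) / dt       # (T, D) reference velocities
    return float(np.mean(np.sum((v_gt - v_pred) ** 2, axis=1)))

# Toy trajectories: an object moving along x at slightly different speeds.
t = np.arange(11)[:, None]                    # 11 frames -> 10 velocity samples
gt = np.hstack([1.0 * t, np.zeros_like(t, dtype=float)])
pred = np.hstack([1.1 * t, np.zeros_like(t, dtype=float)])
print("E_vel:", velocity_error(pred, gt))     # (0.1)^2 = 0.01
```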

These trajectories collectively underscore that, while visual and linguistic generative abilities have advanced, true physically grounded intelligence—especially in reasoning, forecasting, and action—remains a central open problem (Zhou et al., 1 Dec 2025).

7. Significance and Usage Recommendations

PAI-Bench establishes itself as both a diagnostic instrument for mapping the current frontiers of physical AI and a platform for curriculum-based, modular, and extensible evaluation. It supports capability discovery at both the system and subskill level, enabling researchers to target “concept recognition,” “future prediction,” or “language understanding” in isolation before advancing to integrated generalist agents.

Usage recommendations include: selecting appropriate capability clusters, training and evaluating on each subtask, normalizing and aggregating performance using the skill-level vector protocol, and reporting overall physical reasoning capacity as a (weighted) sum of cluster-specific scores (Melnik et al., 2023).

Overall, PAI-Bench provides a rigorous, transparent, and physically salient foundation for benchmark-driven innovation in physical AI, highlighting significant performance gaps and informing the next generation of physically competent and robust AI systems (Zhou et al., 1 Dec 2025, Melnik et al., 2023, Otness et al., 2021).
