Monolithic PAI-Bench Evaluation
- Monolithic PAI-Bench is a unified framework that evaluates AI systems on video generation, conditional generation, and understanding using real-world physical dynamics.
- It employs a tripartite evaluation pipeline with curated video cases and task-specific metrics to compare visual fidelity, temporal coherence, and physical plausibility.
- Its analysis reveals a consistent gap between perceptual realism and physically grounded intelligence, highlighting the need for physics-informed AI architectures.
Monolithic PAI-Bench is a unified evaluation framework designed to rigorously assess the perceptual, generative, and reasoning abilities of AI systems in the domain of real-world physical dynamics. It establishes a single benchmark pipeline under which video generation, conditional video generation, and video understanding are evaluated using a shared pool of real-world cases and task-aligned metrics targeting the operational demands of Physical AI (Zhou et al., 1 Dec 2025). This architecture enables system-level characterization of the gap between visual fidelity and physically grounded intelligence.
1. Unified Design and Scope
PAI-Bench employs a tripartite evaluation structure—Generation (PAI-Bench-G), Conditional Generation (PAI-Bench-C), and Understanding (PAI-Bench-U)—built from the same curated pool of 2,808 real-world video cases spanning autonomous vehicles, industry, robotics, ego-centric perspectives, human activities, and physical common sense. All evaluation tracks utilize non-synthetic, temporally complex video sources (8–32 frames, often longer), emphasizing true-to-life physical phenomena (collisions, object manipulation, fluid dynamics).
All benchmark cases are paired with physically relevant task prompts, control signals, or question–answer pairs. This unified pipeline ensures direct comparability across generative and recognition tasks, supported by a single “MLLM-as-judge” paradigm—using a multi-modal LLM for semantic and physical plausibility verification in both generative and question-answering settings.
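To make the shared "MLLM-as-judge" protocol concrete, the following is a minimal sketch of a QA-based plausibility scorer: sample frames from a video, pose each curated physical question to a multi-modal LLM, and report the fraction answered correctly. The `ask_mllm` interface and the `PhysicalQA` schema are hypothetical stand-ins for exposition, not PAI-Bench's released tooling.

```python
# Minimal sketch of the MLLM-as-judge scoring loop (hypothetical interfaces).
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class PhysicalQA:
    question: str         # e.g. "Does the cup keep moving after the hand releases it?"
    expected_answer: str  # ground-truth answer curated with the source case


def judge_video(
    frames: Sequence[bytes],
    qa_pairs: List[PhysicalQA],
    ask_mllm: Callable[[Sequence[bytes], str], str],
) -> float:
    """Return the fraction of physical QA pairs the MLLM judge answers correctly."""
    if not qa_pairs:
        return 0.0
    correct = sum(
        ask_mllm(frames, qa.question).strip().lower()
        == qa.expected_answer.strip().lower()
        for qa in qa_pairs
    )
    return correct / len(qa_pairs)
```

The same scorer can be applied to generated clips (for physical-plausibility scoring) and to source clips (as a reference upper bound), which is what makes the generative and question-answering tracks directly comparable.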
2. Task Tracks: Generation, Conditional Generation, Understanding
PAI-Bench decomposes Physical AI system evaluation into three coordinated tracks:
- PAI-Bench-G (Generation): Free video generation from natural language prompts describing diverse physical scenes or actions. Outputs are video clips evaluated for visual fidelity, temporal coherence, and physical plausibility.
- PAI-Bench-C (Conditional Generation): Conditioned video generation using one or more abstracted/masked signals (blur, edge, depth, segmentation) plus scene text. Models must synthesize high-fidelity, physically plausible video consistent with the control signals.
- PAI-Bench-U (Understanding): Video understanding via multiple-choice questions presented over sampled frames (up to 32). Questions probe both physical common sense (such as spatial, temporal, and physical world reasoning) and embodied reasoning (including action-effect prediction, task completion, and affordance understanding). This structure systematically covers both explicit physical knowledge and embodied cognitive inference.
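As a concrete illustration of how the three tracks package a case, the following schematic uses hypothetical field names assumed for exposition only; it is not the released data format.

```python
# Schematic case layouts for the three PAI-Bench tracks (hypothetical field names).
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GenerationCase:        # PAI-Bench-G: free generation from a text prompt
    prompt: str                                   # physical scene / action description
    qa_pairs: List[Dict[str, str]] = field(default_factory=list)  # used for the Domain Score


@dataclass
class ConditionalCase:       # PAI-Bench-C: generation under abstracted control signals
    caption: str                                  # scene text (1 original + 5 paraphrases per video)
    controls: Dict[str, str] = field(default_factory=dict)  # e.g. blur / edge / depth / segmentation streams


@dataclass
class UnderstandingCase:     # PAI-Bench-U: multiple-choice QA over sampled frames
    frames: List[str]                             # up to 32 sampled frame paths
    question: str
    options: List[str]
    answer_idx: int                               # index of the correct option
```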
3. Metrics and Evaluation Protocols
A comprehensive suite of task-specific and cross-task metrics provides quantitative differentiation between visual plausibility, physical fidelity, temporal reasoning, and semantic alignment.
Generation Metrics (PAI-Bench-G)
- Subject Consistency: Average cosine similarity of DINO frame features across the clip.
- Background Consistency: As above, using CLIP features.
- Motion Smoothness: L1 error between interpolated and true odd frames, reported as a normalized score.
- Aesthetic Quality: Mean normalized LAION score.
- Imaging Quality: Mean normalized MUSIQ score.
- Overall Consistency: ViCLIP-based video–text alignment.
- Image-to-Video Consistency: Single image feature compared across frames.
- Domain Score (Physical Plausibility): Accuracy of MLLM-generated answers to physical QA pairs on the output video.
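The consistency metrics and the Domain Score admit compact closed forms. The following is a plausible reconstruction from the definitions above, assuming per-frame DINO features $d_t$, an MLLM judge answer $\hat{a}_i$ to the $i$-th QA pair, and first-frame anchoring for subject consistency; the exact frame-pairing convention is an assumption, not stated here.

$$
S_{\mathrm{subj}} = \frac{1}{T-1}\sum_{t=2}^{T}\cos\!\big(d_{1},\, d_{t}\big),
\qquad
\mathrm{DS} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\!\left[\hat{a}_{i}=a_{i}\right],
$$

where $T$ is the number of frames, $N$ the number of physical QA pairs, and $a_i$ the ground-truth answer. Background Consistency replaces $d_t$ with CLIP features.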
Conditional Generation Metrics (PAI-Bench-C)
- Blur SSIM, Edge F1, Depth si-RMSE, Mask mIoU: Signal-specific similarity metrics.
- Quality Score: DOVER (no-reference video quality).
- Diversity: Average LPIPS distance between generated samples.
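Under the natural reading of "average LPIPS distance between samples," the diversity score averages pairwise perceptual distances over $K$ videos generated from the same conditioning; the pairwise-averaging convention below is an assumption.

$$
\mathrm{Div} = \frac{2}{K(K-1)}\sum_{1 \le i < j \le K}\mathrm{LPIPS}\big(x_{i},\, x_{j}\big).
$$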
Understanding Metrics (PAI-Bench-U)
- Classification Accuracy: Fraction of questions answered correctly within each QA group/category.
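As a worked form of the above, per-category accuracy is simply the fraction of correctly answered multiple-choice questions in that group:

$$
\mathrm{Acc}_{c} = \frac{1}{|\mathcal{Q}_{c}|}\sum_{q \in \mathcal{Q}_{c}}\mathbf{1}\!\left[\hat{a}_{q}=a_{q}\right],
$$

where $\mathcal{Q}_c$ is the set of questions in category $c$ and $\hat{a}_q$ the model's selected option.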
All tracks employ task-aligned scoring to emphasize physical correctness and real-world semantic fidelity over merely perceptual or aesthetic judgments.
4. Dataset Composition and Domain Coverage
PAI-Bench draws its 2,808 cases from over a dozen public sources (and select proprietary AV data), strictly avoiding synthetic or simulated content. The dataset is partitioned as:
| Track | Videos | Prompts/QAs | Domains Covered |
|---|---|---|---|
| G (Generation) | 1,044 | 5,636 QA pairs | Vehicles, industry, robotics, common sense |
| C (Conditional) | 600 | 1 orig + 5 paraphrased captions per video; 4 control streams | Robotics, driving, ego-centric |
| U (Understanding) | 1,027 | 1,214 QA pairs across 9 thematic groups | Space, Time, Physical, Embodied reasoning |
The dataset design ensures broad coverage of real-world physical phenomena (collisions, manipulation, fluid/rigid body dynamics, affordance), as well as temporal range and complexity necessary for robust generalization and transfer.
5. Experimental Findings and Model Performance
PAI-Bench results reveal sizable discrepancies between perceptual realism and grounded physical intelligence:
- PAI-Bench-G: State-of-the-art video generative models (VGMs), e.g., Wan2.2-I2V-A14B and Cosmos-Predict2.5-2B, achieve Quality Scores ≈ 78 (on par with real-source videos at 78.0) but Domain Scores of only ≈ 87, versus 93 for real data. Veo3 achieves DS = 86.8 and QS = 77.6. Typical failures involve physically implausible generative dynamics (incorrect trajectories, non-physical object interactions, gravity violations).
- PAI-Bench-C: Multi-signal conditioning with Cosmos-Transfer yields the highest Quality Score (9.24/10), outperforming single-signal runs. Segmentation fidelity (mask mIoU) is consistently the lowest among control signals, attributed to noise in the mask supervision. Wan-Fun variants achieve higher LPIPS diversity (≈ 0.53) at a modest cost in control fidelity.
- PAI-Bench-U: Proprietary MLLMs (GPT-5) reach 61.8% overall accuracy, open-source models (Qwen3-VL-235B) reach 64.7%, and the human baseline is 93.2%. Zero-frame ("text only") inputs revert to chance, and performance plateaus (≈ 65%) only with temporal context of 16–32 frames. Textual chain-of-thought degrades embodied reasoning accuracy, suggesting that unimodal reasoning is inadequate for physically grounded tasks.
6. Significance and Implications
By unifying generation, conditional generation, and physical reasoning in a monolithic, physically grounded benchmark, PAI-Bench provides the first reproducible suite for quantifying the extent to which current models harmonize visual, temporal, and physical correctness (Zhou et al., 1 Dec 2025). This framework exposes a consistent, quantifiable gap: models with high visual fidelity often fail to produce or recognize physically plausible outcomes. Leading MLLMs also lag in causal, temporal, and embodied reasoning about physical events depicted in real-world video.
A plausible implication is that progress in Physical AI will require architectures and training regimes capable of integrating long-range sequence modeling, explicit physics priors, and embodied reasoning—beyond improvements to mere generative realism or language-centric modalities. PAI-Bench is positioned as a challenge suite to accelerate and systematically measure gains toward these objectives.