EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving

Published 22 Apr 2026 in cs.CV, cs.CL, and cs.RO | (2604.22851v1)

Abstract: While Vision-Language Models (VLMs) have advanced high-level reasoning in autonomous driving, their ability to ground this reasoning in the underlying physics of ego-motion remains poorly understood. We introduce EgoDyn-Bench, a diagnostic benchmark for evaluating the semantic ego-motion understanding of vision-centric foundation models. By mapping continuous vehicle kinematics to discrete motion concepts via a deterministic oracle, we decouple a model's internal physical logic from its visual perception. Our large-scale empirical audit spanning 20+ models, including closed-source MLLMs, open-source VLMs across multiple scales, and specialized VLAs, identifies a significant Perception Bottleneck: while models exhibit logical physical concepts, they consistently fail to accurately align them with visual observations, frequently underperforming classical non-learned geometric baselines. This failure persists across model scales and domain-specific training, indicating a structural deficit in how current architectures couple visual perception with physical reasoning. We demonstrate that providing explicit trajectory encodings substantially restores physical consistency across all evaluated models, revealing a functional disentanglement between vision and language: ego-motion logic is derived almost exclusively from the language modality, while visual observations contribute negligible additional signal. This structural finding provides a standardized diagnostic framework and a practical pathway toward physically aligned embodied AI. Keywords: Ego-motion, Physical Reasoning, Foundation Models


Authors (8)

Summary

The paper introduces EgoDyn-Bench, a diagnostic VideoQA benchmark to assess physically consistent ego-motion reasoning using deterministic kinematic oracles.
The paper demonstrates that models given vision-only input underperform the same models supplemented with explicit trajectory encodings, exposing a perception bottleneck.
The paper provides a robust evaluation framework with novel metrics like WPCR and temporal accuracy to guide future model improvements.

EgoDyn-Bench: A Diagnostic Benchmark for Ego-Motion Understanding in Vision-Centric Foundation Models

Introduction and Motivation

Classical autonomous driving systems use explicit, model-based representations of ego-motion. Vision-centric paradigms built on Vision-Language Models (VLMs) and Multimodal LLMs (MLLMs) replace these representations, which raises the question of whether their high-level reasoning remains physically consistent. Recent work has advanced semantic and planning tasks, but there is no standardized assessment of whether these models can extract physically correct ego-motion concepts directly from video data. The paper addresses this gap by introducing EgoDyn-Bench, the first diagnostic benchmark to explicitly evaluate ego-motion understanding across contemporary vision-centric foundation models. Ego-motion understanding is operationalized as physically grounded semantic reasoning within real-world and simulated driving sequences.

Benchmark Design and Methodological Contributions

EgoDyn-Bench operationalizes ego-motion understanding as a semantic video question-answering (VideoQA) task. Given a sequence of visual observations and a natural language query about vehicle dynamics, a model must produce a semantic response that matches the physically derived ground truth. The key methodological features are:

Ground-Truth via Deterministic Oracle: Continuous vehicle kinematic states—including speed, acceleration, jerk, yaw rate, and heading—are mapped to semantic categories (e.g., turning, braking intensity, speed regime) using deterministic, physically calibrated rules. This oracle-based design ensures label objectivity and avoids annotation bias.
Self-Referential Task Formulation: Unlike prior benchmarks focused on object-centric or environmental reasoning, EgoDyn-Bench isolates self-referential physical understanding. It probes whether models can accurately label and temporally reason about their own motion in semantic terms, rather than measuring pure regression error or downstream planning success.
Physically Balanced, Multi-Domain Dataset: The dataset comprises 1,000 3-second clips (balanced between nuScenes real-world data and CARLA-based simulation with style-transferred photometric adjustments) and 14,000 QA pairs. Augmentation and greedy balancing ensure coverage across the entire spectrum of dynamic regimes, addressing the long-tail prevalence of low-dynamic driving in real-world collections.
Figure 1: EgoDyn-Bench maps continuous kinematic ego-states to semantic labels using a deterministic oracle, supporting rigorous video QA with WPCR metrics.
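To make the oracle idea concrete, the following is a minimal sketch of a deterministic rule-based mapping from a continuous ego-state to semantic labels. The class `EgoState`, the label names, the sign convention for yaw rate, and every threshold value are illustrative assumptions; the paper's actual calibration rules are given in its supplementary material and are not reproduced here.

```python
from dataclasses import dataclass

# All thresholds are hypothetical placeholders, NOT the paper's calibration.
TURN_YAW_RATE = 0.05   # rad/s; below this magnitude the motion counts as straight
BRAKE_MILD = -0.5      # m/s^2; at or below this the vehicle counts as braking
BRAKE_HARD = -2.5      # m/s^2; at or below this the braking counts as hard
SPEED_STOPPED = 0.3    # m/s
SPEED_SLOW = 8.0       # m/s


@dataclass(frozen=True)
class EgoState:
    speed: float     # m/s
    accel: float     # m/s^2, negative when decelerating
    yaw_rate: float  # rad/s, assumed positive for a left (counter-clockwise) turn


def label_turning(state: EgoState) -> str:
    if abs(state.yaw_rate) < TURN_YAW_RATE:
        return "straight"
    return "left" if state.yaw_rate > 0 else "right"


def label_braking(state: EgoState) -> str:
    if state.accel <= BRAKE_HARD:
        return "hard_braking"
    if state.accel <= BRAKE_MILD:
        return "mild_braking"
    return "no_braking"


def label_speed_regime(state: EgoState) -> str:
    if state.speed < SPEED_STOPPED:
        return "stopped"
    if state.speed < SPEED_SLOW:
        return "slow"
    return "fast"


def oracle(state: EgoState) -> dict:
    """Map one continuous ego-state to discrete labels; same input, same output."""
    return {
        "turning": label_turning(state),
        "braking": label_braking(state),
        "speed_regime": label_speed_regime(state),
    }
```

Because the mapping is a pure function of the kinematic state, two runs over the same clip always yield the same labels, which is the property the benchmark relies on to avoid annotator disagreement.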

Evaluation Metrics

EgoDyn-Bench introduces several evaluation metrics tailored for the diagnostic setting:

Balanced Accuracy (BAcc) and Macro-F1: For semantic correctness, explicitly mitigating class-imbalance bias.
Temporal Accuracy: For event ordering and comparative queries regarding dynamic changes.
Weighted Physics Consistency Rate (WPCR): A Boolean logic-based metric evaluating whether model answers jointly satisfy physically valid kinematic constraints within each sequence (e.g., "if the model claims the vehicle turned, it must not state 'straight' in concurrent queries"). PCov reports constraint activation coverage.
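Balanced accuracy and macro-F1 are standard metrics, and the sketch below shows why they matter when low-dynamic classes dominate. The function names and the toy labels are assumptions made for illustration; balanced accuracy is computed here as the mean per-class recall over classes present in the ground truth, and macro-F1 as the unweighted mean of per-class F1 scores.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall over the classes that occur in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(hits / support)
    return sum(recalls) / len(recalls)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    f1_scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example (made-up labels): a model that always answers "straight".
y_true = ["straight"] * 8 + ["left"] * 2
y_pred = ["straight"] * 10
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case plain accuracy is 0.8 even though the model never detects a turn, while balanced accuracy drops to 0.5, which is the class-imbalance effect these metrics are meant to expose.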
Figure 2: WPCR stability with threshold variation; Boolean implication-based physical consistency is invariant to threshold choice.

Figure 3: Model ranking stability under threshold perturbation; high Kendall $\tau$ demonstrates robustness of the “Perception Bottleneck” diagnosis.
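The WPCR and PCov metrics described above can be sketched as weighted Boolean implications over the answers a model gives for one clip. The exact formula is not stated in this summary, so the sketch assumes the following: each constraint is a (weight, premise, conclusion) triple, WPCR is the weight of satisfied constraints divided by the weight of activated constraints, and PCov is the fraction of (clip, constraint) pairs whose premise fires. The answer keys, labels, and weights are hypothetical.

```python
import math

# Each constraint: (weight, premise, conclusion), both predicates over one
# clip's answers. Keys and labels are hypothetical, not the benchmark's schema.
CONSTRAINTS = [
    # If the model claims a turn, it must not describe the path as straight.
    (2.0,
     lambda a: a.get("turning") in {"left", "right"},
     lambda a: a.get("path_shape") != "straight"),
    # If the model claims braking, it must not claim increasing speed.
    (1.0,
     lambda a: a.get("braking") in {"mild_braking", "hard_braking"},
     lambda a: a.get("speed_change") != "increasing"),
]


def wpcr_and_pcov(clips, constraints=CONSTRAINTS):
    """Return (WPCR, PCov) under the assumed definitions in the lead-in."""
    satisfied_weight = 0.0
    active_weight = 0.0
    active_count = 0
    total_count = 0
    for answers in clips:
        for weight, premise, conclusion in constraints:
            total_count += 1
            if not premise(answers):
                continue  # an inactive implication is not scored
            active_count += 1
            active_weight += weight
            if conclusion(answers):
                satisfied_weight += weight
    wpcr = satisfied_weight / active_weight if active_weight else math.nan
    pcov = active_count / total_count if total_count else 0.0
    return wpcr, pcov
```

Scoring only activated implications avoids rewarding a model for constraints that never applied, and reporting PCov alongside WPCR shows how much of the constraint set was actually exercised.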

Empirical Results: Diagnosing the Perception Bottleneck

Through an extensive audit spanning closed- and open-source VLMs (including GPT-5.1, Gemini, Qwen-VL, InternVL, Cosmos-Reason, RoboTron-Drive, DriveMM, and more) as well as multiple classical and learned visual odometry/flow-based baselines, EgoDyn-Bench establishes several critical empirical findings:

Vision-Only Input Fails for Physical Grounding

Across all VLMs and MLLMs, video-only input yields significant performance deficits in semantic ego-motion QA and physical consistency, with balanced accuracies often falling well below 50%. Notably, even the best large models lag behind classical geometric baselines such as KLT-based visual odometry, as well as learned baselines such as TartanVO and RAFT flow.
Scaling model size or incorporating in-domain fine-tuning yields only marginal improvements, indicating a fundamental architectural limitation in capturing kinematic state transitions or integrating low-level motion cues from visual input alone.

Explicit Trajectory Encodings Recover Physical Reasoning

Providing explicit, appropriately formatted trajectory information as supplementary text drastically improves all metrics. The models' failure therefore stems not from a missing internal physics prior but from a severe inability to extract the necessary motion information from the visual stream alone.
Strikingly, when presented with only trajectory text (no images), most models achieve their best physical consistency and temporal ordering scores. Adding visual frames offers little to no benefit and, with suboptimal encodings, occasionally reduces performance due to modality misalignment.
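As an illustration of what a trajectory text encoding might look like, the sketch below serializes sampled ego poses into a small table and prepends it to the question. The column layout, units, precision, coordinate frame, and the helper names `encode_trajectory` and `build_prompt` are all assumptions; the summary does not specify the encoding format the paper uses.

```python
HEADER = "t[s] x[m] y[m] yaw[rad] v[m/s]"


def encode_trajectory(samples, precision=2):
    """Serialize (t, x, y, yaw, v) tuples as one fixed-precision row each."""
    rows = [HEADER]
    for sample in samples:
        rows.append(" ".join(f"{value:.{precision}f}" for value in sample))
    return "\n".join(rows)


def build_prompt(question, samples):
    """Combine the encoded trajectory with the natural-language query."""
    return (
        "Ego trajectory (hypothetical format, ego frame at t=0):\n"
        f"{encode_trajectory(samples)}\n\n"
        f"Question: {question}"
    )


# Made-up samples for a gentle left turn at roughly constant speed.
samples = [
    (0.0, 0.0, 0.0, 0.00, 5.0),
    (1.0, 5.0, 0.2, 0.08, 5.0),
    (2.0, 9.9, 0.9, 0.16, 5.1),
]
prompt = build_prompt("Is the vehicle turning left, turning right, or going straight?", samples)
```

A fixed column order and fixed precision keep the text compact and consistent across clips, which matters given the observation that poorly chosen encodings can hurt performance.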
Figure 4: Dataset curation and augmentation expand coverage of complex dynamic maneuvers, correcting for real-world low-dynamic bias.

Robustness and Generalizability

Sensitivity analysis over the semantic thresholds shows that the relative ordering of models is robust to large parameter sweeps (Kendall $\tau > 0.9$), indicating that the diagnosed architectural deficit, the "Perception Bottleneck", is not an artifact of benchmark calibration.
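The rank-stability check can be sketched with a plain Kendall $\tau$ computation between model scores under two threshold settings. The version below is the tau-a variant (tied pairs count as neither concordant nor discordant), which may differ from the variant used in the paper; the model names and scores are made up for illustration and are not results from the benchmark.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score dicts keyed by the same model names."""
    models = sorted(scores_a)
    concordant = 0
    discordant = 0
    for first, second in combinations(models, 2):
        sign = (scores_a[first] - scores_a[second]) * (scores_b[first] - scores_b[second])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up scores under a default and a perturbed threshold setting.
default_scores = {"model_a": 0.60, "model_b": 0.50, "model_c": 0.40, "model_d": 0.30}
perturbed_scores = {"model_a": 0.58, "model_b": 0.52, "model_c": 0.35, "model_d": 0.37}
tau = kendall_tau(default_scores, perturbed_scores)
```

A value near 1 means the perturbation left the model ranking essentially unchanged; in this toy example one pair swaps order, so the value is well below the $\tau > 0.9$ level reported for the benchmark.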

Interactive Infrastructure

The benchmark is supplemented with a web-based human-in-the-loop clip viewer, providing synchronized video, kinematic plots, and QA metadata for model auditing, traceability, and further analysis.
Figure 5: An interactive dashboard for synchronized inspection of video, kinematic signals, and QA pairs, supporting human verification of label and model prediction alignment.

Broader Implications and Outlook

The results expose a structural decoupling in current foundation model architectures: physically valid ego-motion logic is represented almost exclusively within the language modality and is neither derived from nor further anchored by vision. This architectural asymmetry leaves downstream embodied AI systems susceptible to trivial errors, limited generalization, and potentially catastrophic physical reasoning failures, even when their high-level semantic and planning performance on prior benchmarks appears strong.

For the research community, EgoDyn-Bench presents a necessary diagnostic axis for the development and validation of physical AI:

Model Design: Diagnostics suggest that pre-training and fine-tuning paradigms for vision-centric models require explicit architectural or training-signal alignment between image sequences and dynamic kinematics, not just additional scale or language-driven instruction following.
Evaluation and Benchmarking: Existing benchmarks that focus on downstream planning or high-level QA are inadequate to guarantee safety or robust physical consistency in embodied, closed-loop scenarios.
Future Research Directions: There is a clear need for new pretext or auxiliary tasks, contrastive-learning frameworks, or cross-modal embedding structures that force deeper coupling of the visual and physical reasoning streams. Structured input representations (e.g., trajectory time series) also merit further exploration, in particular how they can be integrated during training to form robust, scalable motion priors.

Conclusion

EgoDyn-Bench establishes, with rigorous empirical support, that vision-centric foundation models for autonomous driving fundamentally fail to ground ego-motion understanding in visual perception. Existing architectures default to language priors and require explicit kinematic input for physically consistent reasoning. Addressing the decoupling of vision and physical reasoning is the outstanding challenge for the field. The benchmark, dataset, codebase, and analysis tools provided are essential infrastructure for developing the next generation of physically aligned, embodied AI systems.

For in-depth experimental design, calibration rules, and additional ablations, consult the paper and its supplementary material: "EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving" (arXiv:2604.22851).
