QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models (2512.19526v1)

Published 22 Dec 2025 in cs.AI

Abstract: Understanding the physical world is essential for generalist AI agents. However, it remains unclear whether state-of-the-art vision perception models (e.g., large VLMs) can reason about physical properties quantitatively. Existing evaluations are predominantly VQA-based and qualitative, offering limited insight into whether these models can infer the kinematic quantities of moving objects from video observations. To address this, we present QuantiPhy, the first benchmark designed to quantitatively measure a VLM's physical reasoning ability. Comprising more than 3.3K video-text instances with numerical ground truth, QuantiPhy evaluates a VLM's performance on estimating an object's size, velocity, and acceleration at a given timestamp, using one of these properties as an input prior. The benchmark standardizes prompts and scoring to assess numerical accuracy, enabling fair comparisons across models. Our experiments on state-of-the-art VLMs reveal a consistent gap between their qualitative plausibility and actual numerical correctness. We further provide an in-depth analysis of key factors like background noise, counterfactual priors, and strategic prompting and find that state-of-the-art VLMs lean heavily on pre-trained world knowledge rather than faithfully using the provided visual and textual inputs as references when reasoning about kinematic properties quantitatively. QuantiPhy offers the first rigorous, scalable testbed to move VLMs beyond mere verbal plausibility toward a numerically grounded physical understanding.

Summary

  • The paper introduces a novel benchmark that quantitatively evaluates physical reasoning in VLMs by measuring kinematic properties from videos.
  • It employs diverse data sources and controlled experiments to reveal that models rely more on pre-trained priors than on precise pixel-level inferences.
  • Experimental results show that even top models fall short of human-level reasoning, highlighting the need for innovative training and architectural strategies.

QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models

Introduction and Motivation

QuantiPhy addresses the persistent gap in evaluating vision-language models (VLMs) on quantitative physical reasoning. Existing benchmarks for VLMs and embodied AI largely rely on qualitative or multiple-choice visual question answering (VQA) formats, fundamentally limiting insights into whether models genuinely infer or merely memorize kinematic properties such as size, velocity, and acceleration. Human-level physical reasoning routinely integrates visual observation with quantitative estimation, a capacity critical in downstream domains such as autonomous driving, robotics, and AR/VR. QuantiPhy systematically formalizes VLM evaluation on the task of quantitatively inferring kinematic properties from videos, positioning itself as a necessary next step for the field.

Benchmark Task Design

QuantiPhy defines the quantitative kinematic inference task as follows: Given a short video and a single prior on a physical property (either object size, instantaneous velocity, or acceleration, specified in real-world units as text), the model is asked to estimate other kinematic quantities of a (possibly different) target object at specified times. The task is always grounded in world coordinates: the model cannot stop at raw pixel measurements but must leverage the given prior to convert them into real-world-scale estimates.
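
For concreteness, a single benchmark instance can be thought of as a (video, prior, question, ground truth) tuple. The sketch below is an illustrative schema with hypothetical field names, not the released data format:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class QuantiPhyInstance:
    """Illustrative schema for one video-text instance (field names are hypothetical)."""
    video_path: str                                           # short clip of the scene
    dimensionality: Literal["2D", "3D"]                       # movement dimensionality axis
    prior_kind: Literal["size", "velocity", "acceleration"]   # static (size) vs. dynamic prior
    prior_text: str      # e.g., "The reference cube is 0.30 m wide." (real-world units)
    question: str        # e.g., "What is the ball's speed at t = 1.5 s, in m/s?"
    ground_truth: float  # numerical answer, in the unit named in the question
```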

The benchmark comprises four core task combinations based on two axes—movement dimensionality (2D vs. 3D) and the nature of the prior (Static: size; Dynamic: velocity or acceleration):

  • 2D-Static: Inferring quantities in planar scenes with static priors.
  • 2D-Dynamic: Inferring in 2D with motion-derived priors.
  • 3D-Static: Volumetric reasoning with static priors.
  • 3D-Dynamic: Inferring in 3D with temporal (dynamic) priors.

Figure 1: Sample examples from QuantiPhy, illustrating the four core task combinations defined by Dimensionality {2D, 3D} and Physical Prior {Static, Dynamic}.

Videos and associated tasks are balanced across these categories to cover diverse kinematic events and difficulty regimes.

Dataset Collection and Annotation

Video data is curated across three sources: Blender simulation, multi-camera lab captures, and carefully selected internet videos. Blender-generated content allows precise control over motions and environment, providing high-fidelity ground truth for size, velocity, and acceleration. Lab-captured videos are annotated using a custom interface for pixel-accurate depth readings. Annotators further ensure world-to-pixel mapping through visible reference objects and measure pixel-space quantities, which, combined with the given prior, yield scaling factors for quantitative conversion.
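
As a worked illustration of this scaling step (all numbers invented, not drawn from the dataset), a stated real-world size and its measured pixel extent yield a meters-per-pixel factor that converts pixel displacements into world-frame velocities:

```python
# Worked example of the pixel-to-world conversion described above (all numbers invented).
prior_size_m = 0.30      # real-world width of the reference object, given as the prior (m)
prior_size_px = 120.0    # measured pixel width of that object in the frame

scale_m_per_px = prior_size_m / prior_size_px   # 0.0025 m per pixel

displacement_px = 480.0  # target object's pixel displacement between two frames
dt_s = 0.5               # elapsed time between those frames (s)

velocity_m_per_s = displacement_px * scale_m_per_px / dt_s
print(f"Estimated speed: {velocity_m_per_s:.2f} m/s")   # 2.40 m/s
```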

To probe the effect of visual context, videos are rendered or post-processed with varying background complexity (uniform color, simple, or cluttered/complex) and with either a single moving object or multiple moving objects for reference.

Figure 2: The construction of QuantiPhy proceeds in three sequential stages: collection of raw videos from three sources, segmentation with solid backgrounds, and annotation of (prior, question, ground truth) triplets.

Evaluation Protocol and Metric

Evaluation covers 21 state-of-the-art VLMs (6 proprietary, 15 open-weight), with human baselines contextualized via a controlled user study using the same interface and priors. The main metric is Mean Relative Accuracy (MRA): answers receive partial credit based on whether their relative error falls below a set of strict thresholds, yielding a robust quantitative assessment that avoids the brittleness of exact-match or the insensitivity of qualitative scoring.
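
The exact MRA formula is not reproduced in this summary; a common formulation from related quantitative benchmarks averages a pass/fail indicator over a sweep of relative-error thresholds, sketched below. The threshold set {0.50, 0.55, ..., 0.95} is an assumption for illustration:

```python
import numpy as np

def mean_relative_accuracy(pred: float, gt: float,
                           thresholds=np.arange(0.50, 1.00, 0.05)) -> float:
    """Partial credit: fraction of thresholds theta with relative error below 1 - theta."""
    rel_err = abs(pred - gt) / abs(gt)
    return float(np.mean([rel_err < (1.0 - t) for t in thresholds]))

# A prediction with 10% relative error passes the 0.50-0.85 thresholds (8 of 10).
print(mean_relative_accuracy(pred=2.2, gt=2.0))   # ~0.8
```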

Experimental Results and Analysis

Aggregate Model Performance

Closed-source models (ChatGPT-5.1, Gemini-2.5 Pro/Flash) perform best, with top MRA scores of ~53.1—comparable to, but not exceeding, human annotator averages. The best open-weight model, Qwen3-VL-Instruct-32B, closely follows, while smaller or earlier systems lag substantially. Notably, scaling within a model family (e.g., Qwen3, InternVL) improves quantitative reasoning, especially in dynamic (temporally variable) tasks.

Discrepancy Between Visual Input and Model Output

Despite access to precise video and textual input, most models systematically underperform the theoretical optimum achievable by exact pixel-level algebraic computation. Analysis confirms a strong bias toward pre-trained world knowledge and semantic priors, rather than faithful exploitation of visual cues and explicit numerical priors. The gap becomes pronounced in the counterfactual setting, where models are given implausible priors; their outputs remain anchored to memorized magnitudes, disregarding the altered input.
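
The counterfactual probe has a simple numerical signature, sketched below with invented numbers: scaling the textual prior by a factor k should scale a pixel-faithful answer by the same k, since the pixel measurements themselves do not change, whereas a prior-anchored model leaves its answer near the memorized value.

```python
# Counterfactual consistency check (illustrative numbers only).
# Pixel measurements are unchanged when only the textual prior is edited,
# so a pixel-faithful estimate must scale in proportion to the prior.
baseline_prior_m = 0.30        # plausible size prior
baseline_answer_mps = 2.4      # model's speed estimate under the plausible prior

k = 10.0                       # counterfactual prior: object claimed to be 3.0 m wide
faithful_answer_mps = baseline_answer_mps * k   # 24.0 m/s expected from input-faithful reasoning
anchored_answer_mps = 2.4      # observed failure mode: answer stays near the memorized scale

print(faithful_answer_mps, anchored_answer_mps)
```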

Figure 3: Effect of scene context. Mean Relative Accuracy (MRA) for all models is plotted across background complexity and object count conditions, illustrating mild background effects and strong gains from multiple objects.

Qualitative Case Studies

Case analyses of top-performing systems underscore this lack of input faithfulness. In tractable cases, models apply the correct pixel-to-world scaling and produce accurate inferences; but under counterfactual or ambiguous priors, they revert to statistical heuristics or canonical values (e.g., reporting 9.8 m/s² for gravity regardless of simulation parameters).

Figure 4: Faithful pixel–prior reasoning in which the model executes the correct sequence of pixel measurement, scale estimation, and quantitative inference.

Figure 5: Counterfactual prior breaks faithfulness; model ignores the modified prior and defaults to memorized size ratios.

Figure 6: Video ablation reveals reliance on priors; with video removed, model still outputs plausible values derived from parametric knowledge.

Figure 7: Strong gravitational prior overrides counterfactual physics, with the model hallucinating canonical gravity regardless of video evidence.

Prompting and Structured Reasoning

Chain-of-thought (CoT) prompting, which decomposes the task into subproblems (pixel measurement, scale calculation, inference), fails to close the gap or significantly enhance accuracy except in isolated cases. Error propagation often leads to degraded performance, indicating limitations in robust intermediate quantitative reasoning ability.
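
For reference, the decomposition can be expressed as a prompt template along the following lines (a hypothetical paraphrase; the paper's exact prompt wording is not reproduced here):

```python
# Hypothetical chain-of-thought prompt reflecting the three-step decomposition above
# (pixel measurement -> scale calculation -> inference); not the paper's exact wording.
COT_TEMPLATE = """You are given a video and a prior: {prior_text}
Reason step by step:
1. Measure the relevant quantities in pixels (object extent, displacement, timestamps).
2. Use the stated prior to compute a pixel-to-world scale factor.
3. Convert the pixel measurements to real-world units.
Question: {question}
Final answer (a single number with its unit):"""

prompt = COT_TEMPLATE.format(
    prior_text="The reference cube is 0.30 m wide.",
    question="What is the ball's speed at t = 1.5 s, in m/s?",
)
print(prompt)
```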

Implications and Future Directions

Theoretical implications are clear: The dominant VLMs, even at scale, do not yet robustly ground quantitative physical inference in pixel-level visual evidence and explicit numeric priors. They instead behave as powerful pattern-matching engines optimizing for verbal plausibility, not physically faithful measurement or extrapolation.

Practically, this limits their deployment in safety-critical applications such as embodied robotics, self-driving systems, and computational science domains where faithful, context-sensitive quantitative reasoning is non-negotiable.

The observed failures—such as the inability to adjust predictions with counterfactual priors or to break from rigid parametric memory when tasked with synthetic physics—strongly motivate new VLM objectives and training paradigms. Examples include tighter integration of explicit geometric solvers, training on synthetic counterfactuals, or embedding differentiable physics simulators in model architecture.

QuantiPhy's design as a scalable, balanced, and diagnostic benchmark provides a substrate for such research, distinguishing true physical measurement from plausible guessing and enabling future model ablations, architectural innovations, and robust quantitative performance gains. Expanding the benchmark to rotational/multi-body dynamics, non-rigid objects, and moving cameras is suggested as a critical next step.

Conclusion

QuantiPhy establishes a rigorous new standard for evaluating physical reasoning in vision-language models at the quantitative level. Assessment of 21 models reveals that current systems, including the strongest proprietary VLMs, are not yet capable of reliable, input-faithful quantitative physical inference. Advances in architecture, training, and data are essential to push VLMs beyond statistical plausibility toward true scientific and embodied intelligence. The benchmark and findings chart a clear experimental and methodological path for future work in physically grounded AI reasoning.
