Quantitative physical reasoning capability of state-of-the-art vision-language models

Determine whether state-of-the-art vision-language models can reason about physical properties quantitatively from video observations, specifically by inferring object kinematic quantities such as size, velocity, and acceleration in real-world units rather than merely making qualitative judgments.

Background

The paper introduces QuantiPhy, a benchmark designed to measure whether vision-language models (VLMs) can perform numerically grounded physical reasoning from videos. Prior work has largely focused on qualitative, multiple-choice VQA settings, which cannot differentiate between near-correct and grossly incorrect numerical predictions.

By formalizing tasks around estimating size, velocity, and acceleration from provided physical priors and evaluating 21 models, the authors find that current systems often rely on pre-trained world knowledge rather than on input-faithful quantitative inference. This motivates the unresolved question of whether contemporary VLMs truly possess quantitative physical reasoning capability when grounded in visual evidence.
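The contrast between numeric grading and multiple-choice grading can be made concrete with a small sketch. The per-quantity relative-error metric, the class and function names, and the example numbers below are illustrative assumptions, not the paper's actual evaluation code; they only show how a quantitative benchmark can separate near-correct from grossly incorrect predictions in real-world units.

```python
# Minimal sketch (assumed, not QuantiPhy's official scoring) of grading numeric
# kinematic estimates rather than multiple-choice answers.
from dataclasses import dataclass


@dataclass
class KinematicEstimate:
    """Numeric predictions for one video, in real-world units."""
    size_m: float        # object size in meters
    velocity_mps: float  # speed in meters per second
    accel_mps2: float    # acceleration in meters per second squared


def relative_error(pred: float, truth: float) -> float:
    """|pred - truth| / |truth|: small for near-correct estimates,
    large for grossly incorrect ones."""
    return abs(pred - truth) / abs(truth)


def score_estimate(pred: KinematicEstimate, truth: KinematicEstimate) -> dict:
    """Per-quantity relative errors for one video."""
    return {
        "size": relative_error(pred.size_m, truth.size_m),
        "velocity": relative_error(pred.velocity_mps, truth.velocity_mps),
        "acceleration": relative_error(pred.accel_mps2, truth.accel_mps2),
    }


if __name__ == "__main__":
    # Hypothetical ground truth and two model predictions.
    truth = KinematicEstimate(size_m=0.24, velocity_mps=3.1, accel_mps2=9.8)
    near = KinematicEstimate(size_m=0.25, velocity_mps=3.0, accel_mps2=9.5)
    gross = KinematicEstimate(size_m=2.0, velocity_mps=30.0, accel_mps2=1.0)

    print("near-correct errors:", score_estimate(near, truth))
    print("grossly wrong errors:", score_estimate(gross, truth))
```

A multiple-choice format would give both predictions the same binary credit if they mapped to the same option, whereas a continuous error metric like this preserves how far each estimate is from the ground truth.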

References

However, it remains unclear whether state-of-the-art vision perception models (e.g., large VLMs) can reason about physical properties quantitatively.

QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models (Puyin et al., 22 Dec 2025, arXiv:2512.19526), Abstract, page 1.