QuantiPhy: Quantitative Reasoning in VLMs
- QuantiPhy is a benchmark for quantitative physical reasoning that evaluates vision-language models’ capacity to infer kinematics from videos using a numerical prior.
- It formalizes scale-inference by converting pixel-space measurements to world units via a scalar factor, bridging the gap between qualitative plausibility and numerical accuracy.
- The benchmark covers diverse 2D and 3D scenarios, emphasizing challenges such as temporal fidelity, counterfactual priors, and precise kinematic scaling.
Searching arXiv for the QuantiPhy paper and closely related benchmark context. Found the QuantiPhy benchmark paper on arXiv and verified its metadata for citation use: (Puyin et al., 22 Dec 2025). QuantiPhy is a benchmark for quantitative physical reasoning in vision-LLMs (VLMs). It evaluates whether a model can infer object kinematics from video by producing numerical estimates in real-world units for size, velocity, and acceleration, given one physical prior as text. The benchmark was introduced to address a gap in prior physical reasoning evaluation, which was described as predominantly VQA-based and qualitative, and it reports that contemporary VLMs often generate physically plausible answers while failing to reason faithfully and numerically from the actual video and explicit priors (Puyin et al., 22 Dec 2025).
1. Conceptual scope and task definition
QuantiPhy formalizes quantitative physical reasoning as a scale-inference problem. Given a video and a single prior for a source object drawn from size, velocity at time , or acceleration at time , the model is asked to infer requested kinematic properties of a target object in world space. The source and target objects may be identical or distinct. The benchmark distinguishes pixel space, measured in , , and , from world space, measured in , , and (Puyin et al., 22 Dec 2025).
For a fixed-camera video with object pixel position , QuantiPhy uses finite-difference kinematics in pixel space:
0
A scalar factor 1 maps pixel-space quantities to world-space quantities: 2 This makes the benchmark’s central question explicit: whether a VLM can infer the relevant pixel-space quantity from video, recover 3 from the given prior, and rescale to obtain the requested world-space answer.
The benchmark is restricted to translational motion. It does not include rotational dynamics, dynamic camera motion, rich multi-body interaction as a central benchmark component, or deformable-body quantitative reasoning as a core benchmark target. This delimitation is important because QuantiPhy is designed as a focused test of kinematic grounding rather than a general physics benchmark.
2. Dataset composition and benchmark taxonomy
QuantiPhy contains 569 unique videos and 3355 video-text instances or questions, and is also described as comprising more than 3.3K video-text instances. Typical videos are 2–3 seconds long, with total storage of about 115 MB. Each dataset item is a triplet
4
A single video can support multiple question-prior pairs (Puyin et al., 22 Dec 2025).
The dataset combines three principal sources. Blender simulation provides controlled and physically measurable scenes, including both 2D and 3D motion, and supports unusual scales such as microscopic or astronomical scenes. Lab captures use multi-view stereo and depth-based annotation, and include free fall, sliding down slopes, pendulum motion, and bouncing, with rigid, rollable, and some deformable objects. Internet scraping and author-recorded real videos are selected for static camera and usable physical references, but are restricted to 2D inference because calibrated depth is unavailable.
An additional segmented subset is produced using SAM 2 to isolate moving objects on plain backgrounds. According to the benchmark description, this creates denoised variants and effectively doubles the dataset without additional annotation, enabling controlled experiments on background complexity.
QuantiPhy is organized along two main axes: dimensionality and prior type. Dimensionality is split into 2D, where motion lies in the image-parallel plane and depth is approximately constant, and 3D, where depth variation is present and an additional depth prior is supplied. Prior type is split into static prior, consisting of object size 5, and dynamic prior, consisting of velocity 6 or acceleration 7. These axes define four benchmark categories: 2D-Static (2S), 2D-Dynamic (2D), 3D-Static (3S), and 3D-Dynamic (3D).
The appendix further defines a 4-character video code comprising prior type 8, dimension 9, object setting 0 for single versus multiple relevant objects, and background 1 for plain, simple, and complex scenes. This yields 36 fine-grained categories, all represented in the dataset. The video counts by dimensionality are 328 for 2D and 241 for 3D. The counts by source are 300 Blender videos, 72 Internet videos, 112 captured videos, and 85 segmented videos, which sum to 569.
3. Construction of tasks, ground truth, and scoring
QuantiPhy constructs tasks over size, velocity, and acceleration, always with one among these quantities given as prior. In size tasks, a world-space size prior is supplied, such as the length of a car, and the model may be asked to estimate another extent of the same object, another object’s size, or minimum or maximum size under articulated motion. In velocity tasks, a velocity prior at a timestamp 2 is provided, and the model must infer another object’s speed at a different time or use the prior object as a scale reference. In acceleration tasks, an acceleration prior at a timestamp 3 is provided, after which the model may be asked for another object’s acceleration or another quantity derivable once the scale is fixed. The benchmark description states explicitly that acceleration is the hardest case because it requires second-order temporal differencing and is sensitive to frame rate, blur, and noise (Puyin et al., 22 Dec 2025).
Ground truth is source-dependent. In Blender, most object sizes are extracted directly from world-space dimensions, typically via axis-aligned bounding box extents. For articulated humans, the benchmark records rest standing height, minimum walking height, and maximum walking height; for flying animals, it records minimum and maximum width during flight. In lab data, sizes are physically measured with rulers or calipers before filming. In Internet data, annotators manually measure pixel sizes and use a known physical reference to convert to metric scale.
Velocity ground truth is based on displacement over time: 4 For Blender, per-frame world-space positions are read from object transforms. For lab data, object centers are reconstructed into world coordinates from clicked image points plus metric depth. For Internet data, the benchmark computes pixel-space velocities and converts them using the scale factor 5.
Acceleration ground truth is defined as
6
In Blender, scalar acceleration is computed from framewise speeds; lab and Internet settings use second-order finite differences in world or pixel coordinates, respectively, with Internet accelerations rescaled to world space via 7.
For 2D and Internet settings, the appendix gives the scale factor explicitly: 8 Once 9 is known, target size, velocity, and acceleration follow by direct rescaling. The benchmark is therefore structured around an intended reasoning pipeline: measure a pixel-space quantity from video, infer scale from the supplied prior, apply the scaling relation, and output the requested world-space quantity.
The primary metric is Mean Relative Accuracy (MRA). For prediction 0 and ground truth 1, relative error is
2
The main paper defines confidence thresholds
3
and
4
The appendix gives a different threshold set,
5
while retaining the same functional form. The benchmark description identifies this as an inconsistency between the main body and appendix.
Category-level scoring averages MRA over valid numerical answers in a category, and the overall score is the unweighted mean across the four main categories 6. Output parsing is strict: each question may be retried up to 5 times, stopping as soon as a parseable number appears; if no valid number is produced after 5 attempts, the question is treated as failed. The parser checks exact numeric output, searches for delimiters such as =, Final Answer:, and Answer:, strips units, regex-extracts numbers, and uses the last valid number.
4. Prompting protocol and evaluation setup
A defining feature of QuantiPhy is a standardized prompting protocol intended to make model comparison fair. The input is programmatically structured as
7
with all frames retained and videos normalized to 480p. The benchmark explicitly prioritizes temporal fidelity over higher spatial resolution (Puyin et al., 22 Dec 2025).
The system prompt uses a persona-like instruction, including the form “You are an expert video analyst...”. Ground-truth prior information injects exactly one prior—size, velocity, or acceleration—and, for 3D scenes, depth information as well. The question then requests a target quantity, and the post-prompt enforces output format, for example by requiring “Provide ONLY the numerical answer with units” and “No explanation or reasoning needed.” The prompts instruct models to “analyze the video and calculate the answer carefully” and to “output only the numerical answer and unit.”
QuantiPhy evaluates 21 state-of-the-art VLMs: 6 proprietary models and 15 open-weight models. The proprietary set comprises ChatGPT-5.1, ChatGPT-5, Gemini-2.5 Flash, Gemini-2.5 Pro, Grok-4.1 (Fast Reasoning), and Claude-4.5 Sonnet. The open-weight set comprises Qwen3-VL-Instruct-32B, InternVL-3.5-30B, Qwen3-VL-Instruct-8B, InternVL-3.5-8B, Molmo-7B, Phi-4-Multimodal-Instruct, Qwen3-VL-Instruct-2B, SmolVLM-Instruct, InternVL-3.5-2B, VILA-7B, CogVLM2 Video, Phi-3-Mini-128K-Instruct-3.8B, LLaVA-13B, MiniCPM-V 4.5, and Fuyu-8B.
The evaluation protocol standardizes video resolution, prompt template, answer parsing, retry policy, MRA computation, and the overall-score definition. Decoding temperatures are usually 0 to 0.1. OpenAI models are allowed up to 10,000 tokens, while open models generally use 500–2048 tokens. The paper presents this uniform protocol as an explicit contribution because it reduces confounds from prompt variation and answer-format handling.
Several controlled analysis conditions are also included. One axis analyzes background complexity via six scene types 8, encoding single versus multiple moving objects and plain, simple, or complex backgrounds. Another compares the default video-plus-prior condition with a prior-only ablation. A counterfactual-prior analysis replaces the given prior by a scaled version with
9
for which faithful quantitative reasoning would imply
0
Finally, the paper tests a chain-of-thought decomposition into four explicit steps: source property in pixels, proportional relationship between pixels and kinematic scale, target property in pixels, and target property in real-world units.
5. Quantitative results and comparative performance
The main results are reported as MRA percentages for the four benchmark categories and their average. The human baseline is 50.0 on 2S, 59.1 on 2D, 55.2 on 3S, 57.9 on 3D, and 55.6 on average. Among evaluated models, the best overall score is achieved by ChatGPT-5.1 with 53.1 average MRA, while the best open-weight result is Qwen3-VL-Instruct-32B with 46.0 average MRA. No evaluated model exceeds the overall human average, although ChatGPT-5.1 slightly exceeds the human score on 2D-Dynamic (Puyin et al., 22 Dec 2025).
| System | Avg MRA | Note |
|---|---|---|
| Human baseline | 55.6 | Highest overall |
| ChatGPT-5.1 | 53.1 | Best model overall |
| Gemini-2.5 Pro | 49.6 | Best non-OpenAI proprietary result listed |
| Gemini-2.5 Flash | 48.6 | Strong proprietary result |
| Qwen3-VL-Instruct-32B | 46.0 | Best open-weight model |
The proprietary-model results reported in the benchmark are: ChatGPT-5.1 at 1 across 2; Gemini-2.5 Pro at 3; Gemini-2.5 Flash at 4; Grok-4.1 at 5; ChatGPT-5 at 6; and Claude Sonnet 4.5 at 7. The open-weight results show a broad spread, from Qwen3-VL-Instruct-32B at 46.0 average to Fuyu-8B at 12.5 average.
Within model families, the benchmark reports clear scaling trends. For Qwen3-VL, the average MRA rises from 29.0 at 2B to 38.8 at 8B and 46.0 at 32B. For InternVL, it rises from 25.0 at 2B to 35.4 at 8B and 40.7 at 30B. These results show that scaling improves performance substantially, especially on dynamic categories, but does not close the gap to the human baseline or to the strongest proprietary systems.
The four categories are indexed by dimensionality and prior type rather than by target quantity directly, but the benchmark narrative states that static categories center on size priors, dynamic categories on velocity or acceleration priors, and dynamic categories typically score higher than static categories for the best models. The paper also states that 3D tasks remain difficult despite the provision of depth priors. This suggests that temporal cues are somewhat more usable than precise geometric calibration for current models, although the benchmark does not present that as a formal theorem.
6. Failure modes, interpretive conclusions, and limitations
QuantiPhy’s most prominent analytical result is the gap between qualitative plausibility and quantitative correctness. In the video-plus-prior, prior-only, counterfactual-prior, and chain-of-thought analysis subset, several models retain surprisingly strong performance when the video is removed. ChatGPT-5.1 scores 56.1 with video plus prior but 39.0 with prior only; Gemini-2.5 Pro scores 60.9 versus 46.1; Grok-4.1 scores 47.5 versus 44.3. The benchmark interprets this as evidence that models can often rely on object-category semantics, memorized typical sizes, and generic physical expectations rather than precise visual measurement (Puyin et al., 22 Dec 2025).
Counterfactual priors produce a much sharper degradation. The benchmark states that if models were doing faithful quantitative reasoning, replacing the prior with 8-scaled values should produce correspondingly scaled answers. Instead, performance collapses: most models lose about 80%, and the strongest models still lose about 70%. This is presented as one of the benchmark’s strongest findings, indicating that models do not reliably obey the supplied numerical prior.
Chain-of-thought prompting also fails to provide a general remedy. Only a few models improve with structured decomposition; most perform worse. The benchmark attributes this to brittleness in intermediate numeric subproblems and error propagation across steps. This result is notable because it distinguishes verbal decomposition from actual grounded measurement ability.
Background and scene context produce subtler effects. SAM-denoised plain backgrounds help slightly relative to simple textured scenes, but complex scenes often outperform simpler ones. The paper hypothesizes that realistic backgrounds may provide extra metric cues, such as tiles, windows, road markings, and other structures with known scale. Multiple-object scenes consistently outperform single-object scenes, which the paper attributes to additional reference standards and relational cues. This suggests that context is not merely a source of noise; in some cases it is an informational scaffold for quantitative inference.
The appendix’s case studies on ChatGPT-5.1 illustrate the intended and failed modes of reasoning. One case follows the desired pipeline: identifying relevant frames, measuring object dimensions in pixels, computing a pixel-to-meter scale from the prior, and rescaling the target dimension. Other cases show failure: implausible counterfactual priors are rejected in favor of “typical” object knowledge, no-video ablations still produce reasonable estimates from semantic priors, and in a simulation with nonstandard acceleration near 9, the model defaults to Earth gravity 0 and derives 1 despite contradictory video evidence.
The paper’s overarching conclusion is that current VLMs have not established a reliable link between visual observations and quantitative physical facts. It identifies two dominant failure modes: weak reliance on the actual video and weak obedience to explicit numerical priors. QuantiPhy is presented as important because it moves evaluation beyond coarse VQA correctness into a regime where error magnitude matters. The benchmark description explicitly connects this to embodied AI, robotics, autonomous driving, AR/VR, and evaluation of physical realism in generated videos.
The stated limitations are equally explicit. The benchmark does not yet include rotational motion, moving cameras, rich deformable-object dynamics, more complex multi-body interactions, or fully unconstrained real-world scenes. It also notes that some Internet-video annotations are less precise than simulation or lab annotations because of monocular constraints. The future directions proposed in the paper include broader physics coverage, physics-informed objectives, explicit numerical supervision, training on physics-rich video corpora, stronger grounding between visual tokens and geometric measurements, and hybrid systems that combine VLMs with specialized tools for tracking, segmentation, geometry estimation, and finite differencing. A plausible implication is that QuantiPhy is meant not only as an evaluation benchmark but also as a diagnostic instrument for separating verbal physical plausibility from numerically grounded physical reasoning.