PhyFPS-Bench-Real: Real-Video FPS Benchmark

Updated 4 July 2026

The paper introduces PhyFPS-Bench-Real as a benchmark for recovering physical frame rate directly from visual dynamics using real videos with verified PhyFPS labels.
It employs controlled temporal resampling methods, including sharp capture, motion blur, and synthetic rolling shutter, to generate reliable lower-rate video clips.
The benchmark’s evaluation on 4,000 disjoint clips shows that specialized estimators like Visual Chronometer outperform generic vision models, highlighting issues like chronometric hallucination.

PhyFPS-Bench-Real is the real-video benchmark introduced in "The Pulse of Motion: Measuring Physical Frame Rate from Visual Dynamics" for evaluating whether a model can recover the physical time scale of motion directly from appearance and dynamics rather than from container metadata (Gao et al., 15 Mar 2026). In that framework, the benchmark serves as the paper’s measurement-validity testbed: it contains real-world video clips with verified Physical FPS labels, and it is used to establish that Visual Chronometer can estimate the Physical Frames Per Second (PhyFPS) implied by motion itself. The benchmark is therefore distinct from a generator audit benchmark; its purpose is to validate the estimator on temporally trustworthy real videos before that estimator is used to diagnose chronometric hallucination in generated videos (Gao et al., 15 Mar 2026).

1. Conceptual definition and benchmark role

The benchmark is motivated by a distinction between PhyFPS and meta FPS. In the paper, Physical Frames Per Second (PhyFPS) is defined as the frame rate corresponding to the true physical passage of time implied by the visual motion itself, whereas meta FPS is the nominal playback or file-level frame rate stored in metadata. The benchmark exists because the paper argues that meta FPS can be wrong, inherited from post-processing, or otherwise uninformative about the physically sampled time scale of the scene (Gao et al., 15 Mar 2026).

Within that framing, PhyFPS-Bench-Real is the benchmark for single-video PhyFPS estimation from raw frames on real clips with trustworthy labels. The paper contrasts it explicitly with PhyFPS-Bench-Gen. The former uses real videos with verified labels to evaluate prediction accuracy, while the latter uses generated videos without ground-truth physical labels to audit alignment and stability through the estimator’s predictions. The operational distinction is central: PhyFPS-Bench-Real validates the instrument; PhyFPS-Bench-Gen uses the instrument (Gao et al., 15 Mar 2026).

The broader significance of this distinction is methodological. The paper’s claim that contemporary video generators suffer from chronometric hallucination depends on having an estimator that is accurate on real videos whose physical timing is known. PhyFPS-Bench-Real is the empirical basis for that claim.

2. Dataset construction and temporal trustworthiness

PhyFPS-Bench-Real is drawn from a curated corpus built under a strict criterion: only sources where the nominal capture rate is trusted to equal the physical sampling rate are used as the base material. The paper states that the data are curated “exclusively from video sources where the nominal metadata frame rate perfectly aligns with the real-world physical sampling rate (i.e., meta FPS = PhyFPS), strictly excluding videos with ambiguous post-hoc time-scale editing.” This exclusion principle is a defining property of the benchmark (Gao et al., 15 Mar 2026).

The benchmark is partitioned from a larger curated dataset assembled from five source categories:

Source category	Examples	Stated rationale
High-Frame-Rate Academic Datasets	Adobe240, BVI-VFI	Precise temporal analysis and frame interpolation
Raw Broadcast Sequences	UVG	Raw pipeline reduces hidden temporal remapping
Sensor-Synchronized Autonomous Data	NVIDIA, Honda	Camera/LiDAR/IMU synchronization is said to guarantee physical time-scale integrity
Physics-Grounded Human Motion	Mehta et al.	Included for biomechanical realism
Verified In-House Data	Internally collected video	Controlled capture settings with verified frame-rate metadata

The benchmark uses a strict cross-source split: training, validation, and test sets are derived from entirely disjoint video sources. This is one of its most consequential design choices, because it is intended to make test performance reflect generalization across capture domains rather than memorization of source-specific statistics or artifacts (Gao et al., 15 Mar 2026).
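The cross-source principle can be illustrated with a minimal sketch that assigns whole sources, rather than individual clips, to each split. The function name, the `(clip_id, source_id)` record layout, and the split fractions are illustrative assumptions; the paper's actual partitioning procedure is not described here.

```python
import random
from collections import defaultdict

def cross_source_split(clips, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole sources (not individual clips) to train/val/test.

    `clips` is a list of (clip_id, source_id) pairs. The fractions apply to
    the number of sources, so clip counts per split are only approximate.
    All names and defaults here are hypothetical.
    """
    by_source = defaultdict(list)
    for clip_id, source_id in clips:
        by_source[source_id].append(clip_id)

    sources = sorted(by_source)            # deterministic order before shuffling
    random.Random(seed).shuffle(sources)

    n_train = int(len(sources) * fractions[0])
    n_val = int(len(sources) * fractions[1])
    groups = {
        "train": sources[:n_train],
        "val": sources[n_train:n_train + n_val],
        "test": sources[n_train + n_val:],
    }
    # Every clip follows its source, so no source appears in two splits.
    return {name: [c for s in srcs for c in by_source[s]]
            for name, srcs in groups.items()}
```

Because the split is decided at the source level, any camera- or dataset-specific artifact seen in training is absent from the test split by construction.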

The paper gives two scale figures. The full dataset after augmentation contains 465,535 video clips, while the PhyFPS-Bench-Real test set contains 4,000 verified clips. All prepared clips are standardized to 128 frames. The distribution spans 18 target Physical Frame Rates:

$\{2, 5, 10, 12, 15, 18, 20, 24, 25, 30, 35, 40, 45, 50, 60, 90, 120, 240\}.$

The paper does not provide exact train/validation/test counts beyond the 4,000-clip test set, but it does specify the cross-source split principle and the target-rate coverage (Gao et al., 15 Mar 2026).

3. Controlled resampling, labels, and task formulation

Although PhyFPS-Bench-Real evaluates on real verified clips, the underlying dataset is expanded through controlled temporal resampling. Source videos are first upsampled to a common high-rate base, giving a high-rate video $I^H$ at base frequency $F_H = 240$ FPS. For a target lower rate $F_L$, the downsampling ratio is

$N = \frac{F_H}{F_L}.$
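As a small worked illustration, the ratio can be computed exactly for each of the 18 target rates listed earlier. The helper name below is an assumption made for this sketch; only $F_H = 240$ and the rate list come from the text.

```python
from fractions import Fraction

F_H = 240  # base frequency in FPS, as stated in the text
TARGET_RATES = [2, 5, 10, 12, 15, 18, 20, 24, 25, 30,
                35, 40, 45, 50, 60, 90, 120, 240]

def downsampling_ratio(f_low, f_high=F_H):
    """Exact ratio N = F_H / F_L, kept as a fraction to avoid rounding."""
    return Fraction(f_high, f_low)

ratios = {f: downsampling_ratio(f) for f in TARGET_RATES}

# Target rates whose ratio N is not a whole number of high-rate frames.
non_integer = [f for f, n in ratios.items() if n.denominator != 1]
```

For this rate list, six targets (18, 25, 35, 45, 50, and 90 FPS) give a non-integer $N$, which is where the floor applied to frame indices in the sampling formulas matters.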

The paper then synthesizes lower-rate videos using three camera-physics-inspired strategies (Gao et al., 15 Mar 2026).

For Sharp Capture (fast shutter),

$I^{L}_{k}=I^{H}_{\lfloor kN \rfloor}.$

This is uniform temporal subsampling.

For Motion Blur (variable exposure),

$I^{L}_{k}=\frac{1}{M}\sum_{i=0}^{M-1} I^{H}_{\lfloor kN \rfloor + i},$

with

$M \in \{N, N/2, N/4\}.$

For Synthetic Rolling Shutter, the frame is partitioned into progressive bands and each spatial location is sampled at a shifted temporal index,

$\lfloor kN \rfloor + \left\lfloor M \cdot \frac{x}{W} \right\rfloor,$

for column $x$ and width $W$, again with

$M \in \{N, N/2, N/4\}.$

These augmentations are not benchmark labels in themselves; rather, they define how supervision is generated from temporally trustworthy source material. Label provenance is therefore derived from trusted source frame rates and known target PhyFPS induced by the controlled resampling pipeline. No human raters are described for assigning PhyFPS labels on PhyFPS-Bench-Real, and no model-derived pseudo-labels are used (Gao et al., 15 Mar 2026).
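The three resampling strategies can be sketched in NumPy as follows. This is a minimal illustration of the stated formulas, not the paper's implementation: the function names, the `(T, H, W[, C])` array layout, the `exposure` parameter (standing in for the choice among $N$, $N/2$, $N/4$), and the rounding of $M$ to an integer when $N$ is fractional are all assumptions.

```python
import math
from fractions import Fraction
import numpy as np

def sharp_capture(high, start):
    """I^L_k = I^H_{floor(kN)}: pick a single high-rate frame."""
    return high[start]

def motion_blur(high, start, m):
    """Average M consecutive high-rate frames starting at floor(kN)."""
    return high[start:start + m].mean(axis=0)

def rolling_shutter(high, start, m):
    """Sample column x at temporal index floor(kN) + floor(M * x / W)."""
    width = high.shape[2]
    cols = np.arange(width)
    t_idx = start + (m * cols) // width          # one temporal index per column
    # Advanced indexing yields (W, H, ...); swap back to (H, W, ...).
    return high[t_idx, :, cols].swapaxes(0, 1)

def resample(high, f_low, mode="sharp", exposure=1, f_high=240):
    """Resample a (T, H, W[, C]) float array from f_high to f_low FPS.

    `exposure` in {1, 0.5, 0.25} selects M = N, N/2, or N/4. Rounding M to
    an integer (minimum 1) is an assumption for non-integer N.
    """
    n = Fraction(f_high, f_low)
    m = 1 if mode == "sharp" else max(1, round(n * Fraction(exposure)))
    frames, k = [], 0
    # Stop once the M-frame window would run past the end of the source.
    while math.floor(k * n) + m <= len(high):
        start = math.floor(k * n)
        if mode == "sharp":
            frames.append(sharp_capture(high, start))
        elif mode == "blur":
            frames.append(motion_blur(high, start, m))
        elif mode == "rolling":
            frames.append(rolling_shutter(high, start, m))
        else:
            raise ValueError(f"unknown mode: {mode}")
        k += 1
    return np.stack(frames)
```

In every mode the label attached to the output is simply `f_low`, which is why the supervision needs no human annotation as long as the source frame rate is trustworthy.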

The benchmark task is continuous regression from a single video clip. The input is the clip's raw frames, and the model predicts a scalar in log-space from which the predicted PhyFPS is recovered. The target PhyFPS is likewise normalized in log-space.

Although training support values are sampled from discrete target rates, the paper explicitly frames the problem as absolute continuous regression of PhyFPS rather than classification or ranking (Gao et al., 15 Mar 2026).
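A minimal sketch of a log-space regression target is shown below. The exact normalization used in the paper is not reproduced here, so the natural logarithm and both function names are assumptions made purely for illustration.

```python
import math

def to_log_target(fps):
    """Map a PhyFPS value to a log-space target (natural log assumed)."""
    if fps <= 0:
        raise ValueError("PhyFPS must be positive")
    return math.log(fps)

def from_log_prediction(y):
    """Invert the assumed log mapping to recover a PhyFPS value."""
    return math.exp(y)

# Equal frame-rate ratios become equal distances in log-space:
gap_low = to_log_target(24) - to_log_target(12)      # a doubling at low rates
gap_high = to_log_target(240) - to_log_target(120)   # a doubling at high rates
```

One plausible motivation for such a mapping is the wide range of target rates (2 to 240 FPS): in linear space a loss would be dominated by the highest rates, whereas in log-space a factor-of-two error costs the same anywhere in the range.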

4. Evaluation protocol and reported performance

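PhyFPS-Bench-Real is evaluated with Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) over the 4,000-clip test set:

$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| \hat{F}_i - F_i \right|, \qquad \mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n} \frac{\left| \hat{F}_i - F_i \right|}{F_i}.$

Here, $F_i$ is ground-truth PhyFPS and $\hat{F}_i$ is predicted PhyFPS for clip $i$, and $n$ is the number of test clips. MAE measures error in FPS units; MAPE measures proportional error in percentage terms (Gao et al., 15 Mar 2026).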
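The two metrics follow their standard definitions and can be computed as in the sketch below. The function names and the toy values are illustrative assumptions and are not benchmark data.

```python
import numpy as np

def mae(pred, truth):
    """Mean absolute error, in FPS units."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.mean(np.abs(pred - truth)))

def mape(pred, truth):
    """Mean absolute percentage error, in percent of the ground truth."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(100.0 * np.mean(np.abs(pred - truth) / truth))

# Toy example with made-up values (not taken from the benchmark).
truth = [24, 30, 60, 120]
pred = [25, 27, 66, 108]
toy_mae = mae(pred, truth)     # (1 + 3 + 6 + 12) / 4 = 5.5 FPS
toy_mape = mape(pred, truth)   # about 8.54 percent
```

The two metrics weight errors differently: a 1 FPS miss is a 50% error at 2 FPS but under 0.5% at 240 FPS, so MAPE emphasizes low-rate clips while MAE emphasizes high-rate ones.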

The average ground-truth PhyFPS over the test set is 38.81. The paper compares two variants of Visual Chronometer against several Vision-Language Models (VLMs) in both video-based and image-based prompting setups. The principal comparison is summarized below.

Method	Setup	Reported result
VC-Common	Proposed	Avg Pred 39.20, MAE 3.46, MAPE 9%
VC-Wide	Proposed	Avg Pred 45.48, MAE 7.76, MAPE 21%
Seed-1.6-Flash	Video-based VLM	Avg Pred 30.00, MAE 20.00, MAPE 40%
Gemini-3.1-Pro	Video-based VLM	Avg Pred 31.00, MAE 21.67, MAPE 43%
Qwen3.5+	Video-based VLM	Avg Pred 4.46, MAE 45.54, MAPE 91%
Seed-1.6-Flash	Image-based VLM	Avg Pred 30.00, MAE 20.00, MAPE 40%
Gemini-3-Flash	Image-based VLM	Avg Pred 1.77, MAE 48.23, MAPE 96%

The full table in the paper covers six VLMs (Gemini-3.1-Pro, Gemini-3-Flash, Seed-1.6, Seed-1.6-Flash, Qwen3.5+, and Qwen3.5-397B), each in video-based and image-based variants; the rows above are a selection. The strongest reported benchmark result is VC-Common: MAE 3.46, MAPE 9%. The broader-range VC-Wide is less precise, with MAE 7.76, MAPE 21%, which the paper interprets as the expected tradeoff from predicting over a larger frame-rate range (Gao et al., 15 Mar 2026).

The paper’s main conclusion from these results is that generic VLMs are not reliable PhyFPS estimators. It states that they “fail catastrophically,” often showing mode collapse. The explicit example is Seed-1.6-Flash, which predicts exactly 30 FPS for all inputs in one setup (Gao et al., 15 Mar 2026).

5. Ablations, inference regime, and benchmark-specific failure modes

The paper reports two ablation families directly on PhyFPS-Bench-Real. The first is a temporal augmentation ablation. Without motion blur or rolling shutter, a Naive Baseline yields MAE 5.12, MAPE 13%. Adding motion blur improves this to MAE 4.87, MAPE 11%. The full VC-Common configuration, which adds rolling shutter as well, achieves MAE 3.46, MAPE 9% (Gao et al., 15 Mar 2026).

This ablation is used to argue that camera-physics-aware augmentation is not incidental. The benchmark is specifically designed so that the predictor must recover physical timing under realistic capture effects rather than idealized frame subsampling. A plausible implication is that PhyFPS-Bench-Real measures not only motion-speed sensitivity but also robustness to acquisition artifacts that materially affect temporal cues.

The second ablation studies temporal context length by varying the number of frames in the inference window. The base model, trained on a maximum of 32 frames, reaches its best MAE of 3.46, extrapolates well to other window lengths, and benefits from post-training up to a maximum length of 128 without harming short-context accuracy. The paper reports a tradeoff: clips that are too short contain insufficient motion evidence, while contexts that are too long reduce the benefit of sliding-window ensembling and weaken the ability to capture local PhyFPS fluctuation. The reported sweet spot therefore lies at intermediate window lengths (Gao et al., 15 Mar 2026).
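The interaction between window length and sliding-window ensembling can be sketched as follows. The paper's ensembling procedure is not detailed here, so the stride, the averaging in log-space, and the `estimate_fn` callable (a stand-in for any estimator that returns a log-space PhyFPS prediction) are all hypothetical.

```python
import math

def window_starts(num_frames, window, stride):
    """Start indices of all full windows that fit inside the clip."""
    if window > num_frames:
        raise ValueError("window is longer than the clip")
    return list(range(0, num_frames - window + 1, stride))

def ensemble_phyfps(frames, estimate_fn, window=32, stride=16):
    """Average per-window log-space predictions and map back to FPS.

    `estimate_fn` is a hypothetical callable taking a list of frames and
    returning a log-space PhyFPS prediction. Returns the clip-level
    estimate and the per-window estimates, both in FPS.
    """
    starts = window_starts(len(frames), window, stride)
    log_preds = [estimate_fn(frames[s:s + window]) for s in starts]
    per_window = [math.exp(y) for y in log_preds]
    clip_level = math.exp(sum(log_preds) / len(log_preds))
    return clip_level, per_window
```

Under these assumptions, a 128-frame clip yields seven overlapping windows at length 32 and stride 16 but only one window at length 128, which illustrates why very long contexts leave little to ensemble and hide local fluctuations in the per-window estimates.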

The qualitative illustration associated with the benchmark shows a soccer-ball juggling action captured at 60, 24, and 12 PhyFPS, with the model said to recover the absolute time scale and maintain remarkable temporal stability across the entire sequence. In the paper’s logic, this supports the claim that the estimator is not merely regressing toward a dataset average (Gao et al., 15 Mar 2026).

6. Position within adjacent benchmark literature

PhyFPS-Bench-Real is specifically a benchmark for ground-truth Physical FPS estimation from real videos. It is therefore narrower than physics-aware editing benchmarks and different in objective from real-world scientific forecasting benchmarks. The paper most directly adjacent in topic is "PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing," which evaluates instruction-guided image editing under physically grounded transformations such as falling, splashing, deformation, fracture, diffusion, and phase change, using 238 real-world instances and 35 synthetic Anti-Physics instances (Guo et al., 25 Jun 2026). That benchmark addresses physical plausibility in edited images, whereas PhyFPS-Bench-Real addresses physical time-scale recovery from video dynamics.

A different neighboring line is represented by "RealPDEBench: A Benchmark for Complex Physical Systems with Real-World Data," which targets predicting the future evolution of complex physical systems under a sim-to-real regime using five datasets, three tasks, and real-world validation/test data (Hu et al., 5 Jan 2026). RealPDEBench is organized around spatiotemporal field forecasting and physics-oriented metrics such as fRMSE, FE, KE, and MVPE, whereas PhyFPS-Bench-Real is organized around temporal-scale estimation with MAE and MAPE on real clips.

These comparisons clarify the scope boundary. PhyFPS-Bench-Real is not a benchmark for editing quality, physical plausibility scoring, or PDE rollout forecasting. Its specific technical object is whether motion alone reveals a stable and correct real-world time scale, and whether an estimator can recover that quantity on a 4,000-clip cross-source test set (Gao et al., 15 Mar 2026).

In the paper’s overall argument, that makes PhyFPS-Bench-Real the empirical bridge between the concept of chronometric hallucination and the audit of generated videos. The workflow is explicit: build a benchmark of real videos with trustworthy physical time labels, show that Visual Chronometer predicts those labels accurately, then use the validated predictor to analyze generated videos. The benchmark’s headline result (VC-Common: 3.46 MAE and 9% MAPE) is what licenses the later claim that state-of-the-art generators suffer from severe PhyFPS misalignment and temporal instability (Gao et al., 15 Mar 2026).