PhyFPS-Bench-Gen: Temporal Calibration in Videos

Updated 4 July 2026

PhyFPS-Bench-Gen is a benchmark that evaluates physical frame-rate fidelity by comparing a video's nominal meta FPS with its intrinsic motion speed.
It employs a prompt-based evaluation with clip-level predictions from Visual Chronometer to assess meta-vs-PhyFPS alignment and both intra- and inter-video temporal stability.
Empirical results reveal widespread temporal miscalibration among modern video generators, highlighting the need for corrective re-timing to enhance perceptual naturalness.

PhyFPS-Bench-Gen is a benchmark for auditing whether state-of-the-art video generators produce motion at a physically consistent time scale rather than merely visually smooth frame-to-frame transitions. Introduced alongside Visual Chronometer in "The Pulse of Motion: Measuring Physical Frame Rate from Visual Dynamics" (Gao et al., 15 Mar 2026), it targets what the authors term chronometric hallucination: the failure mode in which generated clips appear realistic while the actual motion speed implied by their visual dynamics does not match the nominal FPS metadata attached to the file. The benchmark is therefore centered on physical frame-rate fidelity, with three explicit evaluation targets: Meta-vs-PhyFPS alignment, intra-video temporal stability, and inter-video temporal stability.

1. Conceptual basis and evaluation target

PhyFPS-Bench-Gen was created to address a blind spot in existing video-generation evaluation. Prior suites such as VBench and WorldScore assess temporal consistency, action alignment, and perceptual naturalness, but do not ask whether a generated clip adheres to a stable, correct physical frame rate. PhyFPS-Bench-Gen fills that gap by testing three things: whether a model’s nominal FPS matches the intrinsic speed of its motion, whether that intrinsic speed remains stable across local windows of a single clip, and whether the same model produces similar intrinsic speeds across prompts and outputs (Gao et al., 15 Mar 2026).

The benchmark is organized around the distinction between meta FPS and Physical Frames Per Second (PhyFPS). Meta FPS is the nominal saved FPS taken from official documentation or from output metadata. PhyFPS is defined as the true frame rate implied by the visual motion itself, namely the time scale that best matches the real-world passage of time encoded in the dynamics. This distinction is central because current video generators are commonly trained on heterogeneous internet video corpora in which time scale is not explicitly modeled. A consequence, as formulated in the source paper, is that the same system may generate outputs whose intrinsic motion is effectively slower or faster than the container metadata suggests.

A common misconception addressed by the benchmark is that fluent motion implies correct timing. PhyFPS-Bench-Gen is designed specifically to expose the case where models are “slow but smooth”: motion appears coherent locally, yet is globally under-speeded relative to real-world dynamics. In this sense, the benchmark evaluates temporal calibration rather than generic smoothness.

2. Benchmark construction, prompt design, and model coverage

PhyFPS-Bench-Gen is a prompt-based evaluation suite rather than a web-collected generated-video dataset. The authors generate videos from a fixed prompt set using a wide range of modern text-to-video systems, then apply Visual Chronometer to estimate the PhyFPS of each output (Gao et al., 15 Mar 2026).

The prompt set contains 100 text-to-video prompts. These prompts are curated to avoid explicit speed cues such as “slow motion,” “time-lapse,” and “speed up.” Each prompt must include at least one clearly dynamic event so that motion speed is observable. The prompt set is balanced across five axes:

Primary entity: human, animal, vehicle, nature
Motion type: articulated, rigid-body, fluid, multi-agent
Camera behavior: static, pan, tracking
Environmental effects: rain, fire, wind
Scene context: indoor, urban, nature
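
The speed-cue constraint can be illustrated with a simple text filter. This is a minimal sketch, not the authors' curation procedure: the cue list contains only the three phrases quoted above, and the function name `has_speed_cue` and the example prompts are hypothetical.

```python
import re

# Only the three cues quoted in the text; the full list used for curation
# is not given in the source, so this tuple is illustrative.
SPEED_CUES = ("slow motion", "time-lapse", "speed up")


def _normalize(text: str) -> str:
    """Lowercase and collapse hyphens/whitespace runs to single spaces."""
    return re.sub(r"[-\s]+", " ", text.lower()).strip()


def has_speed_cue(prompt: str) -> bool:
    """Return True if the prompt contains any listed speed cue."""
    normalized = _normalize(prompt)
    return any(_normalize(cue) in normalized for cue in SPEED_CUES)


# Hypothetical prompts, not taken from the benchmark.
prompts = [
    "A horse gallops across a beach at sunset",
    "A time-lapse of clouds rolling over a city",
    "Rain falls on a busy street, shot in slow-motion",
]
kept = [p for p in prompts if not has_speed_cue(p)]
```

Normalizing hyphens and spacing lets the filter catch variants such as "slow-motion" and "time lapse" without listing each spelling separately.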

The evaluated model set spans both open- and closed-source systems.

Open-source models:

Wan2.1-1.3B
Wan2.1-14B
Wan2.2-5B
Wan2.2-14B
LTX-Video
LTX-2
CogVideoX-2B
CogVideoX-5B
HunyuanVideo
InfinityStar (5s)
InfinityStar (10s)

Closed-source models:

Veo-3.1-Fast
Sora-2
Grok-Imagine-T2V
Kling-o3
Seedance-1.0-Lite
Seedance-1.5-Pro

All models are run with their default settings. The use of default settings is methodologically significant because it makes the benchmark an audit of deployed or recommended operating conditions rather than of hand-tuned special cases. A plausible implication is that the reported errors are intended to reflect the temporal calibration that practitioners are likely to encounter in ordinary use.

3. Measurement pipeline and the role of Visual Chronometer

The measurement engine underlying PhyFPS-Bench-Gen is Visual Chronometer, a regressor trained to infer $\log(\text{PhyFPS})$ directly from raw video frames (Gao et al., 15 Mar 2026). The model uses a VideoVAE+ backbone with an attention-based pooling head, and outputs a scalar $\hat{s}$ interpreted as the predicted logarithmic physical frame rate.

Its training objective is log-space MSE:

$\mathcal{L}_{\log} \;=\; \frac{1}{n}\sum_{i=1}^{n} \left( \log y_i - \hat{s}_i \right)^2,$

where $y_i$ is the ground-truth PhyFPS and $\hat{s}_i = \log \hat{y}_i$. The source paper notes that target FPS values are strictly positive, with $y_i \ge 2$, so the usual MSLE $+1$ offset is omitted.
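
The objective can be sketched in a few lines of plain Python. This is an illustration of the formula only, not the authors' training code; the natural logarithm is an assumption (the source does not state the base), and the function name `log_mse` and the toy values are hypothetical.

```python
import math


def log_mse(y_true, s_pred):
    """Mean squared error between log(y_true) and predicted log-FPS s_pred."""
    if len(y_true) != len(s_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((math.log(y) - s) ** 2 for y, s in zip(y_true, s_pred)) / len(y_true)


# Toy values: one exact prediction, one 2x overestimate, one 2x underestimate.
y_true = [24.0, 30.0, 60.0]
y_pred = [24.0, 60.0, 30.0]
loss = log_mse(y_true, [math.log(y) for y in y_pred])
```

In log space a 2x overestimate and a 2x underestimate contribute the same penalty, $(\log 2)^2$, regardless of the absolute frame rate.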

For PhyFPS-Bench-Gen specifically, the model variant used on generated videos is VC-Common. Given a generated video $v$, the benchmark extracts overlapping clips of length $T=32$ frames with stride $s=4$, predicts PhyFPS for each clip, and then aggregates these predictions into video-level and model-level estimates:

$\bar{y}_i \;=\; \frac{1}{K_i}\sum_{j=1}^{K_i} \hat{y}_{i,j}, \qquad \bar{y} \;=\; \frac{1}{N}\sum_{i=1}^{N} \bar{y}_i,$

where $\hat{y}_{i,j}$ is the predicted PhyFPS for clip $j$ in video $i$, $K_i$ is the number of clips in video $i$, and $N$ is the number of videos.

This clipwise protocol is important because PhyFPS-Bench-Gen is not limited to a single global estimate per video. It is explicitly designed to detect local fluctuations in effective motion speed, and therefore to separate global miscalibration from within-video instability.
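
The windowing and two-level averaging described above can be sketched as follows. This is a minimal sketch under stated assumptions: trailing frames that do not fill a whole clip are dropped, and clip predictions are averaged arithmetically in FPS rather than in log space; neither detail is specified in the text. The function names are hypothetical.

```python
from statistics import fmean


def clip_starts(num_frames, clip_len=32, stride=4):
    """Start indices of overlapping clips that fit fully inside the video."""
    if num_frames < clip_len:
        return []
    return list(range(0, num_frames - clip_len + 1, stride))


def aggregate(clip_preds_per_video):
    """Average clip predictions per video, then average the video means."""
    video_means = [fmean(preds) for preds in clip_preds_per_video if preds]
    return video_means, fmean(video_means)


# A 48-frame video yields five 32-frame clips at stride 4.
starts = clip_starts(48)

# Toy clip-level PhyFPS predictions for two videos (not benchmark data).
video_means, model_mean = aggregate([[30.0, 34.0], [40.0, 40.0, 46.0]])
```

Averaging per video first gives every video equal weight in the model-level estimate, regardless of how many clips it contributes.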

4. Alignment and stability metrics

PhyFPS-Bench-Gen reports two families of metrics: alignment metrics, which compare intrinsic motion speed against nominal metadata, and stability metrics, which quantify variation within and across outputs (Gao et al., 15 Mar 2026).

For Meta-vs-PhyFPS alignment, the benchmark reports Avg. Error and Pct. Error:

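$\text{Avg. Error} \;=\; \frac{1}{N}\sum_{i=1}^{N} \left| \bar{y}_i - f_{\text{meta}} \right|, \qquad \text{Pct. Error} \;=\; \frac{\text{Avg. Error}}{f_{\text{meta}}} \times 100\%,$

where $\bar{y}_i$ is the predicted PhyFPS of video $i$, $N$ is the number of videos, and $f_{\text{meta}}$ is the model's meta FPS.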

For temporal stability, it reports Inter CV and Intra CV:

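$\text{Inter CV} \;=\; \frac{\sigma_{\text{inter}}}{\bar{y}}, \qquad \text{Intra CV} \;=\; \frac{1}{N}\sum_{i=1}^{N} \frac{\sigma_i}{\bar{y}_i},$

where $\bar{y}_i$ and $\sigma_i$ are the mean and standard deviation of the clip-level predictions within video $i$, and $\bar{y}$ and $\sigma_{\text{inter}}$ are the mean and standard deviation of the video-level means $\bar{y}_1, \dots, \bar{y}_N$.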

These metrics make the benchmark diagnostically sharper than evaluations that collapse all temporal behavior into a single realism score. A model may be stable but miscalibrated, or approximately aligned on average yet unstable across windows. The benchmark’s structure is expressly intended to distinguish these cases.
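
One plausible reading of the four metrics can be written out in plain Python. This is a sketch under assumptions, not the authors' reference implementation: errors are computed per video against the meta FPS, and the coefficient of variation uses the population standard deviation. The function names and toy values are hypothetical.

```python
from statistics import fmean, pstdev


def alignment(video_means, meta_fps):
    """Avg. Error (FPS) and Pct. Error (%) against the nominal meta FPS."""
    avg_err = fmean([abs(m - meta_fps) for m in video_means])
    return avg_err, 100.0 * avg_err / meta_fps


def intra_cv(clip_preds_per_video):
    """Mean over videos of std/mean of the clip predictions in each video."""
    return fmean([pstdev(p) / fmean(p) for p in clip_preds_per_video])


def inter_cv(video_means):
    """std/mean of the video-level estimates across videos."""
    return pstdev(video_means) / fmean(video_means)


# Toy clip predictions for two videos from one model (not benchmark data).
clips = [[20.0, 30.0], [40.0, 60.0]]
means = [fmean(p) for p in clips]          # video-level estimates
avg_err, pct_err = alignment(means, meta_fps=25.0)
intra = intra_cv(clips)
inter = inter_cv(means)
```

In this toy case the first video is perfectly aligned with the meta FPS on average yet still fluctuates internally, which is the kind of distinction the separate alignment and stability metrics are meant to surface.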

5. Empirical results on contemporary video generators

The benchmark’s principal quantitative results are reported as Meta FPS, predicted PhyFPS, Avg. Error, Pct. Error, Intra CV, and Inter CV. Representative values from Table 1 show substantial divergence between nominal and intrinsic frame rates across both open- and closed-source systems (Gao et al., 15 Mar 2026).

Model	Meta FPS $\rightarrow$ PhyFPS	Key error summary
CogVideoX-2B	24 $\rightarrow$ 33.64	Avg. Error 12.46; Pct. Error 52%
CogVideoX-5B	24 $\rightarrow$ 38.26	Avg. Error 17.96; Pct. Error 75%
HunyuanVideo	24 $\rightarrow$ 35.89	Avg. Error 13.82; Pct. Error 58%
Wan2.1-T2V-1.3B	24 $\rightarrow$ 26.28	Avg. Error 7.54; Pct. Error 31%
LTX-Video	24 $\rightarrow$ 46.52	Avg. Error 23.67; Pct. Error 99%
LTX-2	25 $\rightarrow$ 39.77	Avg. Error 15.70; Pct. Error 63%
InfinityStar (5s)	16 $\rightarrow$ 34.41	Avg. Error 18.46; Pct. Error 115%
InfinityStar (10s)	16 $\rightarrow$ 36.15	Avg. Error 20.19; Pct. Error 126%
Seedance-1.0-Lite	24 $\rightarrow$ 28.60	Avg. Error 8.31; Pct. Error 35%
Seedance-1.5-Pro	24 $\rightarrow$ 33.69	Avg. Error 10.67; Pct. Error 44%
Sora-2	30 $\rightarrow$ 36.21	Avg. Error 8.40; Pct. Error 28%
Grok-Imagine-T2V	24 $\rightarrow$ 36.97	Avg. Error 13.97; Pct. Error 58%
Kling-o3	24 $\rightarrow$ 30.04	Avg. Error 9.10; Pct. Error 38%
Veo-3.1-Fast	24 $\rightarrow$ 35.83	Avg. Error 13.62; Pct. Error 57%

Several findings are emphasized. First, pervasive misalignment: most generators do not match their nominal FPS, and predicted PhyFPS is often markedly higher than meta FPS. Second, closed-source models are only somewhat better: the paper reports that they slightly outperform open-source models in absolute alignment, with average errors below 14 FPS and percentage errors under 60%, but this does not eliminate the problem. Third, temporal instability remains widespread: both intra-video and inter-video CV remain nontrivial even for commercial systems.

The paper also identifies an important special case. LTX-Video and LTX-2 show good stability metrics but high absolute mismatch. The authors explicitly suggest that this pattern may indicate miscalibrated metadata rather than unstable internal representations, and note that LTX-Video might simply need its meta FPS adjusted from 24 to about 46.5. This observation is significant because it shows that PhyFPS-Bench-Gen is not reducible to a penalty on instability; it can also diagnose systematic calibration error.

Across almost all models, predicted PhyFPS exceeds the assigned meta FPS. The authors interpret this as a general trend toward under-speeded outputs, namely videos that look smooth but, when played at their meta FPS, show motion slower than real time. In the vocabulary of the paper, this is the characteristic signature of “slow but smooth” generation.
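
The gap between the two rates can also be read as a playback-speed ratio: if frames that sample the world at PhyFPS $p$ are shown at meta FPS $m$, motion appears $p/m$ times slower than real time. The sketch below applies this ratio to three model-level rows from the table above; the ratio is an illustrative reading rather than a metric reported by the paper, and the function name `slowdown_factor` is hypothetical.

```python
def slowdown_factor(phy_fps, meta_fps):
    """How many times slower than real time the motion appears at meta_fps."""
    return phy_fps / meta_fps


# (meta FPS, predicted PhyFPS) pairs copied from the results table.
table = {
    "CogVideoX-2B": (24, 33.64),
    "LTX-Video": (24, 46.52),
    "Sora-2": (30, 36.21),
}
factors = {
    name: round(slowdown_factor(phy, meta), 2)
    for name, (meta, phy) in table.items()
}
```

A factor near 1.0 would indicate calibrated timing; values above 1.0 correspond to the slow-motion bias described above.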

6. Validation against real data and practical remediation

PhyFPS-Bench-Gen is paired with PhyFPS-Bench-Real, a companion benchmark used to validate whether Visual Chronometer estimates true PhyFPS rather than merely learning generator-specific artifacts (Gao et al., 15 Mar 2026). PhyFPS-Bench-Real contains 4,000 verified test clips, is drawn from sources curated so that meta FPS = PhyFPS for the training data, and uses a strict cross-source split in which train, validation, and test sets come from disjoint video sources.

On this real-data benchmark, the paper reports:

VC-Common: Avg Pred 39.20, MAE 3.46, MAPE 9%
VC-Wide: Avg Pred 45.48, MAE 7.76, MAPE 21%

The average ground-truth PhyFPS of the test set is 38.81. The paper also compares against video-based and image-based VLM baselines, which perform substantially worse; examples given are Gemini-3.1-Pro video-based with MAE 21.67 and MAPE 43%, Seed-1.6-Flash video-based with MAE 20.00 and MAPE 40%, and Qwen3.5+ video-based with MAE 45.54 and MAPE 91%. This validation matters because it supports the interpretation that PhyFPS-Bench-Gen is measuring a real temporal property rather than an arbitrary proxy.
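
For reference, MAE and MAPE can be computed from paired ground-truth and predicted PhyFPS values as sketched below. These are the standard definitions, assumed here rather than quoted from the paper; the function name `mae_mape` and the toy values are hypothetical.

```python
def mae_mape(y_true, y_pred):
    """Return (MAE in FPS, MAPE in percent) over paired values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    abs_errs = [abs(p - t) for t, p in zip(y_true, y_pred)]
    mae = sum(abs_errs) / len(abs_errs)
    # Each absolute error is scaled by its own ground-truth value.
    mape = 100.0 * sum(e / t for e, t in zip(abs_errs, y_true)) / len(y_true)
    return mae, mape


# Toy values, not benchmark data.
mae, mape = mae_mape([24.0, 30.0, 60.0], [27.0, 30.0, 54.0])
```

Because MAPE normalizes each error by its ground truth, a 3 FPS miss on a 24 FPS clip counts for more than a 6 FPS miss on a 60 FPS clip.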

The benchmark also has a practical post-processing use. The authors estimate a generated video’s effective PhyFPS using VC-Common and then retime the video to better match that intrinsic speed. They evaluate three variants in a human study:

Original: unmodified generator output
Pred: globally corrected to the video’s average predicted PhyFPS
Pred Dyn: locally corrected segment by segment using clipwise predictions

The user study comprises 1,490 pairwise comparisons, involves more than 15 participants, uses Bradley–Terry analysis, and reports 90% confidence intervals via bootstrapping. The resulting preferences are:

Original: 19.0%
Pred: 44.2%
Pred Dyn: 36.9%
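
For readers unfamiliar with Bradley–Terry analysis, the sketch below fits strengths from pairwise win counts using the classic minorization-maximization (MM) update. It is a generic illustration, not the authors' analysis code: the win counts are invented toy numbers and are not the study data, and the function name `bradley_terry` is hypothetical.

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from wins[i][j] = times i beat j."""
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # MM update: wins divided by comparison counts weighted by
            # the current strengths of both items in each pair.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom)
        norm = sum(new_p)
        p = [x / norm for x in new_p]  # normalize so strengths sum to 1
    return p


# Toy counts (hypothetical). Order: Original, Pred, Pred Dyn.
wins = [
    [0, 15, 20],
    [35, 0, 30],
    [30, 20, 0],
]
strengths = bradley_terry(wins)
```

The fitted strengths sum to one, and under the model the probability that item $i$ is preferred over item $j$ is $p_i / (p_i + p_j)$.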

Both corrected variants are significantly preferred over the original outputs. The source paper further notes that the global correction is preferred over the dynamic local correction, attributing this to the fact that changing playback rate within one short sequence may introduce perceptual jitter, whereas a constant corrected rate feels smoother and more natural.
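
The difference between the global and dynamic corrections can be sketched in terms of frame presentation times. This is a minimal illustration that assumes retiming means changing the playback rate of the existing frames without interpolation; the source does not describe the implementation. The function names and numbers are hypothetical.

```python
def global_timestamps(num_frames, fps):
    """Presentation time (s) of each frame when played at a constant fps."""
    return [k / fps for k in range(num_frames)]


def dynamic_timestamps(segment_lengths, segment_fps):
    """Presentation times when each segment of frames has its own fps."""
    times, t = [], 0.0
    for length, fps in zip(segment_lengths, segment_fps):
        for _ in range(length):
            times.append(t)
            t += 1.0 / fps  # frame spacing changes at segment boundaries
    return times


# A 120-frame clip tagged at 24 FPS lasts 5 s; re-tagged at a predicted
# PhyFPS of 40 it lasts 3 s.
original_duration = 120 / 24.0
corrected_duration = 120 / 40.0

# Two 2-frame segments predicted at 20 and 40 FPS.
dyn = dynamic_timestamps([2, 2], [20.0, 40.0])
```

In the dynamic variant the spacing between frames changes mid-clip, which is consistent with the authors' explanation that local rate changes may be perceived as jitter.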

PhyFPS-Bench-Gen therefore has a dual role. It is an evaluation instrument for temporal calibration and stability, and it is also a remediation instrument that provides the measurements needed to retime generated videos in a way that improves perceived naturalness. This suggests a broader significance: if video generators are to function as world models, temporal fidelity must be treated as a first-class evaluation target alongside appearance, semantic alignment, and generic temporal consistency.

References

Gao et al. The Pulse of Motion: Measuring Physical Frame Rate from Visual Dynamics (15 Mar 2026)
