Direct Motion Models for Assessing Generated Videos (2505.00209v1)

Published 30 Apr 2025 in cs.CV and cs.LG

Abstract: A current limitation of video generative video models is that they generate plausible looking frames, but poor motion -- an issue that is not well captured by FVD and other popular methods for evaluating generated videos. Here we go beyond FVD by developing a metric which better measures plausible object interactions and motion. Our novel approach is based on auto-encoding point tracks and yields motion features that can be used to not only compare distributions of videos (as few as one generated and one ground truth, or as many as two datasets), but also for evaluating motion of single videos. We show that using point tracks instead of pixel reconstruction or action recognition features results in a metric which is markedly more sensitive to temporal distortions in synthetic data, and can predict human evaluations of temporal consistency and realism in generated videos obtained from open-source models better than a wide range of alternatives. We also show that by using a point track representation, we can spatiotemporally localize generative video inconsistencies, providing extra interpretability of generated video errors relative to prior work. An overview of the results and link to the code can be found on the project page: http://trajan-paper.github.io.

Summary

  • The paper introduces TRAJAN, a transformer-based autoencoder that models 2D video motion using point tracks to capture temporal inconsistencies.
  • It demonstrates superior sensitivity to temporal artifacts compared to traditional appearance-based metrics through rigorous experiments.
  • TRAJAN offers practical evaluation metrics like the Average Jaccard score and latent comparisons, aligning closely with human judgments of motion quality.

This paper introduces TRAJAN (TRAJectory AutoeNcoder) (2505.00209), a novel approach for evaluating the motion quality of generated videos. The authors argue that existing metrics like Fréchet Video Distance (FVD) are often biased towards frame-level appearance and fail to adequately capture temporal inconsistencies and unrealistic motion, a common failure mode of current video generative models. Human judgment is the current gold standard but is expensive and not scalable for frequent evaluation during model development.

TRAJAN addresses this gap by directly modeling 2D video motion using point tracks. The core idea is to extract low-level, temporally extended motion features from videos using a pre-trained point tracking model (specifically, they use BootsTAPIR). These point tracks, which represent the trajectory of specific points across frames, inherently focus on motion rather than pixel-level appearance.

The TRAJAN model is a transformer-based autoencoder designed to process sets of these point tracks; a schematic code sketch of the architecture follows this list.

  • Point Track Representation: A point track for a specific point j across time t is represented by its (x_{t,j}, y_{t,j}) coordinates and an occlusion flag o_{t,j} indicating whether the point is visible at that time. A video is represented as an orderless set of such tracks.
  • Encoder: The encoder takes this set of point tracks and maps it to a fixed-size latent representation ϕ_S. It uses sinusoidal embeddings for the spatio-temporal coordinates, adds a learned "readout" token for each track, and employs self-attention masked by the occlusion flags to make the representation invariant to occlusions. A Perceiver-style cross-attention mechanism then aggregates information from all individual track representations into a set of 128 latent tokens, which are finally projected to a 128 × 64 dimensional representation ϕ_S. The design ensures permutation invariance to the order of the input tracks.
  • Decoder: The decoder takes the latent ϕ_S and a query point (x_q, y_q, t_q) and aims to reconstruct the full track (x_t^q, y_t^q, o_t^q) that passes through the query point at time t_q. This allows the model to learn a dense motion field, enabling it to reconstruct tracks even if the query point was not part of the input tracks used to derive ϕ_S. The decoder uses an up-projection of ϕ_S, adds a positional encoding for the query point, applies attention, and linearly projects to predict positions and occlusion probabilities.
  • Training: TRAJAN is trained to reconstruct held-out point tracks that were not included in the encoder's input, using a Huber loss for positions and sigmoid cross-entropy for occlusion. This objective encourages the latent space to capture the underlying dense motion field independently of the specific sampled input points. Training is performed on a large dataset of real videos.
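
To make the data flow concrete, below is a minimal PyTorch sketch of a TRAJAN-style point-track autoencoder. It is not the authors' implementation: a masked mean stands in for the sinusoidal encodings, readout tokens, and occlusion-masked self-attention, a single cross-attention layer stands in for the Perceiver-style aggregation, and the decoder queries an (x, y) location at each timestep rather than a single (x_q, y_q, t_q) point. All module names and hidden sizes other than the 128 × 64 latent are illustrative assumptions.

```python
# Minimal PyTorch sketch of a TRAJAN-style point-track autoencoder (NOT the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TrackAutoencoderSketch(nn.Module):
    def __init__(self, num_latents=128, latent_dim=64, hidden=256):
        super().__init__()
        self.point_embed = nn.Linear(3, hidden)   # embeds (x, y, t); stand-in for sinusoidal encodings
        self.latents = nn.Parameter(torch.randn(num_latents, hidden))  # learned latent tokens
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.to_latent = nn.Linear(hidden, latent_dim)                 # -> 128 x 64 code phi_S
        self.query_embed = nn.Linear(3, hidden)   # embeds a decoder query (x, y, t)
        self.from_latent = nn.Linear(latent_dim, hidden)               # up-projection of phi_S
        self.dec_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.head = nn.Linear(hidden, 3)          # per-timestep (x, y, occlusion logit)

    def encode(self, tracks, visible):
        # tracks: (B, N, T, 3) holding (x, y, t); visible: (B, N, T) boolean.
        B, N, T, _ = tracks.shape
        emb = self.point_embed(tracks)                                  # (B, N, T, H)
        w = visible.float().unsqueeze(-1)
        track_tokens = (emb * w).sum(2) / w.sum(2).clamp(min=1.0)       # masked mean per track
        q = self.latents.unsqueeze(0).expand(B, -1, -1)                 # (B, 128, H)
        agg, _ = self.cross_attn(q, track_tokens, track_tokens)         # aggregate all tracks
        return self.to_latent(agg)                                      # phi_S: (B, 128, 64)

    def decode(self, phi, query_xy, times):
        # query_xy: (B, Q, 2) query locations; times: (T,) normalized timestamps.
        B, Q, _ = query_xy.shape
        T = times.shape[0]
        qp = query_xy.unsqueeze(2).expand(B, Q, T, 2)
        tt = times.view(1, 1, T, 1).expand(B, Q, T, 1)
        q = self.query_embed(torch.cat([qp, tt], dim=-1)).reshape(B, Q * T, -1)
        kv = self.from_latent(phi)                                      # (B, 128, H)
        ctx, _ = self.dec_attn(q, kv, kv)
        out = self.head(ctx).reshape(B, Q, T, 3)
        return out[..., :2], out[..., 2]                                # positions, occlusion logits


def reconstruction_loss(pred_xy, occ_logit, gt_xy, gt_visible):
    # Huber loss on positions where the ground-truth point is visible, plus
    # sigmoid cross-entropy on the occlusion flag, mirroring the training objective.
    mask = gt_visible.float()
    pos = F.huber_loss(pred_xy, gt_xy, reduction="none").mean(-1)
    pos = (pos * mask).sum() / mask.sum().clamp(min=1.0)
    occ = F.binary_cross_entropy_with_logits(occ_logit, 1.0 - mask)
    return pos + occ
```

In training, this reconstruction loss would be applied to held-out query tracks that were not fed to the encoder, which is what pushes ϕ_S toward a dense motion field rather than a memorization of the sampled inputs.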

TRAJAN provides several ways to evaluate generated videos:

  1. Distributional Comparison: TRAJAN latent representations (ϕ_S) can be used in metrics like Fréchet Distance or Maximum Mean Discrepancy to compare the motion distribution of a set of generated videos against a reference set (e.g., real videos).
  2. Paired Video Comparison: The L2 distance between the TRAJAN latents (ϕ_S) of a generated video and a corresponding real video can serve as a similarity metric for motion (a code sketch of modes 1 and 2 follows this list).
  3. Per-Video Evaluation: The autoencoder's ability to reconstruct the input point tracks can be used as an ordinal quality score for a single video, without needing a reference. The Average Jaccard (AJ) metric, which measures the positional and occlusion accuracy of the reconstructed tracks relative to the input, is used for this purpose. A lower AJ indicates poorer reconstruction, suggesting motion that is unrealistic relative to the real videos the model was trained on.
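
As a rough illustration of modes 1 and 2, the sketch below computes a Fréchet distance between two sets of latents and an L2 distance between a paired generated/real latent. It assumes each video's ϕ_S (128 × 64) has been flattened into one vector; that flattening choice, and the Gaussian fit behind the Fréchet distance, are assumptions rather than details confirmed by the paper.

```python
# Hypothetical helpers for the distributional and paired comparisons; assumes
# TRAJAN latents have already been computed and flattened to (num_videos, dim).
import numpy as np
from scipy import linalg


def frechet_distance(gen_feats, ref_feats):
    """Fréchet distance between Gaussians fit to generated and reference latents."""
    mu_g, mu_r = gen_feats.mean(0), ref_feats.mean(0)
    cov_g = np.cov(gen_feats, rowvar=False)
    cov_r = np.cov(ref_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_g @ cov_r, disp=False)   # matrix square root of the covariance product
    if np.iscomplexobj(covmean):
        covmean = covmean.real                              # drop tiny imaginary parts from numerics
    diff = mu_g - mu_r
    return float(diff @ diff + np.trace(cov_g + cov_r - 2.0 * covmean))


def paired_latent_distance(gen_feat, ref_feat):
    """L2 distance between the latents of a generated video and its real counterpart."""
    return float(np.linalg.norm(gen_feat - ref_feat))
```

A Maximum Mean Discrepancy between the same sets of latents is the alternative mentioned above when a Gaussian fit is undesirable.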

The paper evaluates TRAJAN against several alternative metrics, including appearance-based metrics (I3D, VideoMAE-v2, MooG prediction error) and motion-based metrics (Motion Histograms, RAFT optical flow warp error).

Experimental Results:

  • Distributional Sensitivity: On the UCF-101 dataset with synthetic temporal distortions, TRAJAN (using Fréchet Distance) is significantly more sensitive to temporal artifacts than appearance-based methods and Motion Histograms. The per-video TRAJAN reconstruction score also shows strong sensitivity.
  • Paired Video Comparison: When comparing WALT-generated videos to real videos, TRAJAN latent distances show lower correlation with pixel-based metrics (PSNR, SSIM) than appearance-based methods (VideoMAE, I3D). This indicates that TRAJAN effectively distinguishes differences in motion even when visual details or scene content differ, which is crucial for evaluating predictive video generation where future frames are inherently uncertain.
  • Per-Video Evaluation and Human Alignment: A detailed human study on videos from the EvalCrafter and VideoPhy datasets was conducted, asking raters about motion consistency, appearance consistency, realism, and interactions. TRAJAN's Average Jaccard score consistently shows the best correlation with human judgments across these categories compared to alternative automated metrics. Point track statistics derived from the initial point tracks (lengths, radii) also correlate well with human perceptions of object and camera speed.
  • Spatiotemporal Localization: A key advantage of TRAJAN is its ability to localize errors. By analyzing the Average Jaccard score per point and per frame, inconsistencies in generated videos can be pinpointed in space and time, providing interpretability into failure modes (e.g., tracking points on a morphing object show low AJ). A minimal sketch of this per-frame and per-point aggregation follows this list.
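
The sketch below illustrates this kind of localization, assuming a per-point, per-frame boolean correctness matrix has already been computed (e.g., whether each reconstructed point lies within a pixel threshold of its input track with a matching visibility flag, as in the Average Jaccard described above).

```python
# Sketch: aggregate per-point, per-frame reconstruction correctness to localize
# when (frames) and where (tracks) the generated motion breaks down.
import numpy as np


def localize_errors(correct, top_k=5):
    """correct: (num_points, num_frames) boolean correctness matrix."""
    per_frame = correct.mean(axis=0)            # temporal profile of reconstruction quality
    per_point = correct.mean(axis=1)            # per-track profile (can be scattered spatially)
    worst_frames = np.argsort(per_frame)[:top_k]
    worst_points = np.argsort(per_point)[:top_k]
    return per_frame, per_point, worst_frames, worst_points
```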

Practical Implementation and Application:

To implement and apply TRAJAN for evaluating generated videos:

  1. Obtain Point Tracks: Utilize a robust point tracking model like BootsTAPIR to extract point trajectories from the generated video. This involves identifying points in the first frame and tracking them across subsequent frames. The output should be a set of (x, y) coordinates and an occlusion flag for each tracked point at each timestep.
  2. Load or Train TRAJAN: Obtain the pre-trained TRAJAN model weights. The paper mentions training on a large dataset of real videos (lifestyle, one-shot videos without cuts, 60fps, 150-frame clips). If a pre-trained model is not available, training requires a substantial collection of real videos and significant computational resources.
  3. Process with TRAJAN:
    • Input: The TRAJAN model takes batches of point tracks as input. The tracks should be formatted as described in the paper (e.g., (x, y, o) for each time step). The model can handle variable clip lengths by effectively masking out points beyond the actual video length.
    • Inference: Pass the point tracks through the TRAJAN model. This will yield the fixed-size latent representation ϕ_S from the encoder and the reconstructed point tracks and occlusion flags from the decoder based on sampled query points.
  4. Calculate Metrics:
    • Per-Video Quality: Calculate the Average Jaccard (AJ) metric between the input point tracks and the reconstructed point tracks output by the decoder. The AJ is averaged over several pixel thresholds (e.g., 1, 2, 4, 8, 16 pixels after resizing to 256x256). A lower AJ implies poorer reconstruction and thus worse motion quality/consistency according to TRAJAN (a code sketch follows this list).
    • Distributional Comparison: Collect the latent vectors ϕ_S for a set of generated videos and a set of reference videos (e.g., real videos). Compute the Fréchet Distance (FD) or Maximum Mean Discrepancy (MMD) between the distributions of these latent vectors. Standard implementations of FD/MMD (assuming a Gaussian distribution for FD) can be used.
    • Paired Comparison: Obtain the latent vectors ϕ_S for a generated video and its corresponding real video, and compute the L2 distance between these two vectors.
  5. Analyze Results: Use the computed metrics to compare different generative models, track training progress, or identify specific low-quality videos. For per-video analysis, the frame-wise and point-wise AJ scores can help localize inconsistencies visually.
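
The sketch below shows how the per-video Average Jaccard from step 4 could be computed, following the TAP-Vid-style Jaccard definition used in the point-tracking literature (a point is a true positive when it is visible in both the input and the reconstruction and lies within the threshold); the released code may differ in details.

```python
# Sketch of the per-video Average Jaccard (AJ) between input and reconstructed
# tracks. Coordinates are assumed to be in pixels after resizing to 256x256.
import numpy as np

THRESHOLDS = (1, 2, 4, 8, 16)  # pixel thresholds listed in step 4


def average_jaccard(gt_xy, gt_vis, pred_xy, pred_vis, thresholds=THRESHOLDS):
    """gt_xy, pred_xy: (N, T, 2) positions; gt_vis, pred_vis: (N, T) visibility booleans."""
    dist = np.linalg.norm(pred_xy - gt_xy, axis=-1)           # (N, T) positional error
    jaccards = []
    for thr in thresholds:
        within = dist <= thr
        tp = np.sum(gt_vis & pred_vis & within)                # visible in both and close enough
        fp = np.sum(pred_vis & ~(gt_vis & within))             # predicted visible but wrong
        fn = np.sum(gt_vis & ~(pred_vis & within))             # input visible but missed
        jaccards.append(tp / max(tp + fp + fn, 1))
    return float(np.mean(jaccards))
```

Keeping the per-point and per-frame ingredients of this computation, rather than summing immediately, is what enables the spatiotemporal localization described earlier.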

Implementation Considerations:

  • Point Tracker Choice: The quality of the motion evaluation heavily relies on the quality of the underlying point tracking. BootsTAPIR is highlighted for its robustness.
  • Computational Cost: Point tracking, especially dense tracking required for generating sufficient input points for TRAJAN, can be computationally expensive. Running the TRAJAN transformer model also requires GPU resources.
  • Data Requirements: Training TRAJAN requires a large and diverse dataset of real videos with realistic motion.
  • Thresholding: For AJ and point track statistics like length/radii, appropriate handling of occluded points is necessary (masking out or using occlusion flags); a small sketch of such occlusion-aware statistics follows this list.
  • Limitations: TRAJAN is trained on real motion patterns. It may not capture subtle or novel failure modes that still result in smooth but physically impossible motion (e.g., objects merging unnaturally), as highlighted by the human study results. The correlation with human judgment, while better than alternatives, is not perfect, and human evaluation itself can be inconsistent.
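
As a small illustration of the thresholding note above, the sketch below computes occlusion-aware per-track statistics (path length and radius); the exact statistics used in the paper may be defined differently.

```python
# Sketch of occlusion-aware track statistics: occluded timesteps are masked out
# before computing a track's path length and its radius around the mean position.
import numpy as np


def track_statistics(xy, visible):
    """xy: (T, 2) positions of one track; visible: (T,) visibility booleans."""
    pts = xy[visible]
    if len(pts) < 2:
        return 0.0, 0.0
    length = float(np.linalg.norm(np.diff(pts, axis=0), axis=1).sum())   # summed step lengths
    radius = float(np.linalg.norm(pts - pts.mean(0), axis=1).max())      # max spread from the mean
    return length, radius
```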

Overall, TRAJAN offers a principled and effective method for evaluating the critical aspect of motion quality in generated videos, providing metrics that align better with human perception than prior approaches and enabling more nuanced analysis through spatiotemporal error localization.
