Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos
Recent trends in video generation have focused on enhancing the quality and temporal coherence of generated content. Unlike static image generation, video generation entails a higher degree of complexity, necessitating not only visual fidelity in individual frames but also seamless temporal continuity across them. The paper by Liu et al. introduces the Fréchet Video Motion Distance (FVMD), a novel evaluation metric specifically designed to measure motion consistency in generated videos.
Background and Motivation
With the advent of advanced generative models, such as diffusion models, the capability to generate high-quality videos has markedly improved. However, the evaluation of these videos has predominantly relied on metrics like FID-VID, FVD, and VBench, which either overlook temporal coherence or fail to capture the complex motion patterns in generated content. For instance, while FVD utilizes an action recognition model to evaluate temporal coherence, it does not prioritize the intricate motion patterns central to tasks like motion-guided video generation. VBench, despite its comprehensive approach, tends to penalize videos with notable dynamic motion. This gap motivates a dedicated metric that directly targets motion consistency.
Proposed Metric: Fréchet Video Motion Distance (FVMD)
The central contribution of the paper is the FVMD metric, which evaluates motion consistency in videos by applying the Fréchet distance to motion features derived from key point tracking.
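Like FID and FVD, FVMD compares feature distributions under a Gaussian approximation: a mean and covariance are fitted to the motion features of the real videos and of the generated videos, and the score is the closed-form Fréchet (2-Wasserstein) distance between the two Gaussians,

$$\mathrm{FVMD} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right),$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the feature statistics of the real and generated sets. What distinguishes FVMD from FID-VID and FVD is therefore not the distance itself but the motion features over which it is computed, described next.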
Methodology
- Motion Feature Extraction:
- Key points in videos are tracked using the PIPs++ model, an advanced key point tracking approach accommodating occlusions and complex movements.
- From the tracked key point positions, velocity fields (first-order differences between consecutive frames) and acceleration fields (second-order differences) are computed to capture changes in motion patterns. These fields offer a detailed representation of motion, encapsulating the physical properties of the generated movements.
- Statistical Representation:
- The computed velocity and acceleration fields are transformed into histograms. Two types of histograms are used: quantized 2D histograms and dense 1D histograms. The latter, inspired by the HOG approach, quantizes the motion vectors based on their magnitudes and angles.
- Distance Calculation:
- The similarity between generated videos and ground-truth videos is measured by the Fréchet distance between the aggregated motion features. Working with statistical histograms makes the comparison robust to the inherent variability of video content. A minimal sketch of the full pipeline is given after this list.
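Putting these steps together, the sketch below illustrates how such a pipeline could be assembled with NumPy and SciPy. It is a minimal illustration rather than the authors' reference implementation: it assumes key point tracks are already available as an array of shape (T, N, 2) (for example, produced by PIPs++), uses a simple magnitude-angle 2D histogram in place of the paper's exact quantized and dense histogram designs, and the bin counts and normalization constants are arbitrary choices.

```python
import numpy as np
from scipy.linalg import sqrtm

def motion_features(tracks, mag_bins=8, ang_bins=8):
    """Histogram motion features from tracked key points.

    tracks: array of shape (T, N, 2) holding the (x, y) positions of
    N key points over T frames (e.g., obtained from a tracker such as
    PIPs++). Returns one normalized histogram per frame and per field
    (velocity, acceleration), stacked into a feature matrix.
    """
    velocity = np.diff(tracks, axis=0)        # (T-1, N, 2): per-frame displacement
    acceleration = np.diff(velocity, axis=0)  # (T-2, N, 2): change in velocity
    feats = []
    for field in (velocity[1:], acceleration):          # align both to T-2 frames
        mag = np.linalg.norm(field, axis=-1)            # (T-2, N) vector magnitudes
        ang = np.arctan2(field[..., 1], field[..., 0])  # (T-2, N) vector angles
        for m, a in zip(mag, ang):
            # 2D histogram over magnitude and angle, flattened to a 1D vector
            hist, _, _ = np.histogram2d(
                m, a, bins=(mag_bins, ang_bins),
                range=[[0.0, m.max() + 1e-8], [-np.pi, np.pi]])
            feats.append(hist.ravel() / (hist.sum() + 1e-8))
    return np.stack(feats)                    # (2*(T-2), mag_bins*ang_bins)

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two feature sets."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # sqrtm can introduce tiny imaginary parts numerically
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(cov_r + cov_g - 2.0 * covmean))
```

In practice, the histogram features would be pooled over many real and generated videos before fitting the Gaussians; for a single pair of track arrays the score would be obtained as `frechet_distance(motion_features(real_tracks), motion_features(gen_tracks))`.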
Empirical Evaluation
The metric was subjected to rigorous validation, encompassing sensitivity analysis and alignment with human judgment.
The metric's capability to detect temporal inconsistencies was tested by injecting various types of noise (e.g., local swaps, global swaps) into real videos. FVMD demonstrated superior sensitivity in capturing these discrepancies, particularly when utilizing combined velocity and acceleration features with dense 1D histograms.
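As a concrete illustration of these corruptions, the sketch below perturbs a list of video frames with local and global frame swaps; the number of swaps and the local offset range are illustrative parameters, not the exact settings used in the paper.

```python
import numpy as np

def local_swap(frames, num_swaps=5, max_offset=2, seed=0):
    """Swap randomly chosen frames with temporally nearby frames."""
    rng = np.random.default_rng(seed)
    frames = list(frames)
    for _ in range(num_swaps):
        i = int(rng.integers(0, len(frames)))
        j = int(np.clip(i + rng.integers(-max_offset, max_offset + 1),
                        0, len(frames) - 1))
        frames[i], frames[j] = frames[j], frames[i]
    return frames

def global_swap(frames, num_swaps=5, seed=0):
    """Swap pairs of frames drawn from anywhere in the video."""
    rng = np.random.default_rng(seed)
    frames = list(frames)
    for _ in range(num_swaps):
        i, j = (int(k) for k in rng.integers(0, len(frames), size=2))
        frames[i], frames[j] = frames[j], frames[i]
    return frames
```

A motion-sensitive metric should degrade steadily as the number of swaps increases, which is the behavior the sensitivity analysis checks for.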
A large-scale human study was conducted to compare FVMD with existing metrics. Over 200 raters evaluated videos generated by different models, and FVMD consistently exhibited a higher correlation with human judgments than FID-VID, FVD, SSIM, PSNR, and VBench, indicating its robustness in reflecting human-perceived video quality.
Implications and Future Work
The introduction of FVMD has several practical and theoretical implications:
- Practical Implications:
FVMD can be employed as a reliable metric for evaluating the quality of videos generated by diverse models. It offers a nuanced approach to assessing temporal coherence, which is pivotal for applications in entertainment, virtual reality, and video editing.
- Theoretical Implications:
The research underscores the importance of motion consistency in video generation and encourages further exploration into physical laws embedded in motion patterns. Future work can aim to refine motion representations, ensuring that generated movements adhere to plausible physical dynamics.
Conclusion
The Fréchet Video Motion Distance presents a substantial advancement in the evaluation of video generative models. By focusing on the intricacies of motion consistency, FVMD fills a critical gap left by previous metrics, aligning more closely with human perception and enhancing the credibility of video quality assessments. Moving forward, integrating more sophisticated motion representations could further elevate the standards of generative video evaluation.