
On the Content Bias in Fréchet Video Distance (2404.12391v1)

Published 18 Apr 2024 in cs.CV, cs.GR, and cs.LG

Abstract: Fréchet Video Distance (FVD), a prominent metric for evaluating video generation models, is known to occasionally conflict with human perception. In this paper, we explore the extent of FVD's bias toward per-frame quality over temporal realism and identify its sources. We first quantify FVD's sensitivity to the temporal axis by decoupling frame quality from motion quality, and find that FVD increases only slightly even under large temporal corruption. We then analyze generated videos and show that, by carefully sampling from a large set of generated videos that contain no motion, one can drastically decrease FVD without improving temporal quality. Both studies point to FVD's bias toward the quality of individual frames. We further observe that this bias can be attributed to the features extracted from a supervised video classifier trained on a content-biased dataset, and show that FVD computed with features from recent large-scale self-supervised video models is less biased toward image quality. Finally, we revisit a few real-world examples to validate our hypothesis.
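For context on the metric under study: FVD fits a Gaussian to feature embeddings of real and generated videos and reports the Fréchet distance between the two Gaussians. The sketch below shows that computation given precomputed feature statistics; it is a minimal illustration, not the paper's implementation, and the function and variable names are our own. The matrix square root is taken via eigendecomposition of a symmetrized product, which is numerically equivalent for the trace term.

```python
import numpy as np

def _sqrtm_psd(a: np.ndarray) -> np.ndarray:
    """Matrix square root of a symmetric positive semi-definite matrix."""
    w, v = np.linalg.eigh(a)
    return v @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ v.T

def frechet_distance(mu1, sigma1, mu2, sigma2) -> float:
    """Squared Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^{1/2}).
    Uses Tr((S1 S2)^{1/2}) = Tr((S1^{1/2} S2 S1^{1/2})^{1/2}) for stability."""
    s1_half = _sqrtm_psd(sigma1)
    cov_mean = _sqrtm_psd(s1_half @ sigma2 @ s1_half)
    return float(
        np.sum((mu1 - mu2) ** 2)
        + np.trace(sigma1) + np.trace(sigma2) - 2.0 * np.trace(cov_mean)
    )
```

In the FVD setting, `mu` and `sigma` would be the mean and covariance of feature vectors extracted by a video network (e.g. I3D in the original FVD, or a self-supervised model as the paper advocates); the paper's point is that the choice of that feature extractor, not the distance itself, introduces the content bias.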

