FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation (2311.01813v3)
Abstract: Recently, open-domain text-to-video (T2V) generation models have made remarkable progress. However, the promising results are mainly shown through qualitative cases of generated videos, while the quantitative evaluation of T2V models still faces two critical problems. Firstly, existing studies lack fine-grained evaluation of T2V models across different categories of text prompts. Although some benchmarks have categorized the prompts, their categorization either focuses only on a single aspect or fails to consider the temporal information in video generation. Secondly, it is unclear whether the automatic evaluation metrics are consistent with human standards. To address these problems, we propose FETV, a benchmark for Fine-grained Evaluation of Text-to-Video generation. FETV is multi-aspect, categorizing the prompts based on three orthogonal aspects: the major content, the attributes to control, and the prompt complexity. FETV is also temporal-aware, introducing several temporal categories tailored to video generation. Based on FETV, we conduct comprehensive manual evaluations of four representative T2V models, revealing their pros and cons on different categories of prompts from different aspects. We also extend FETV as a testbed to evaluate the reliability of automatic T2V metrics. The multi-aspect categorization of FETV enables fine-grained analysis of the metrics' reliability in different scenarios. We find that existing automatic metrics (e.g., CLIPScore and FVD) correlate poorly with human evaluation. To address this problem, we explore several solutions to improve CLIPScore and FVD, and develop two automatic metrics that exhibit significantly higher correlation with humans than existing metrics. Benchmark page: https://github.com/llyx97/FETV.
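To make the reliability analysis above concrete, the sketch below shows a frame-averaged, CLIPScore-style video-text alignment score and the rank correlation between an automatic metric and human ratings over the same prompt-video pairs. This is a minimal sketch, not the paper's improved metrics: the CLIP checkpoint name, the frame-sampling step (left to the caller), and the function names are illustrative assumptions.

```python
# Minimal sketch (assumptions labeled): frame-averaged CLIPScore for a video,
# plus rank correlation between an automatic metric and human ratings.
import torch
from PIL import Image
from scipy.stats import kendalltau, spearmanr
from transformers import CLIPModel, CLIPProcessor

# Checkpoint choice is an assumption; any CLIP variant works the same way.
_MODEL_NAME = "openai/clip-vit-base-patch32"
_model = CLIPModel.from_pretrained(_MODEL_NAME).eval()
_processor = CLIPProcessor.from_pretrained(_MODEL_NAME)


def clip_score_video(frames: list[Image.Image], prompt: str) -> float:
    """CLIPScore (Hessel et al., 2021) averaged over sampled video frames.

    `frames` is assumed to be a list of PIL images, e.g. uniformly sampled
    from the generated video; video decoding is omitted here.
    """
    inputs = _processor(text=[prompt], images=frames,
                        return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        img_feat = _model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_feat = _model.get_text_features(input_ids=inputs["input_ids"],
                                            attention_mask=inputs["attention_mask"])
    # Cosine similarity between each frame and the prompt.
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    cos = (img_feat @ txt_feat.T).squeeze(-1)           # (num_frames,)
    return (2.5 * torch.clamp(cos, min=0)).mean().item()


def metric_human_correlation(metric_scores: list[float],
                             human_ratings: list[float]) -> tuple[float, float]:
    """Sample-level reliability: Spearman's rho and Kendall's tau between an
    automatic metric and human ratings on the same prompt/video pairs."""
    rho, _ = spearmanr(metric_scores, human_ratings)
    tau, _ = kendalltau(metric_scores, human_ratings)
    return rho, tau
```

In this setup, a metric is considered reliable when its scores rank the generated videos similarly to human ratings (high rho/tau); the abstract's finding is that vanilla CLIPScore and FVD rank videos quite differently from humans.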
- Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, pages 1708–1718. IEEE, 2021.
- Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, 2023.
- J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR, pages 4724–4733. IEEE Computer Society, 2017.
- A short note about kinetics-600. CoRR, abs/1808.01340, 2018.
- Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
- Cogview2: Faster and better text-to-image generation via hierarchical transformers. In NeurIPS, 2022.
- An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR. OpenReview.net, 2021.
- Multi-modal transformer for video retrieval. In ECCV (4), volume 12349 of Lecture Notes in Computer Science, pages 214–229. Springer, 2020.
- Latent video diffusion models for high-fidelity long video generation. CoRR, abs/2211.13221, 2022.
- CLIPScore: A reference-free evaluation metric for image captioning. In EMNLP (1), pages 7514–7528. Association for Computational Linguistics, 2021.
- GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, pages 6626–6637, 2017.
- Denoising diffusion probabilistic models. In NeurIPS, 2020.
- Imagen video: High definition video generation with diffusion models. CoRR, abs/2210.02303, 2022a.
- Video diffusion models. In NeurIPS, 2022b.
- Cogvideo: Large-scale pretraining for text-to-video generation via transformers. CoRR, abs/2205.15868, 2022.
- The kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
- Text2video-zero: Text-to-image diffusion models are zero-shot video generators. CoRR, abs/2303.13439, 2023.
- Mutual information divergence: A unified metric for multimodal generative models. In NeurIPS, 2022.
- K. Krippendorff. Content analysis: An introduction to its methodology. 1980.
- Otter: A multi-modal model with in-context instruction tuning. CoRR, abs/2305.03726, 2023a.
- BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, volume 162 of Proceedings of Machine Learning Research, pages 12888–12900. PMLR, 2022.
- Unmasked teacher: Towards training-efficient video foundation models. In ICCV, 2023b.
- Microsoft COCO: common objects in context. In ECCV (5), volume 8693 of Lecture Notes in Computer Science, pages 740–755. Springer, 2014.
- Clip4clip: An empirical study of CLIP for end to end video clip retrieval and captioning. Neurocomputing, 508:293–304, 2022.
- Videofusion: Decomposed diffusion models for high-quality video generation. CoRR, abs/2303.08320, 2023.
- Dreamix: Video diffusion models are general video editors. CoRR, abs/2302.01329, 2023.
- Toward verifiable and reproducible human evaluation for text-to-image generation. In CVPR. IEEE, 2023.
- On aliased resizing and surprising subtleties in GAN evaluation. In CVPR, pages 11400–11410. IEEE, 2022.
- Learning transferable visual models from natural language supervision. In ICML, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 2021.
- High-resolution image synthesis with latent diffusion models. In CVPR, pages 10674–10685. IEEE, 2022.
- Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
- Improved techniques for training GANs. In NIPS, pages 2226–2234, 2016.
- LAION-5B: an open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.
- Make-a-video: Text-to-video generation without text-video data. CoRR, abs/2209.14792, 2022.
- Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In CVPR, pages 3616–3626. IEEE, 2022.
- A short note on the kinetics-700-2020 human action dataset. CoRR, abs/2010.10864, 2020.
- Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 2256–2265. JMLR.org, 2015.
- UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, abs/1212.0402, 2012.
- S. Sterling. zeroscope-v2, June 2023.
- FVD: A new metric for video generation. In DGS@ICLR. OpenReview.net, 2019.
- Neural discrete representation learning. In NIPS, pages 6306–6315, 2017.
- Attention is all you need. In NIPS, pages 5998–6008, 2017.
- Phenaki: Variable length video generation from open domain textual description. CoRR, abs/2210.02399, 2022.
- Modelscope text-to-video technical report. CoRR, abs/2308.06571, 2023.
- GODIVA: generating open-domain videos from natural descriptions. CoRR, abs/2104.14806, 2021.
- Nüwa: Visual synthesis pre-training for neural visual world creation. In ECCV (16), volume 13676 of Lecture Notes in Computer Science, pages 720–736. Springer, 2022a.
- Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. CoRR, abs/2212.11565, 2022b.
- MSR-VTT: A large video description dataset for bridging video and language. In CVPR, pages 5288–5296. IEEE Computer Society, 2016.
- Diffusion probabilistic modeling for video generation. CoRR, abs/2203.09481, 2022.
- Scaling autoregressive models for content-rich text-to-image generation. Trans. Mach. Learn. Res., 2022, 2022.
- Magicvideo: Efficient video generation with latent diffusion models. CoRR, abs/2211.11018, 2022.
Authors: Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, Lu Hou