AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI (2401.01651v3)

Published 3 Jan 2024 in cs.CV and cs.AI

Abstract: The burgeoning field of Artificial Intelligence Generated Content (AIGC) is witnessing rapid advancements, particularly in video generation. This paper introduces AIGCBench, a pioneering, comprehensive, and scalable benchmark designed to evaluate a variety of video generation tasks, with a primary focus on Image-to-Video (I2V) generation. AIGCBench tackles a key limitation of existing benchmarks, their lack of diverse datasets, by including a varied, open-domain image-text dataset on which different state-of-the-art algorithms can be evaluated under equivalent conditions. We employ a novel text combiner and GPT-4 to create rich text prompts, which are then used to generate images via advanced Text-to-Image models. To establish a unified evaluation framework for video generation tasks, our benchmark includes 11 metrics spanning four dimensions: control-video alignment, motion effects, temporal consistency, and video quality. The metrics include both reference-video-dependent and reference-free measures, ensuring a comprehensive evaluation strategy. The proposed evaluation standard correlates well with human judgment, providing insights into the strengths and weaknesses of current I2V algorithms. The findings from our extensive experiments are intended to stimulate further research and development in the I2V field. AIGCBench represents a significant step toward creating standardized benchmarks for the broader AIGC landscape, proposing an adaptable and equitable framework for future assessments of video generation tasks. We have open-sourced the dataset and evaluation code on the project website: https://www.benchcouncil.org/AIGCBench.
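The four evaluation dimensions are concrete enough to sketch. Below is a minimal Python illustration of how metrics in three of them are commonly computed from CLIP image embeddings; the model choice, function names, and the frame-difference motion proxy are assumptions made for illustration, not the benchmark's released implementation (which is available at the project website above).

```python
# Illustrative sketch of AIGCBench-style per-dimension metrics.
# Assumes frames are PIL images; CLIP backbone and thresholds are
# hypothetical choices, not the paper's actual configuration.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_image_features(frames):
    """Embed a list of PIL frames into L2-normalized CLIP space."""
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return F.normalize(feats, dim=-1)

def control_video_alignment(cond_image, frames):
    """Mean cosine similarity between the conditioning image and every
    generated frame (higher = the video adheres more to the input)."""
    ref = clip_image_features([cond_image])   # (1, D)
    gen = clip_image_features(frames)         # (T, D)
    return (gen @ ref.T).mean().item()

def temporal_consistency(frames):
    """Mean cosine similarity of adjacent frame pairs; flicker or
    identity drift lowers this score."""
    feats = clip_image_features(frames)
    return (feats[:-1] * feats[1:]).sum(-1).mean().item()

def motion_strength(frames_tensor):
    """Crude motion proxy: mean absolute difference between adjacent
    frames of a (T, C, H, W) float tensor in [0, 1]. A flow-based
    measure (e.g. RAFT, reference 40) would be a closer analogue to
    the paper's motion-effects dimension."""
    return (frames_tensor[1:] - frames_tensor[:-1]).abs().mean().item()
```

In the abstract's terminology these would all be reference-free metrics; the reference-video-dependent side instead compares generated frames against ground-truth videos, for example with SSIM (reference 43) or flow-based scores.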

References (49)
  1. Frozen in time: A joint video and image encoder for end-to-end retrieval, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1728–1738.
  2. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127.
  3. Align your latents: High-resolution video synthesis with latent diffusion models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22563–22575.
  4. Fast unfolding of communities in large networks. J. Stat. Mech.-Theory Exp. 2008, P10008.
  5. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512.
  6. Civitai. https://civitai.com/ [Accessed 2022].
  7. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems 34, 8780–8794.
  8. Structure and content-guided video synthesis with diffusion models, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7346–7356.
  9. Preserve your own correlation: A noise prior for video diffusion models, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22930–22941.
  10. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709.
  11. Seer: Language instructed video prediction with latent diffusion models. arXiv preprint arXiv:2303.14897.
  12. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725.
  13. Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint arXiv:2211.13221.
  14. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303.
  15. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851.
  16. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868.
  17. A benchmark for controllable text-image-to-video generation. IEEE Transactions on Multimedia.
  18. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. arXiv preprint arXiv:2307.06350.
  19. Vbench: Comprehensive benchmark suite for video generative models. arXiv preprint arXiv:2311.17982.
  20. Dreampose: Fashion image-to-video synthesis via stable diffusion. arXiv preprint arXiv:2304.06025.
  21. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439.
  22. Generative image dynamics. arXiv preprint arXiv:2309.07906.
  23. Evalcrafter: Benchmarking and evaluating large video generation models. arXiv preprint arXiv:2310.11440.
  24. Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation. arXiv preprint arXiv:2311.01813.
  25. Videofusion: Decomposed diffusion models for high-quality video generation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10209–10218.
  26. Conditional image-to-video generation with latent flow diffusion models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18444–18455.
  27. Improved denoising diffusion probabilistic models, in: International Conference on Machine Learning, PMLR, pp. 8162–8171.
  28. OpenAI, 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  29. Pika Labs Discord server. https://www.pika.art/ [Accessed 2023-08-30].
  30. Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR, pp. 8748–8763.
  31. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 5485–5551.
  32. Zero-shot text-to-image generation, in: International Conference on Machine Learning, PMLR, pp. 8821–8831.
  33. High-resolution image synthesis with latent diffusion models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695.
  34. U-net: Convolutional networks for biomedical image segmentation, in: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III 18, Springer, pp. 234–241.
  35. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402.
  36. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792.
  37. Deep unsupervised learning using nonequilibrium thermodynamics, in: International Conference on Machine Learning, PMLR, pp. 2256–2265.
  38. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems 32.
  39. Generative multimodal models are in-context learners. arXiv preprint arXiv:2312.13286.
  40. Raft: Recurrent all-pairs field transforms for optical flow, in: Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, Springer, pp. 402–419.
  41. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571.
  42. Videocomposer: Compositional video synthesis with motion controllability. arXiv preprint arXiv:2306.02018.
  43. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13, 600–612.
  44. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 20144–20154.
  45. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7623–7633.
  46. Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089.
  47. Evaluatology: The science and engineering of evaluation. Technical Report, Institute of Computing Technology, Chinese Academy of Sciences.
  48. I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models. arXiv preprint arXiv:2311.04145.
  49. Pia: Your personalized image animator via plug-and-play modules in text-to-image models. arXiv preprint arXiv:2312.13964.
Authors (4)
  1. Fanda Fan (8 papers)
  2. Chunjie Luo (39 papers)
  3. Jianfeng Zhan (92 papers)
  4. Wanling Gao (47 papers)
Citations (12)