
From Sora What We Can See: A Survey of Text-to-Video Generation (2405.10674v1)

Published 17 May 2024 in cs.CV and cs.AI

Abstract: With impressive achievements made, artificial intelligence is on the path toward artificial general intelligence. Sora, developed by OpenAI and capable of minute-level world-simulative generation, can be considered a milestone on this developmental path. However, despite its notable successes, Sora still encounters various obstacles that need to be resolved. In this survey, we embark from the perspective of disassembling Sora in text-to-video generation and conduct a comprehensive review of the literature, trying to answer the question, "From Sora What We Can See". Specifically, after basic preliminaries regarding the general algorithms are introduced, the literature is categorized along three mutually orthogonal dimensions: evolutionary generators, excellent pursuit, and realistic panorama. Subsequently, the widely used datasets and metrics are organized in detail. Last but most importantly, we identify several challenges and open problems in this domain and propose potential future directions for research and development.

A Survey of Text-to-Video Generation

Introduction

It's exciting to see how far we’ve come in generating videos from text prompts. This field, known as Text-to-Video (T2V) generation, has seen substantial advancements, especially with the emergence of models like OpenAI's Sora. You might be familiar with models that generate images from text, but video generation adds layers of complexity because it must account for temporal coherence—keeping the video smooth and logical over time. Let's dive into a comprehensive survey on this subject as outlined by Rui Sun et al.

Evolutionary Generators

The journey of T2V generation can be largely segmented based on the foundational algorithms: GAN/VAE-based, Diffusion-based, and Autoregressive-based.

GAN/VAE-based

Early works leaned heavily on Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). These models paved the way but had limitations in handling video dynamics effectively. For instance (a sketch of the shared 3D-convolution idea follows the list):

  • TGANs-C integrated 3D convolutions for capturing temporal dynamics and ensuring semantic consistency across frames.
  • GODIVA used a VQ-VAE, trained on HowTo100M, to achieve zero-shot video generation.
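
To make this concrete, here is a minimal, hedged sketch of the idea these GAN-based approaches share: a discriminator built from 3D convolutions judges realism jointly over space and time while conditioning on the caption embedding. This is illustrative only; the class and parameter names are ours, not TGANs-C's actual architecture.

```python
import torch
import torch.nn as nn

class VideoDiscriminator(nn.Module):
    """Scores a clip of shape (B, C, T, H, W) with 3D convolutions, so
    realism is judged jointly over space and time (illustrative only)."""
    def __init__(self, channels=3, text_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(channels, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv3d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool3d(1),
        )
        # Conditioning on the caption embedding is what pushes the
        # generator toward semantic consistency with the text.
        self.head = nn.Linear(128 + text_dim, 1)

    def forward(self, video, text_emb):
        feats = self.conv(video).flatten(1)                     # (B, 128)
        return self.head(torch.cat([feats, text_emb], dim=1))  # real/fake logit
```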

Diffusion-based

Inspired by the success of diffusion models in text-to-image (T2I) tasks, researchers applied similar principles to T2V. Some notable advancements include (a training-step sketch follows the list):

  • VDM improved upon traditional 2D diffusion models by adopting a 3D U-Net architecture.
  • Make-A-Video and Imagen Video capitalized on pre-trained T2I models to enhance motion realism and generate longer videos.
  • STA-DM made further strides in maintaining temporal and spatial coherence within generated videos.
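
The training recipe underneath most of these systems is the standard denoising objective, just applied to a 5D video tensor. Below is a rough sketch of one DDPM-style training step, with the 3D U-Net denoiser left as a stand-in; the names are illustrative, not any paper's actual API.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, x0, text_emb, alphas_cumprod):
    """One DDPM-style step on a video tensor x0 of shape (B, C, T, H, W);
    `denoiser` stands in for a 3D U-Net as in VDM (illustrative only)."""
    B = x0.shape[0]
    # Sample a random timestep per clip and Gaussian noise.
    t = torch.randint(0, len(alphas_cumprod), (B,), device=x0.device)
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1, 1)
    # Forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise.
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    # The denoiser predicts the injected noise, conditioned on t and the caption.
    pred = denoiser(x_t, t, text_emb)
    return F.mse_loss(pred, noise)
```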

Autoregressive-based

Autoregressive transformers have become quite effective for T2V tasks, especially for long-video generation (a decoding-loop sketch follows the list):

  • NUWA-Infinity can generate videos frame by frame, maintaining coherence for extended sequences.
  • Phenaki incorporates a tokenized video representation to handle variable-length video generation, showcasing excellent temporal dynamics.
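
The decoding loop these models share is easy to sketch. In the spirit of Phenaki-style tokenized video (illustrative only, not the released model), a transformer samples one discrete video token at a time, conditioning on the text prompt and everything generated so far; a VQ decoder would then map the token grid back to pixels.

```python
import torch

@torch.no_grad()
def generate_video_tokens(transformer, text_tokens, tokens_per_frame, n_frames):
    """Sample video tokens one at a time, conditioned on the text prompt
    and all previously generated tokens (all components are stand-ins)."""
    seq = text_tokens.clone()                     # (1, L) prompt token ids
    for _ in range(n_frames * tokens_per_frame):
        logits = transformer(seq)[:, -1]          # next-token distribution
        next_tok = torch.multinomial(logits.softmax(-1), 1)
        seq = torch.cat([seq, next_tok], dim=1)   # grow the context
    video_tokens = seq[:, text_tokens.shape[1]:]  # strip the prompt
    # A VQ decoder would map this (n_frames, tokens_per_frame) grid to pixels.
    return video_tokens.view(1, n_frames, tokens_per_frame)
```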

Excellent Pursuit

To achieve superior video generation, models focus on three critical aspects: extended duration, superior resolution, and seamless quality.

Extended Duration

Models like TATS and NUWA-XL show how hierarchical or autoregressive frameworks can generate long-duration videos by maintaining temporal coherence across many frames.
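
NUWA-XL itself is hierarchical ("diffusion over diffusion": keyframes first, then infilling between them), but the simpler autoregressive strategy is easy to sketch: each new chunk is conditioned on the tail of the previous one so motion stays coherent across chunk boundaries. The sketch below is illustrative, with `generate_chunk` standing in for any conditional generator.

```python
def generate_long_video(generate_chunk, prompt, n_chunks, overlap=4):
    """Chain chunks into a long clip; `generate_chunk` is a hypothetical
    stand-in for any conditional generator (autoregressive or diffusion)."""
    frames = list(generate_chunk(prompt, context=None))  # first chunk
    for _ in range(n_chunks - 1):
        context = frames[-overlap:]          # condition on the recent past
        frames.extend(generate_chunk(prompt, context=context))
    return frames
```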

Superior Resolution

Generating high-resolution videos is crucial and challenging. For example, Show-1 uses a hybrid model combining both pixel-based and latent-based diffusion models to upscale videos, achieving high resolution while maintaining quality.
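
A hedged sketch of such a hybrid cascade: a pixel-space diffusion model produces a small but text-faithful base clip, and a cheaper latent-space diffusion model super-resolves it. The three components below are placeholders for whatever models you pair, not Show-1's released API.

```python
def hybrid_t2v(pixel_model, latent_upscaler, vae, prompt):
    # Stage 1: pixel-space diffusion at low resolution, where text
    # alignment tends to be strongest.
    base = pixel_model.sample(prompt, size=(16, 40, 64))  # (T, H, W)
    # Stage 2: super-resolve in latent space, where compute is cheap.
    z = vae.encode(base)
    z_hr = latent_upscaler.sample(prompt, init=z)
    return vae.decode(z_hr)  # final high-resolution clip
```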

Seamless Quality

To enhance frame-to-frame quality and consistency, methods like FLAVR leverage 3D spatio-temporal convolutions for flow-free frame interpolation, making videos not only high in resolution but also fluid and free of obvious artifacts.
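
A toy version of flow-agnostic interpolation in the spirit of FLAVR (illustrative only): a small 3D-convolutional network looks at a window of four frames and predicts the missing frame between the middle two, with no explicit optical flow. Sliding this window over a clip and interleaving the predictions doubles the frame rate.

```python
import torch.nn as nn

class Interpolator(nn.Module):
    """Toy flow-free interpolator: given 4 consecutive frames, predict the
    frame that belongs between the middle two (illustrative only)."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(channels, 64, kernel_size=(3, 3, 3), padding=(1, 1, 1)),
            nn.ReLU(),
            # Collapse the 4-frame time axis to a single output frame.
            nn.Conv3d(64, channels, kernel_size=(4, 3, 3), padding=(0, 1, 1)),
        )

    def forward(self, frames):               # frames: (B, C, 4, H, W)
        return self.net(frames).squeeze(2)   # (B, C, H, W) in-between frame
```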

Realistic Panorama

Several elements are key to making generated videos realistic: dynamic motion, complex scenes, multiple objects, and a rational layout.

Dynamic Motion

Models like AnimateDiff insert dedicated temporal layers into pre-trained image models to handle motion dynamics effectively, ensuring actions within the videos appear natural and coherent over time.
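
AnimateDiff's key design choice is to learn motion in temporal layers inserted between the frozen spatial layers of a pre-trained image model. Here is a hedged sketch of such a layer: self-attention that runs only along the time axis, zero-initialized so training starts from the unchanged image model. Shapes and names are illustrative, not the paper's exact module.

```python
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the time axis only, meant to sit between the
    frozen spatial layers of a pre-trained image model (illustrative)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Zero-init the output projection so the block starts as the
        # identity and the frozen image model is initially unchanged.
        nn.init.zeros_(self.attn.out_proj.weight)
        nn.init.zeros_(self.attn.out_proj.bias)

    def forward(self, x):                    # x: (B, T, H*W, D) frame features
        B, T, N, D = x.shape
        # Fold spatial positions into the batch so attention runs over T only.
        h = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        h = self.norm(h)
        h, _ = self.attn(h, h, h)
        h = h.reshape(B, N, T, D).permute(0, 2, 1, 3)
        return x + h                         # residual connection
```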

Complex Scene

By leveraging LLMs as planners, models like VideoDirectorGPT can generate intricate scenes with multiple interacting elements.

Multiple Objects

Handling multiple objects involves challenges like attribute mixing and object disappearance. Innovations like Detector Guidance (DG) help separate and clarify objects, maintaining their unique characteristics throughout the video.

Rational Layout

Creating a rational layout that adheres to physical and spatial principles is critical. LLM-grounded Video Diffusion (LVD) helps generate structured scene layouts that guide video creation, ensuring the sequence aligns with logical and realistic layouts.
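
A rough sketch of what LLM-grounded layout conditioning can look like (the schema below is ours, not LVD's actual format): the language model first emits per-keyframe bounding boxes for each object, and the video model is then steered, for example via cross-attention masks, to keep each object inside its box.

```python
import numpy as np

# Illustrative layout schema: one dict per keyframe, mapping each object
# to a normalized (x0, y0, x1, y1) bounding box.
layout = [
    {"cat": (0.10, 0.60, 0.30, 0.90), "ball": (0.70, 0.70, 0.80, 0.80)},
    {"cat": (0.20, 0.55, 0.40, 0.85), "ball": (0.55, 0.65, 0.65, 0.75)},
    {"cat": (0.35, 0.50, 0.55, 0.80), "ball": (0.40, 0.60, 0.50, 0.70)},
]

def box_mask(box, h, w):
    """Rasterize a normalized box into a binary mask that can steer the
    cross-attention of the corresponding prompt token toward that region."""
    x0, y0, x1, y1 = box
    mask = np.zeros((h, w), dtype=np.float32)
    mask[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
    return mask

# Usage: masks for the first keyframe at a 64x64 latent resolution.
masks = {name: box_mask(b, 64, 64) for name, b in layout[0].items()}
```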

Datasets and Metrics

The paper also dives into the datasets and evaluation metrics crucial for training and assessing T2V models. Key datasets span domains such as faces, open-domain content, movies, actions, instructions, and cooking. Evaluation metrics range from image-level metrics like PSNR and SSIM to video-specific metrics such as Video Inception Score (Video IS) and Fréchet Video Distance (FVD), ensuring comprehensive quality assessment.
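
Two of these metrics are simple enough to sketch directly. PSNR is an exact image-level formula, and the Fréchet distance below is the core of FVD, which applies it to the statistics of features extracted by a pre-trained video network such as I3D; the implementations here are minimal illustrations.

```python
import numpy as np
from scipy import linalg

def psnr(x, y, max_val=255.0):
    """Peak signal-to-noise ratio between two images/frames (exact formula)."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between Gaussians fitted to real vs. generated
    features: ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)).
    FVD computes this over features from a pre-trained video network."""
    covmean = linalg.sqrtm(sigma1 @ sigma2).real  # matrix square root
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```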

Challenges and Open Problems

Despite the advances, several challenges remain:

  • Realistic Motion and Coherence: Ensuring video frames transition smoothly and actions appear natural remains a significant hurdle.
  • Data Access Privacy: Leveraging private datasets while ensuring privacy, for example through federated learning.
  • Simultaneous Multi-shot Video Generation: Generating videos with consistent characters and styles across multiple shots.
  • Multi-Agent Co-creation: Collaborating in a multi-agent setup to achieve complex video generation tasks.

Future Directions

Looking ahead, the paper suggests some intriguing future directions:

  • Robot Learning from Visual Assistance: Using generated videos to aid robots in learning new tasks through demonstration.
  • Infinite 3D Dynamic Scene Reconstruction and Generation: Combining Sora with 3D technologies like NeRF for real-time, unbounded scene generation.
  • Augmented Digital Twins: Enhancing digital twin systems by incorporating Sora’s simulation capabilities to improve real-time data accuracy and interactivity.

Conclusion

The paper by Rui Sun et al. provides a detailed exploration of T2V generation, highlighting impressive advances and outlining challenges and future opportunities. As T2V models continue to evolve, their applications—from enhancing robotics to improving digital twins—will undoubtedly expand, making this an exciting field to watch.

For those interested in keeping up with the latest in T2V, you might want to explore the studies surveyed in the paper, many of which are listed in detail in the authors' accompanying GitHub repository.

References (154)
  1. “Introducing chatgpt,” 2024.
  2. J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, et al., “Improving image generation with better captions,” Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, vol. 2, no. 3, p. 8, 2023.
  3. “Midjourney,” 2024.
  4. R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022.
  5. T. Brooks, B. Peebles, C. Homes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman, C. W. Y. Ng, R. Wang, and A. Ramesh, “Video generation models as world simulators,” 2024.
  6. W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023.
  7. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp. 234–241, Springer, 2015.
  8. S. Chen, M. Xu, J. Ren, Y. Cong, S. He, Y. Xie, A. Sinha, P. Luo, T. Xiang, and J.-M. Perez-Rua, “Gentron: Delving deep into diffusion transformers for image and video generation,” arXiv preprint arXiv:2312.04557, 2023.
  9. A. Gupta, L. Yu, K. Sohn, X. Gu, M. Hahn, L. Fei-Fei, I. Essa, L. Jiang, and J. Lezama, “Photorealistic video generation with diffusion models,” arXiv preprint arXiv:2312.06662, 2023.
  10. X. Ma, Y. Wang, G. Jia, X. Chen, Z. Liu, Y.-F. Li, C. Chen, and Y. Qiao, “Latte: Latent diffusion transformer for video generation,” arXiv preprint arXiv:2401.03048, 2024.
  11. X. Chen, C. Xu, X. Yang, and D. Tao, “Long-term video prediction via criticization and retrospection,” IEEE Transactions on Image Processing, vol. 29, pp. 7090–7103, 2020.
  12. S. Ge, T. Hayes, H. Yang, X. Yin, G. Pang, D. Jacobs, J.-B. Huang, and D. Parikh, “Long video generation with time-agnostic vqgan and time-sensitive transformer,” in European Conference on Computer Vision, pp. 102–118, Springer, 2022.
  13. R. Villegas, M. Babaeizadeh, P.-J. Kindermans, H. Moraldo, H. Zhang, M. T. Saffar, S. Castro, J. Kunze, and D. Erhan, “Phenaki: Variable length video generation from open domain textual description,” arXiv preprint arXiv:2210.02399, 2022.
  14. Y. He, T. Yang, Y. Zhang, Y. Shan, and Q. Chen, “Latent video diffusion models for high-fidelity video generation with arbitrary lengths,” arXiv preprint arXiv:2211.13221, 2022.
  15. Y. Wang, L. Jiang, and C. C. Loy, “Styleinv: A temporal style modulated inversion network for unconditional video generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22851–22861, 2023.
  16. S. Zhuang, K. Li, X. Chen, Y. Wang, Z. Liu, Y. Qiao, and Y. Wang, “Vlogger: Make your dream a vlog,” arXiv preprint arXiv:2401.09414, 2024.
  17. D. J. Zhang, D. Li, H. Le, M. Z. Shou, C. Xiong, and D. Sahoo, “Moonshot: Towards controllable video generation and editing with multimodal conditions,” arXiv preprint arXiv:2401.01827, 2024.
  18. L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836–3847, 2023.
  19. V. Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein, “Implicit neural representations with periodic activation functions,” Advances in neural information processing systems, vol. 33, pp. 7462–7473, 2020.
  20. S. Yu, J. Tack, S. Mo, H. Kim, J. Kim, J.-W. Ha, and J. Shin, “Generating videos with dynamics-aware implicit generative adversarial networks,” arXiv preprint arXiv:2202.10571, 2022.
  21. I. Skorokhodov, S. Tulyakov, and M. Elhoseiny, “Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3626–3636, 2022.
  22. F.-Y. Wang, W. Chen, G. Song, H.-J. Ye, Y. Liu, and H. Li, “Gen-l-video: Multi-text to long video generation via temporal co-denoising,” arXiv preprint arXiv:2305.18264, 2023.
  23. X. Chen, Y. Wang, L. Zhang, S. Zhuang, X. Ma, J. Yu, Y. Wang, D. Lin, Y. Qiao, and Z. Liu, “Seine: Short-to-long video diffusion model for generative transition and prediction,” in The Twelfth International Conference on Learning Representations, 2023.
  24. R. Fridman, A. Abecasis, Y. Kasten, and T. Dekel, “Scenescape: Text-driven consistent scene generation,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  25. S. Yin, C. Wu, H. Yang, J. Wang, X. Wang, M. Ni, Z. Yang, L. Li, S. Liu, F. Yang, et al., “Nuwa-xl: Diffusion over diffusion for extremely long video generation,” arXiv preprint arXiv:2303.12346, 2023.
  26. V. Voleti, A. Jolicoeur-Martineau, and C. Pal, “Mcvd-masked conditional video diffusion for prediction, generation, and interpolation,” Advances in Neural Information Processing Systems, vol. 35, pp. 23371–23385, 2022.
  27. A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis, “Align your latents: High-resolution video synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22563–22575, 2023.
  28. D. J. Zhang, J. Z. Wu, J.-W. Liu, R. Zhao, L. Ran, Y. Gu, D. Gao, and M. Z. Shou, “Show-1: Marrying pixel and latent diffusion models for text-to-video generation,” arXiv preprint arXiv:2309.15818, 2023.
  29. O. Bar-Tal, H. Chefer, O. Tov, C. Herrmann, R. Paiss, S. Zada, A. Ephrat, J. Hur, Y. Li, T. Michaeli, et al., “Lumiere: A space-time diffusion model for video generation,” arXiv preprint arXiv:2401.12945, 2024.
  30. Y. Tian, J. Ren, M. Chai, K. Olszewski, X. Peng, D. N. Metaxas, and S. Tulyakov, “A good image generator is what you need for high-resolution video synthesis,” arXiv preprint arXiv:2104.15069, 2021.
  31. Y. Jiang, S. Yang, T. L. Koh, W. Wu, C. C. Loy, and Z. Liu, “Text2performer: Text-driven human video generation,” arXiv preprint arXiv:2304.08483, 2023.
  32. W. Bao, W.-S. Lai, C. Ma, X. Zhang, Z. Gao, and M.-H. Yang, “Depth-aware video frame interpolation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3703–3712, 2019.
  33. Y.-L. Liu, Y.-T. Liao, Y.-Y. Lin, and Y.-Y. Chuang, “Deep video frame interpolation using cyclic frame generation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 8794–8802, 2019.
  34. S. Niklaus and F. Liu, “Softmax splatting for video frame interpolation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5437–5446, 2020.
  35. T. Kalluri, D. Pathak, M. Chandraker, and D. Tran, “Flavr: Flow-agnostic video representations for fast frame interpolation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2071–2082, 2023.
  36. L. Liu, Z. Zhang, Y. Ren, R. Huang, X. Yin, and Z. Zhao, “Detector guidance for multi-object text-to-image generation,” arXiv preprint arXiv:2306.02236, 2023.
  37. Y. Wu, Z. Liu, H. Wu, and L. Lin, “Multi-object video generation from single frame layouts,” arXiv preprint arXiv:2305.03983, 2023.
  38. H. Chen, X. Wang, G. Zeng, Y. Zhang, Y. Zhou, F. Han, and W. Zhu, “Videodreamer: Customized multi-subject text-to-video generation with disen-mix finetuning,” arXiv preprint arXiv:2311.00990, 2023.
  39. L. Ruan, L. Tian, C. Huang, X. Zhang, and X. Xiao, “Univg: Towards unified-modal video generation,” arXiv preprint arXiv:2401.09084, 2024.
  40. C. Vondrick, H. Pirsiavash, and A. Torralba, “Generating videos with scene dynamics,” Advances in neural information processing systems, vol. 29, 2016.
  41. H. Lin, A. Zala, J. Cho, and M. Bansal, “Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning,” arXiv preprint arXiv:2309.15091, 2023.
  42. Y. Lu, L. Zhu, H. Fan, and Y. Yang, “Flowzero: Zero-shot text-to-video synthesis with llm-driven dynamic scene syntax,” arXiv preprint arXiv:2311.15813, 2023.
  43. F. Long, Z. Qiu, T. Yao, and T. Mei, “Videodrafter: Content-consistent multi-scene video generation with llm,” arXiv preprint arXiv:2401.01256, 2024.
  44. T. Gupta, D. Schwenk, A. Farhadi, D. Hoiem, and A. Kembhavi, “Imagine this! scripts to compositions to videos,” in Proceedings of the European conference on computer vision (ECCV), pp. 598–613, 2018.
  45. L. Lian, B. Shi, A. Yala, T. Darrell, and B. Li, “Llm-grounded video diffusion models,” arXiv preprint arXiv:2309.17444, 2023.
  46. Z. Xing, Q. Feng, H. Chen, Q. Dai, H. Hu, H. Xu, Z. Wu, and Y.-G. Jiang, “A survey on video diffusion models,” arXiv preprint arXiv:2310.10647, 2023.
  47. Y. Liu, K. Zhang, Y. Li, Z. Yan, C. Gao, R. Chen, Z. Yuan, Y. Huang, H. Sun, J. Gao, et al., “Sora: A review on background, technology, limitations, and opportunities of large vision models,” arXiv preprint arXiv:2402.17177, 2024.
  48. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.
  49. A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath, “Generative adversarial networks: An overview,” IEEE signal processing magazine, vol. 35, no. 1, pp. 53–65, 2018.
  50. K. Wang, C. Gou, Y. Duan, Y. Lin, X. Zheng, and F.-Y. Wang, “Generative adversarial networks: introduction and outlook,” IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 4, pp. 588–598, 2017.
  51. D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  52. J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
  53. “Autoregressive model,” 2024.
  54. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  55. G. Mittal, T. Marwah, and V. N. Balasubramanian, “Sync-draw: Automatic video generation using deep recurrent attentive architectures,” in Proceedings of the 25th ACM international conference on Multimedia, pp. 1096–1104, 2017.
  56. A. Van Den Oord, O. Vinyals, et al., “Neural discrete representation learning,” Advances in neural information processing systems, vol. 30, 2017.
  57. C. Wu, L. Huang, Q. Zhang, B. Li, L. Ji, F. Yang, G. Sapiro, and N. Duan, “Godiva: Generating open-domain videos from natural descriptions,” arXiv preprint arXiv:2104.14806, 2021.
  58. Y. Li, M. Min, D. Shen, D. Carlson, and L. Carin, “Video generation from text,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, 2018.
  59. K. Deng, T. Fei, X. Huang, and Y. Peng, “Irc-gan: Introspective recurrent convolutional gan for text-to-video generation.,” in IJCAI, pp. 2216–2222, 2019.
  60. Y. Pan, Z. Qiu, T. Yao, H. Li, and T. Mei, “To create what you tell: Generating videos from captions,” in Proceedings of the 25th ACM international conference on Multimedia, pp. 1789–1798, 2017.
  61. A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic, “Howto100m: Learning a text-video embedding by watching hundred million narrated video clips,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 2630–2640, 2019.
  62. S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  63. A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, et al., “Stable video diffusion: Scaling latent video diffusion models to large datasets,” arXiv preprint arXiv:2311.15127, 2023.
  64. J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet, “Video diffusion models,” 2022.
  65. Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3d u-net: learning dense volumetric segmentation from sparse annotation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016: 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II 19, pp. 424–432, Springer, 2016.
  66. D. Zhou, W. Wang, H. Yan, W. Lv, Y. Zhu, and J. Feng, “Magicvideo: Efficient video generation with latent diffusion models,” arXiv preprint arXiv:2211.11018, 2022.
  67. Y. Zeng, G. Wei, J. Zheng, J. Zou, Y. Wei, Y. Zhang, and H. Li, “Make pixels dance: High-dynamic video generation,” arXiv preprint arXiv:2311.10982, 2023.
  68. J. Z. Wu, Y. Ge, X. Wang, S. W. Lei, Y. Gu, Y. Shi, W. Hsu, Y. Shan, X. Qie, and M. Z. Shou, “Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7623–7633, 2023.
  69. X. Wang, S. Zhang, H. Zhang, Y. Liu, Y. Zhang, C. Gao, and N. Sang, “Videolcm: Video latent consistency model,” arXiv preprint arXiv:2312.09109, 2023.
  70. J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.
  71. H. Chen, Y. Zhang, X. Cun, M. Xia, X. Wang, C. Weng, and Y. Shan, “Videocrafter2: Overcoming data limitations for high-quality video diffusion models,” arXiv preprint arXiv:2401.09047, 2024.
  72. H. Chen, M. Xia, Y. He, Y. Zhang, X. Cun, S. Yang, J. Xing, Y. Liu, Q. Chen, X. Wang, et al., “Videocrafter1: Open diffusion models for high-quality video generation,” arXiv preprint arXiv:2310.19512, 2023.
  73. U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, et al., “Make-a-video: Text-to-video generation without text-video data,” arXiv preprint arXiv:2209.14792, 2022.
  74. J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, et al., “Imagen video: High definition video generation with diffusion models,” arXiv preprint arXiv:2210.02303, 2022.
  75. M. Zhang, Z. Cai, L. Pan, F. Hong, X. Guo, L. Yang, and Z. Liu, “Motiondiffuse: Text-driven human motion generation with diffusion model,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  76. L. Khachatryan, A. Movsisyan, V. Tadevosyan, R. Henschel, Z. Wang, S. Navasardyan, and H. Shi, “Text2video-zero: Text-to-image diffusion models are zero-shot video generators,” arXiv preprint arXiv:2303.13439, 2023.
  77. W. Yan, Y. Zhang, P. Abbeel, and A. Srinivas, “Videogpt: Video generation using vq-vae and transformers,” arXiv preprint arXiv:2104.10157, 2021.
  78. C. Wu, J. Liang, L. Ji, F. Yang, Y. Fang, D. Jiang, and N. Duan, “Nüwa: Visual synthesis pre-training for neural visual world creation,” in European conference on computer vision, pp. 720–736, Springer, 2022.
  79. C. Wu, J. Liang, X. Hu, Z. Gan, J. Wang, L. Wang, Z. Liu, Y. Fang, and N. Duan, “Nuwa-infinity: Autoregressive over autoregressive generation for infinite visual synthesis,” arXiv preprint arXiv:2207.09814, 2022.
  80. H. Liu, W. Yan, M. Zaharia, and P. Abbeel, “World model on million-length video and language with ringattention,” arXiv preprint arXiv:2402.08268, 2024.
  81. J. Bruce, M. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, et al., “Genie: Generative interactive environments,” arXiv preprint arXiv:2402.15391, 2024.
  82. W. Hong, M. Ding, W. Zheng, X. Liu, and J. Tang, “Cogvideo: Large-scale pretraining for text-to-video generation via transformers,” arXiv preprint arXiv:2205.15868, 2022.
  83. M. Ding, W. Zheng, W. Hong, and J. Tang, “Cogview2: Faster and better text-to-image generation via hierarchical transformers,” Advances in Neural Information Processing Systems, vol. 35, pp. 16890–16902, 2022.
  84. “Creating video from text,” 2024.
  85. R. Wu, L. Chen, T. Yang, C. Guo, C. Li, and X. Zhang, “Lamp: Learn a motion pattern for few-shot-based video generation,” arXiv preprint arXiv:2310.10769, 2023.
  86. Y. Guo, C. Yang, A. Rao, Y. Wang, Y. Qiao, D. Lin, and B. Dai, “Animatediff: Animate your personalized text-to-image diffusion models without specific tuning,” arXiv preprint arXiv:2307.04725, 2023.
  87. J. Liang, Y. Fan, K. Zhang, R. Timofte, L. Van Gool, and R. Ranjan, “Movideo: Motion-aware video generation with diffusion models,” arXiv preprint arXiv:2311.11325, 2023.
  88. H. Fei, S. Wu, W. Ji, H. Zhang, and T.-S. Chua, “Empowering dynamics-aware text-to-video diffusion with large language models,” arXiv preprint arXiv:2308.13812, 2023.
  89. W. Weng, R. Feng, Y. Wang, Q. Dai, C. Wang, D. Yin, Z. Zhao, K. Qiu, J. Bao, Y. Yuan, et al., “Art•v: Auto-regressive text-to-video generation with diffusion models,” arXiv preprint arXiv:2311.18834, 2023.
  90. J. Xing, M. Xia, Y. Zhang, H. Chen, X. Wang, T.-T. Wong, and Y. Shan, “Dynamicrafter: Animating open-domain images with video diffusion priors,” arXiv preprint arXiv:2310.12190, 2023.
  91. Y. Wang, J. Bao, W. Weng, R. Feng, D. Yin, T. Yang, J. Zhang, Q. D. Z. Zhao, C. Wang, K. Qiu, et al., “Microcinema: A divide-and-conquer approach for text-to-video generation,” arXiv preprint arXiv:2311.18829, 2023.
  92. B. Peng, X. Chen, Y. Wang, C. Lu, and Y. Qiao, “Conditionvideo: Training-free condition-guided text-to-video generation,” arXiv preprint arXiv:2310.07697, 2023.
  93. Y. Wei, S. Zhang, Z. Qing, H. Yuan, Z. Liu, Y. Liu, Y. Zhang, J. Zhou, and H. Shan, “Dreamvideo: Composing your dream videos with customized subject and motion,” arXiv preprint arXiv:2312.04433, 2023.
  94. X. Wang, S. Zhang, H. Yuan, Z. Qing, B. Gong, Y. Zhang, Y. Shen, C. Gao, and N. Sang, “A recipe for scaling up text-to-video generation with text-free videos,” arXiv preprint arXiv:2312.15770, 2023.
  95. J. Lv, Y. Huang, M. Yan, J. Huang, J. Liu, Y. Liu, Y. Wen, X. Chen, and S. Chen, “Gpt4motion: Scripting physical motions in text-to-video generation via blender-oriented gpt planning,” arXiv preprint arXiv:2311.12631, 2023.
  96. J. Yu, H. Zhu, L. Jiang, C. C. Loy, W. Cai, and W. Wu, “Celebv-text: A large-scale facial text-video dataset,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14805–14814, 2023.
  97. J. Xu, T. Mei, T. Yao, and Y. Rui, “Msr-vtt: A large video description dataset for bridging video and language,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5288–5296, 2016.
  98. L. Anne Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell, “Localizing moments in video with natural language,” in Proceedings of the IEEE international conference on computer vision, pp. 5803–5812, 2017.
  99. R. Zellers, X. Lu, J. Hessel, Y. Yu, J. S. Park, J. Cao, A. Farhadi, and Y. Choi, “Merlot: Multimodal neural script knowledge models,” Advances in Neural Information Processing Systems, vol. 34, pp. 23634–23651, 2021.
  100. M. Bain, A. Nagrani, G. Varol, and A. Zisserman, “Frozen in time: A joint video and image encoder for end-to-end retrieval,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1728–1738, 2021.
  101. H. Xue, T. Hang, Y. Zeng, Y. Sun, B. Liu, H. Yang, J. Fu, and B. Guo, “Advancing high-resolution video-language representation with large-scale video transcriptions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5036–5045, 2022.
  102. Y. Wang, Y. He, Y. Li, K. Li, J. Yu, X. Ma, X. Li, G. Chen, X. Chen, Y. Wang, et al., “Internvid: A large-scale video-text dataset for multimodal understanding and generation,” arXiv preprint arXiv:2307.06942, 2023.
  103. W. Wang, H. Yang, Z. Tuo, H. He, J. Zhu, J. Fu, and J. Liu, “Videofactory: Swap attention in spatiotemporal diffusions for text-to-video generation,” arXiv preprint arXiv:2305.10874, 2023.
  104. H. Xu, Q. Ye, X. Wu, M. Yan, Y. Miao, J. Ye, G. Xu, A. Hu, Y. Shi, G. Xu, et al., “Youku-mplug: A 10 million large-scale chinese video-language dataset for pre-training and benchmarks,” arXiv preprint arXiv:2306.04362, 2023.
  105. S. Chen, H. Li, Q. Wang, Z. Zhao, M. Sun, X. Zhu, and J. Liu, “Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  106. T.-S. Chen, A. Siarohin, W. Menapace, E. Deyneka, H.-w. Chao, B. E. Jeon, Y. Fang, H.-Y. Lee, J. Ren, M.-H. Yang, et al., “Panda-70m: Captioning 70m videos with multiple cross-modality teachers,” arXiv preprint arXiv:2402.19479, 2024.
  107. A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville, and B. Schiele, “Movie description,” International Journal of Computer Vision, vol. 123, pp. 94–120, 2017.
  108. M. Soldan, A. Pardo, J. L. Alcázar, F. Caba, C. Zhao, S. Giancola, and B. Ghanem, “Mad: A scalable dataset for language grounding in videos from movie audio descriptions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5026–5035, 2022.
  109. K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
  110. F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles, “Activitynet: A large-scale video benchmark for human activity understanding,” in Proceedings of the ieee conference on computer vision and pattern recognition, pp. 961–970, 2015.
  111. G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta, “Hollywood in homes: Crowdsourcing data collection for activity understanding,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, pp. 510–526, Springer, 2016.
  112. W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017.
  113. R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles, “Dense-captioning events in videos,” in Proceedings of the IEEE international conference on computer vision, pp. 706–715, 2017.
  114. G. A. Sigurdsson, A. Gupta, C. Schmid, A. Farhadi, and K. Alahari, “Charades-ego: A large-scale dataset of paired third and first person videos,” arXiv preprint arXiv:1804.09626, 2018.
  115. R. Goyal, S. Ebrahimi Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al., “The ‘something something’ video database for learning and evaluating visual common sense,” in Proceedings of the IEEE international conference on computer vision, pp. 5842–5850, 2017.
  116. R. Sanabria, O. Caglayan, S. Palaskar, D. Elliott, L. Barrault, L. Specia, and F. Metze, “How2: a large-scale dataset for multimodal language understanding,” arXiv preprint arXiv:1811.00347, 2018.
  117. L. Zhou, C. Xu, and J. Corso, “Towards automatic learning of procedures from web instructional videos,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, 2018.
  118. D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, et al., “Scaling egocentric vision: The epic-kitchens dataset,” in Proceedings of the European conference on computer vision (ECCV), pp. 720–736, 2018.
  119. “Youku.” https://www.youku.com/.
  120. “Youtube.” https://www.youtube.com/.
  121. Z. Xing, Q. Dai, H. Hu, Z. Wu, and Y.-G. Jiang, “Simda: Simple diffusion adapter for efficient video generation,” arXiv preprint arXiv:2308.09710, 2023.
  122. Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
  123. T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” Advances in neural information processing systems, vol. 29, 2016.
  124. C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826, 2016.
  125. J. R. Hershey and P. A. Olsen, “Approximating the kullback leibler divergence between gaussian mixture models,” in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP’07, vol. 4, pp. IV–317, IEEE, 2007.
  126. M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in neural information processing systems, vol. 30, 2017.
  127. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning, pp. 8748–8763, PMLR, 2021.
  128. M. Saito, S. Saito, M. Koyama, and S. Kobayashi, “Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal gan,” International Journal of Computer Vision, vol. 128, no. 10-11, pp. 2586–2606, 2020.
  129. D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE international conference on computer vision, pp. 4489–4497, 2015.
  130. J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308, 2017.
  131. T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly, “Fvd: A new metric for video generation,” 2019.
  132. T. Unterthiner, S. Van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly, “Towards accurate generative models of video: A new metric & challenges,” arXiv preprint arXiv:1812.01717, 2018.
  133. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial intelligence and statistics, pp. 1273–1282, PMLR, 2017.
  134. J. Zhang, S. Vahidian, M. Kuo, C. Li, R. Zhang, G. Wang, and Y. Chen, “Towards building the federated gpt: Federated instruction tuning,” arXiv preprint arXiv:2305.05644, 2023.
  135. X. Huang, P. Li, H. Du, J. Kang, D. Niyato, D. I. Kim, and Y. Wu, “Federated learning-empowered ai-generated content in wireless networks,” IEEE Network, 2024.
  136. S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, et al., “Metagpt: Meta programming for multi-agent collaborative framework,” arXiv preprint arXiv:2308.00352, 2023.
  137. G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem, “Camel: Communicative agents for ‘mind’ exploration of large scale language model society,” arXiv preprint arXiv:2303.17760, 2023.
  138. Y. Talebirad and A. Nadiri, “Multi-agent collaboration: Harnessing the power of intelligent llm agents,” arXiv preprint arXiv:2306.03314, 2023.
  139. S. Han, Q. Zhang, Y. Yao, W. Jin, Z. Xu, and C. He, “Llm multi-agent systems: Challenges and open problems,” arXiv preprint arXiv:2402.03578, 2024.
  140. S. Abdelnabi, A. Gomaa, S. Sivaprasad, L. Schönherr, and M. Fritz, “Llm-deliberation: Evaluating llms with interactive multi-agent negotiation games,” arXiv preprint arXiv:2309.17234, 2023.
  141. H. Ravichandar, A. S. Polydoros, S. Chernova, and A. Billard, “Recent advances in robot learning from demonstration,” Annual review of control, robotics, and autonomous systems, vol. 3, pp. 297–330, 2020.
  142. C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song, “Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots,” arXiv preprint arXiv:2402.10329, 2024.
  143. W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei, “Voxposer: Composable 3d value maps for robotic manipulation with language models,” arXiv preprint arXiv:2307.05973, 2023.
  144. B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “NeRF: Representing scenes as neural radiance fields for view synthesis,” in The European Conference on Computer Vision (ECCV), 2020.
  145. B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM Transactions on Graphics, vol. 42, no. 4, 2023.
  146. W. Shen, G. Yang, A. Yu, J. Wong, L. P. Kaelbling, and P. Isola, “Distilled feature fields enable few-shot language-guided manipulation,” arXiv preprint arXiv:2308.07931, 2023.
  147. M. Pollefeys, D. Nistér, J.-M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S.-J. Kim, P. Merrell, et al., “Detailed real-time urban 3d reconstruction from video,” International Journal of Computer Vision, vol. 78, pp. 143–167, 2008.
  148. T. Li, M. Slavcheva, M. Zollhoefer, S. Green, C. Lassner, C. Kim, T. Schmidt, S. Lovegrove, M. Goesele, R. Newcombe, et al., “Neural 3d video synthesis from multi-view video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5521–5531, 2022.
  149. J. Sun, Y. Xie, L. Chen, X. Zhou, and H. Bao, “Neuralrecon: Real-time coherent 3d reconstruction from monocular video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15598–15607, 2021.
  150. Z. Wu, Y. Li, H. Yan, T. Shang, W. Sun, S. Wang, R. Cui, W. Liu, H. Sato, H. Li, et al., “Blockfusion: Expandable 3d scene generation using latent tri-plane extrapolation,” arXiv preprint arXiv:2401.17053, 2024.
  151. D. Jones, C. Snider, A. Nassehi, J. Yon, and B. Hicks, “Characterising the digital twin: A systematic literature review,” CIRP journal of manufacturing science and technology, vol. 29, pp. 36–52, 2020.
  152. C. Chen and K. Shu, “Can llm-generated misinformation be detected?,” arXiv preprint arXiv:2309.13788, 2023.
  153. Z. Liu, X. Yu, L. Zhang, Z. Wu, C. Cao, H. Dai, L. Zhao, W. Liu, D. Shen, Q. Li, et al., “Deid-gpt: Zero-shot medical text de-identification by gpt-4,” arXiv preprint arXiv:2303.11032, 2023.
  154. J. Yao, X. Yi, X. Wang, Y. Gong, and X. Xie, “Value fulcra: Mapping large language models to the multidimensional spectrum of basic human values,” arXiv preprint arXiv:2311.10766, 2023.
Authors (9)
  1. Rui Sun
  2. Yumin Zhang
  3. Tejal Shah
  4. Jiahao Sun
  5. Shuoying Zhang
  6. Wenqi Li
  7. Haoran Duan
  8. Rajiv Ranjan
  9. Bo Wei