
Mora: Enabling Generalist Video Generation via A Multi-Agent Framework (2403.13248v3)

Published 20 Mar 2024 in cs.CV

Abstract: Text-to-video generation has made significant strides, but replicating the capabilities of advanced systems like OpenAI Sora remains challenging due to their closed-source nature. Existing open-source methods struggle to achieve comparable performance, often hindered by ineffective agent collaboration and inadequate training data quality. In this paper, we introduce Mora, a novel multi-agent framework that leverages existing open-source modules to replicate Sora functionalities. We address these fundamental limitations by proposing three key techniques: (1) multi-agent fine-tuning with a self-modulation factor to enhance inter-agent coordination, (2) a data-free training strategy that uses large models to synthesize training data, and (3) a human-in-the-loop mechanism combined with multimodal LLMs for data filtering to ensure high-quality training datasets. Our comprehensive experiments on six video generation tasks demonstrate that Mora achieves performance comparable to Sora on VBench, outperforming existing open-source methods across various tasks. Specifically, in the text-to-video generation task, Mora achieved a Video Quality score of 0.800, surpassing Sora's 0.797 and outperforming all other baseline models across six key metrics. Additionally, in the image-to-video generation task, Mora achieved a perfect Dynamic Degree score of 1.00, demonstrating exceptional capability in enhancing motion realism and achieving higher Imaging Quality than Sora. These results highlight the potential of collaborative multi-agent systems and human-in-the-loop mechanisms in advancing text-to-video generation. Our code is available at \url{https://github.com/lichao-sun/Mora}.
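To make the framework described in the abstract more concrete, here is a minimal sketch of how a chain of generation agents could be composed and how an MLLM-based filter could gate synthesized clips before they enter a training set. All names (Sample, Agent, MLLMFilter) and the generate/score callables are hypothetical placeholders chosen for illustration; this is not Mora's released code, and the self-modulation factor used in its multi-agent fine-tuning is not modeled here.

```python
# Hypothetical sketch of a multi-agent text-to-video pipeline with
# MLLM-assisted data filtering, in the spirit of the abstract above.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Sample:
    prompt: str
    frames: List[bytes] = field(default_factory=list)  # encoded video frames
    quality: float = 0.0

class Agent:
    """One stage of the pipeline (e.g. text-to-image, then image-to-video)."""
    def __init__(self, name: str, generate: Callable[[Sample], Sample]):
        self.name = name
        self.generate = generate  # backend model call, supplied by the caller

    def run(self, sample: Sample) -> Sample:
        return self.generate(sample)

class MLLMFilter:
    """Scores synthesized clips; low-quality ones are dropped before training."""
    def __init__(self, score: Callable[[Sample], float], threshold: float = 0.7):
        self.score = score          # e.g. an MLLM-backed quality judge
        self.threshold = threshold  # illustrative cutoff, not from the paper

    def keep(self, sample: Sample) -> bool:
        sample.quality = self.score(sample)
        return sample.quality >= self.threshold

def run_pipeline(prompt: str, agents: List[Agent]) -> Sample:
    """Pass a prompt through the agents in order."""
    sample = Sample(prompt=prompt)
    for agent in agents:
        sample = agent.run(sample)
    return sample

def build_training_set(prompts: List[str], agents: List[Agent],
                       mllm_filter: MLLMFilter) -> List[Sample]:
    """Data-free training loop: synthesize candidates, keep only filtered ones.
    In the workflow the abstract describes, a human reviewer would also audit
    the kept samples before they are used for fine-tuning."""
    return [s for s in (run_pipeline(p, agents) for p in prompts)
            if mllm_filter.keep(s)]
```

A caller would supply concrete generate functions (for example, a text-to-image model followed by an image-to-video model) and an MLLM-backed score function; the structure only makes explicit that filtering happens before any synthesized sample is used for training.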

Authors (13)
  1. Zhengqing Yuan (17 papers)
  2. Ruoxi Chen (22 papers)
  3. Zhaoxu Li (7 papers)
  4. Haolong Jia (3 papers)
  5. Lifang He (98 papers)
  6. Chi Wang (93 papers)
  7. Lichao Sun (186 papers)
  8. Yixin Liu (108 papers)
  9. Yihan Cao (14 papers)
  10. Weixiang Sun (20 papers)
  11. Bin Lin (33 papers)
  12. Li Yuan (141 papers)
  13. Yanfang Ye (67 papers)
Citations (17)

Summary

Introducing TrustGPT: Benchmarks for Assessing the Ethical Dimensions of LLMs

Overview of TrustGPT

Within the rapidly evolving landscape of natural language processing, LLMs have emerged as powerful tools capable of performing a wide range of tasks. However, alongside their significant benefits, these models pose ethical challenges that call for careful evaluation of their societal impacts. To address this need, researchers Yue Huang, Qihui Zhang, and Lichao Sun have proposed TrustGPT, a comprehensive benchmark designed to assess LLMs along three critical ethical dimensions: toxicity, bias, and value alignment.

Key Contributions

TrustGPT stands out through its targeted approach to evaluating LLMs by focusing on:

  • Toxicity: Identifying and measuring the extent to which LLMs can generate harmful or inappropriate content based on various social norms.
  • Bias: Investigating biases within LLMs by analyzing their responses across different demographic groups and quantifying any identified disparities.
  • Value Alignment: Examining how well the outputs of LLMs align with human ethical values, categorized into active and passive alignments.

This benchmark offers researchers and developers a structured framework for scrutinizing the ethical impacts of their LLMs, paving the way for more responsible and socially aware language technologies.
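As a rough illustration of how per-model scores along these three dimensions might be aggregated, the sketch below computes an average toxicity, a cross-group disparity, and a value-alignment rate. The function names, the use of a population standard deviation as the disparity measure, and the boolean rater judgements are illustrative assumptions, not TrustGPT's exact metrics or prompt sets.

```python
# Illustrative per-model scoring along toxicity, bias, and value alignment.
from statistics import mean, pstdev
from typing import Callable, Dict, List

def toxicity(outputs: List[str],
             toxicity_score: Callable[[str], float]) -> float:
    """Average toxicity over a model's generations (0 = benign, 1 = toxic)."""
    return mean(toxicity_score(o) for o in outputs)

def bias_disparity(outputs_by_group: Dict[str, List[str]],
                   toxicity_score: Callable[[str], float]) -> float:
    """Spread of per-group toxicity; larger values mean more uneven treatment
    across demographic groups."""
    per_group = [toxicity(outs, toxicity_score)
                 for outs in outputs_by_group.values()]
    return pstdev(per_group)

def value_alignment(judgements: List[bool]) -> float:
    """Fraction of responses a rater marked as consistent with the target
    values (covering both actively refusing harmful requests and passively
    avoiding harmful content)."""
    return sum(judgements) / len(judgements)
```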

Empirical Evaluation and Insights

Utilizing TrustGPT, the authors conducted a thorough evaluation of eight state-of-the-art LLMs, including the well-known ChatGPT and LLaMA models. The empirical analysis shed light on several crucial aspects:

  • Toxicity assessments revealed varying levels of potential harmful content generation among the models, with certain models displaying a higher propensity for toxicity under specific prompts.
  • Bias metrics underscored the existence of significant biases in several models, particularly towards specific demographic groups, highlighting the urgent need for bias mitigation strategies.
  • Value alignment tasks illustrated the challenges models face in aligning their outputs with human ethical standards, especially under complex or ambiguous scenarios.

These findings underscore the importance of ongoing efforts to address ethical considerations in the development and deployment of LLMs.

The Path Forward

The introduction of TrustGPT marks a significant step towards a more ethical and responsible approach to LLM development. By highlighting the potential risks and ethical dilemmas associated with these technologies, this research encourages the AI community to prioritize the development of LLMs that not only excel in task performance but also adhere to societal norms and values.

Future research directions inspired by TrustGPT could include the exploration of more nuanced ethical frameworks, the development of advanced bias mitigation techniques, and the creation of more sophisticated models capable of navigating the complex landscape of human ethics.

In conclusion, TrustGPT serves as a valuable tool for the AI research community, offering insights into the ethical dimensions of LLMs and guiding the development of more ethical, transparent, and equitable language technologies.
