AesopAgent: Agent-driven Evolutionary System on Story-to-Video Production (2403.07952v1)

Published 12 Mar 2024 in cs.CV, cs.AI, and cs.MM

Abstract: Agent and AIGC (Artificial Intelligence Generated Content) technologies have recently made significant progress. We propose AesopAgent, an agent-driven evolutionary system for story-to-video production and a practical application of agent technology to multimodal content generation. The system integrates multiple generative capabilities within a unified framework so that individual users can leverage these modules easily. It converts user story proposals into scripts, images, and audio, and then integrates these multimodal contents into videos. Additionally, animating units (e.g., Gen-2 and Sora) can make the videos more expressive. AesopAgent orchestrates the task workflow for video generation, ensuring that the generated video is both rich in content and coherent. The system mainly contains two layers: the Horizontal Layer and the Utility Layer. In the Horizontal Layer, we introduce a novel RAG-based evolutionary system that optimizes the whole video-generation workflow and the steps within it. It continuously evolves, iteratively optimizing the workflow by accumulating expert experience and professional knowledge, including optimizing LLM prompts and utility usage. The Utility Layer provides multiple utilities, yielding consistent image generation that is visually coherent in composition, characters, and style. It also provides audio and special effects, integrating them into expressive and logically arranged videos. Overall, AesopAgent achieves state-of-the-art performance compared with many previous works in visual storytelling. AesopAgent is designed as a convenient service for individual users and is available at: https://aesopai.github.io/.
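The two-layer design the abstract describes (a Horizontal Layer that orchestrates the workflow and accumulates experience, and a Utility Layer of generative capabilities) can be sketched in miniature as below. This is purely illustrative: every name (`StoryToVideoPipeline`, `write_script`, `generate_image`, and so on) is hypothetical, and the real generative utilities (LLM prompting, image/audio synthesis, RAG retrieval) are replaced by placeholder functions.

```python
from typing import List

# --- Utility Layer: stand-ins for the generative utilities ---
def write_script(proposal: str) -> List[str]:
    """Split a story proposal into per-shot script lines (placeholder
    for the real LLM-driven script-writing step)."""
    return [s.strip() for s in proposal.split(".") if s.strip()]

def generate_image(shot: str) -> str:
    return f"<image for: {shot}>"          # placeholder for image generation

def generate_audio(shot: str) -> str:
    return f"<audio for: {shot}>"          # placeholder for audio generation

def compose_video(images: List[str], audio: List[str]) -> str:
    return f"<video with {len(images)} shots>"  # placeholder for composition

# --- Horizontal Layer: orchestrates the workflow and keeps an
# experience store that a RAG step could query to refine prompts ---
class StoryToVideoPipeline:
    def __init__(self) -> None:
        self.experience: List[str] = []    # accumulated expert experience

    def retrieve_experience(self, step: str) -> List[str]:
        # Placeholder for RAG retrieval over the accumulated knowledge;
        # here it is just a substring match.
        return [e for e in self.experience if step in e]

    def run(self, proposal: str) -> str:
        shots = write_script(proposal)
        images = [generate_image(s) for s in shots]
        audio = [generate_audio(s) for s in shots]
        video = compose_video(images, audio)
        # Record the outcome so later runs can retrieve and reuse it,
        # mirroring the iterative workflow optimization in the paper.
        self.experience.append(f"script: {len(shots)} shots accepted")
        return video
```

A run over a two-sentence proposal produces a two-shot "video" and leaves one experience entry behind for later retrieval; the real system would replace each stub with a full generative module and an actual retrieval-augmented optimization loop.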
