Large Multimodal Agents: A Survey (2402.15116v1)

Published 23 Feb 2024 in cs.CV, cs.AI, and cs.CL

Abstract: LLMs have achieved superior performance in powering text-based AI agents, endowing them with decision-making and reasoning abilities akin to humans. Concurrently, there is an emerging research trend focused on extending these LLM-powered AI agents into the multimodal domain. This extension enables AI agents to interpret and respond to diverse multimodal user queries, thereby handling more intricate and nuanced tasks. In this paper, we conduct a systematic review of LLM-driven multimodal agents, which we refer to as large multimodal agents (LMAs for short). First, we introduce the essential components involved in developing LMAs and categorize the current body of research into four distinct types. Subsequently, we review the collaborative frameworks integrating multiple LMAs, enhancing collective efficacy. One of the critical challenges in this field is the diverse evaluation methods used across existing studies, hindering effective comparison among different LMAs. Therefore, we compile these evaluation methodologies and establish a comprehensive framework to bridge the gaps. This framework aims to standardize evaluations, facilitating more meaningful comparisons. Concluding our review, we highlight the extensive applications of LMAs and propose possible future research directions. Our discussion aims to provide valuable insights and guidelines for future research in this rapidly evolving field. An up-to-date resource list is available at https://github.com/jun0wanan/awesome-large-multimodal-agents.

Systematic Review and Future Directions for Large Multimodal Agents Powered by LLMs

Introduction

The introduction highlights the pivotal role of LLMs in giving AI agents decision-making and reasoning abilities that approach human capabilities. As demands on AI systems grow, extending agents to multiple modalities, yielding what the authors call Large Multimodal Agents (LMAs), promises a shift toward handling more sophisticated and nuanced tasks across text, images, and video. The paper systematically reviews the existing body of work on LMAs, categorizes it by functionality, examines collaborative frameworks that enhance collective efficacy, and addresses the fragmented state of evaluation by proposing a comprehensive framework to enable meaningful comparisons and guide future research.

Core Components of LMA Development

Perception

Perception modules are responsible for multimodal data processing: extracting and interpreting useful information from varied inputs such as images, video, and audio to support downstream decision-making. The survey notes recent advances in handling increasingly complex inputs, which significantly broadens the agents' utility in real-world scenarios.
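
To make the role of a perception module concrete, here is a minimal sketch, not taken from the paper, of a perception layer that routes raw inputs by modality to pluggable encoders and returns text a planner can reason over; the `Observation` and `PerceptionModule` names and the lambda captioner are illustrative assumptions.

```python
# A minimal sketch (not from the paper) of a perception layer that routes
# raw inputs by modality to pluggable encoders and returns text descriptions
# for an LLM-based planner. All names here are illustrative placeholders.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Observation:
    modality: str   # e.g. "image", "audio", "video", "text"
    data: bytes     # raw payload

class PerceptionModule:
    def __init__(self) -> None:
        # Maps a modality to a function that turns raw data into text.
        self.encoders: Dict[str, Callable[[bytes], str]] = {}

    def register(self, modality: str, encoder: Callable[[bytes], str]) -> None:
        self.encoders[modality] = encoder

    def perceive(self, obs: Observation) -> str:
        if obs.modality not in self.encoders:
            raise ValueError(f"no encoder registered for {obs.modality!r}")
        return self.encoders[obs.modality](obs.data)

# Usage: plug any captioner or ASR model behind these callables.
perception = PerceptionModule()
perception.register("image", lambda b: f"<caption of {len(b)} image bytes>")
print(perception.perceive(Observation("image", b"\x89PNG...")))
```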

Planning and Decision Making

The planning discussion surveys existing planners across models, formats, and methodologies, highlighting their critical role in strategy formulation and decision-making. Current systems rely heavily on proprietary models such as GPT-3.5 and GPT-4. A comparison of static and dynamic planning shows a clear tendency toward dynamic planning, which lets the agent adjust for errors while a task is underway.
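
The static/dynamic distinction can be illustrated with a small sketch. The `llm_plan` function below is a hypothetical stand-in for a call to a proprietary planner such as GPT-4, not a real API; the key difference is that the dynamic variant re-plans when a step fails.

```python
# Hedged sketch contrasting static planning (plan once, execute) with
# dynamic planning (revise the plan when a step fails). `llm_plan` is a
# placeholder, not a real API call.
from typing import Callable, List

def llm_plan(task: str, feedback: str = "") -> List[str]:
    # Placeholder: a real system would prompt an LLM here.
    return [f"step for {task}" + (f" (adjusted: {feedback})" if feedback else "")]

def static_plan_and_execute(task: str, execute: Callable[[str], bool]) -> bool:
    plan = llm_plan(task)              # plan is fixed before execution
    return all(execute(step) for step in plan)

def dynamic_plan_and_execute(task: str, execute: Callable[[str], bool],
                             max_retries: int = 3) -> bool:
    plan = llm_plan(task)
    for step in plan:
        tries = 0
        while not execute(step):
            tries += 1
            if tries > max_retries:
                return False
            # Re-plan using the failure as feedback: the key difference
            # from the static variant.
            step = llm_plan(task, feedback=f"'{step}' failed")[0]
    return True

print(dynamic_plan_and_execute("book a flight", lambda step: True))
```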

Action Execution

Action components fall into three classes: tool use, embodied actions, and virtual interactions with software systems. The survey covers the range of actions observed in task execution, showing a trend toward sophisticated implementations that span both real and virtual environments.
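
A rough sketch, under assumptions of my own rather than a design prescribed by the survey, of an action layer that dispatches planner decisions to the three action kinds named above; all handlers are placeholders.

```python
# Illustrative action layer dispatching to the three action kinds the survey
# distinguishes: tool calls, embodied (physical) actions, and virtual
# (GUI/web) actions. Handlers below are hypothetical placeholders.
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class Action:
    kind: str          # "tool" | "embodied" | "virtual"
    name: str          # e.g. "search", "grasp", "click"
    args: Dict[str, Any]

class ActionExecutor:
    def __init__(self) -> None:
        self.handlers: Dict[str, Callable[[Action], str]] = {
            "tool": self._call_tool,
            "embodied": self._act_in_world,
            "virtual": self._act_on_screen,
        }

    def execute(self, action: Action) -> str:
        return self.handlers[action.kind](action)

    def _call_tool(self, a: Action) -> str:
        return f"tool {a.name} returned a result for {a.args}"

    def _act_in_world(self, a: Action) -> str:
        return f"robot performed {a.name}"

    def _act_on_screen(self, a: Action) -> str:
        return f"UI event {a.name} sent with {a.args}"

executor = ActionExecutor()
print(executor.execute(Action("virtual", "click", {"x": 120, "y": 48})))
```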

Memory Systems

The discussion of memory systems in LMAs indicates an emerging trend toward integrating long-term memory, which strengthens agents operating in complex task environments. Storing and retrieving prior experiences and data improves task accuracy and efficiency.
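
As a toy illustration of the store-and-retrieve cycle, the sketch below keeps (task, outcome) experiences and retrieves them by naive keyword overlap; real LMAs typically use embedding-based retrieval, so this lexical version is only an assumption-laden stand-in.

```python
# Toy long-term memory: past (task, outcome) experiences are stored and
# retrieved by keyword overlap. This only illustrates the store/retrieve
# cycle; it is not how the surveyed systems implement memory.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Experience:
    task: str
    outcome: str

@dataclass
class LongTermMemory:
    experiences: List[Experience] = field(default_factory=list)

    def store(self, task: str, outcome: str) -> None:
        self.experiences.append(Experience(task, outcome))

    def retrieve(self, query: str, k: int = 3) -> List[Experience]:
        q = set(query.lower().split())
        scored = sorted(
            self.experiences,
            key=lambda e: len(q & set(e.task.lower().split())),
            reverse=True,
        )
        return scored[:k]

memory = LongTermMemory()
memory.store("open the settings app and enable dark mode", "succeeded via menu path")
print(memory.retrieve("enable dark mode"))
```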

LMA Categorization and Taxonomy

The paper introduces an innovative taxonomy categorizing LMAs into four distinct types primarily based on their planning capabilities and memory integration. From closed-source LLMs acting as basic planners without memory functionality to advanced systems featuring interactive long-term memory, the taxonomy provides a structured framework reflecting the evolutionary advancements in LMA development.
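
One way to picture the two axes the taxonomy varies, planner type and long-term memory, is the illustrative encoding below; the labels are assumptions and do not reproduce the paper's exact type names.

```python
# Illustrative (not canonical) encoding of the taxonomy's two axes: the kind
# of planner an agent uses and whether it has long-term memory.
from dataclasses import dataclass
from enum import Enum

class Planner(Enum):
    CLOSED_SOURCE_LLM = "prompted closed-source LLM"
    FINETUNED_LLM = "fine-tuned (multimodal) LLM"

@dataclass
class AgentProfile:
    planner: Planner
    long_term_memory: bool

    def describe(self) -> str:
        mem = "with" if self.long_term_memory else "without"
        return f"{self.planner.value} planner, {mem} long-term memory"

basic = AgentProfile(Planner.CLOSED_SOURCE_LLM, long_term_memory=False)
advanced = AgentProfile(Planner.FINETUNED_LLM, long_term_memory=True)
print(basic.describe(), "|", advanced.describe())
```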

Collaborative Frameworks

Expanding beyond single-agent models, the review discusses multi-agent collaboration, providing insights into frameworks that involve multiple LMAs working synergistically. This segment highlights the importance of role differentiation and strategic task distribution among agents to optimize collective performance in complex scenarios.
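
A hedged sketch of role differentiation and task distribution: a coordinator routes each subtask to the agent whose declared role matches. The roles and the routing rule are illustrative assumptions, not a framework taken from the survey.

```python
# Sketch of multi-agent collaboration with role differentiation: a
# coordinator splits work into subtasks and routes each to the matching
# role. Roles and solvers here are made up for illustration.
from typing import Callable, Dict, List, Tuple

class RoleAgent:
    def __init__(self, role: str, solve: Callable[[str], str]) -> None:
        self.role = role
        self.solve = solve

def coordinate(subtasks: List[Tuple[str, str]],
               agents: Dict[str, RoleAgent]) -> List[str]:
    """subtasks: list of (required_role, description) pairs."""
    results = []
    for role, description in subtasks:
        agent = agents[role]                 # strategic task distribution
        results.append(agent.solve(description))
    return results

agents = {
    "perceiver": RoleAgent("perceiver", lambda t: f"described scene for: {t}"),
    "actor": RoleAgent("actor", lambda t: f"executed UI steps for: {t}"),
}
print(coordinate([("perceiver", "screenshot"), ("actor", "click submit")], agents))
```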

Evaluation Strategies

The paper presents a critical analysis of existing evaluation methodologies for LMAs, revealing the lack of comprehensive, standardized evaluation frameworks. It advocates rigorous, scenario-specific benchmarks that can effectively measure the functionality and performance of LMAs across a variety of tasks.
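
To suggest what a scenario-specific harness could look like, here is a minimal sketch in which each scenario defines a task and a success check and the harness reports per-scenario outcomes; the scenario names and the dummy agent are invented for illustration.

```python
# Minimal benchmark harness sketch: each scenario pairs a task with a
# success check, and the harness scores an agent per scenario. Scenarios
# and the dummy agent are illustrative, not from the survey.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Scenario:
    name: str
    task: str
    success: Callable[[str], bool]   # checks the agent's final answer/state

def evaluate(agent: Callable[[str], str],
             scenarios: List[Scenario]) -> Dict[str, float]:
    results: Dict[str, float] = {}
    for s in scenarios:
        results[s.name] = 1.0 if s.success(agent(s.task)) else 0.0
    return results

scenarios = [
    Scenario("gui_navigation", "open settings", lambda out: "settings" in out),
    Scenario("visual_qa", "how many cats are in the image?",
             lambda out: out.strip().isdigit()),
]
dummy_agent = lambda task: "settings opened" if "settings" in task else "2"
print(evaluate(dummy_agent, scenarios))
```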

Practical Applications and Real-World Utility

This section surveys the extensive applications of LMAs, from GUI automation and robotics to complex reasoning tasks and autonomous systems, and underscores their potential to transform a range of industry sectors through sophisticated, multimodal task handling.

Conclusions and Future Directions

The paper concludes with a thoughtful examination of current challenges and potential future directions in LMA research. It emphasizes the need for unified systems with direct memory manipulation, improved collaborative multi-agent frameworks, more robust evaluation mechanisms, and expanded real-world applications. The conclusion serves as a call to action for the research community to address these challenges and harness the full potential of LMAs in advancing AI technology.

Authors (5)
  1. Junlin Xie (8 papers)
  2. Zhihong Chen (63 papers)
  3. Ruifei Zhang (7 papers)
  4. Xiang Wan (93 papers)
  5. Guanbin Li (177 papers)
Citations (23)