Position Paper: Agent AI Towards a Holistic Intelligence (2403.00833v1)

Published 28 Feb 2024 in cs.AI

Abstract: Recent advancements in large foundation models have remarkably enhanced our understanding of sensory information in open-world environments. To leverage the power of foundation models, it is crucial for AI research to pivot away from excessive reductionism and toward systems that function as cohesive wholes. Specifically, we emphasize developing Agent AI -- an embodied system that integrates large foundation models into agent actions. The emerging field of Agent AI spans a wide range of existing embodied and agent-based multimodal interactions, including robotics, gaming, and healthcare systems. In this paper, we propose a novel large action model for achieving embodied intelligent behavior, the Agent Foundation Model. Building on this idea, we discuss how Agent AI exhibits remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Furthermore, we discuss the potential of Agent AI from an interdisciplinary perspective, situating AI cognition and consciousness within scientific discourse. We believe these discussions serve as a basis for future research directions and encourage broader societal engagement.

Authors (16)
  1. Qiuyuan Huang
  2. Naoki Wake
  3. Bidipta Sarkar
  4. Zane Durante
  5. Ran Gong
  6. Rohan Taori
  7. Yusuke Noda
  8. Demetri Terzopoulos
  9. Noboru Kuno
  10. Ade Famoti
  11. Ashley Llorens
  12. John Langford
  13. Hoi Vo
  14. Li Fei-Fei
  15. Katsu Ikeuchi
  16. Jianfeng Gao