
Position: Foundation Agents as the Paradigm Shift for Decision Making (2405.17009v3)

Published 27 May 2024 in cs.AI

Abstract: Decision making demands intricate interplay between perception, memory, and reasoning to discern optimal policies. Conventional approaches to decision making face challenges related to low sample efficiency and poor generalization. In contrast, foundation models in language and vision have showcased rapid adaptation to diverse new tasks. Therefore, we advocate for the construction of foundation agents as a transformative shift in the learning paradigm of agents. This proposal is underpinned by the formulation of foundation agents with their fundamental characteristics and challenges motivated by the success of LLMs. Moreover, we specify the roadmap of foundation agents from large interactive data collection or generation, to self-supervised pretraining and adaptation, and knowledge and value alignment with LLMs. Lastly, we pinpoint critical research questions derived from the formulation and delineate trends for foundation agents supported by real-world use cases, addressing both technical and theoretical aspects to propel the field towards a more comprehensive and impactful future.

Overview of "Foundation Agents as the Paradigm Shift for Decision Making"

The paper "Position: Foundation Agents as the Paradigm Shift for Decision Making" introduces the concept of foundation agents, emphasizing their potential to revolutionize agent learning paradigms similar to the impact of large foundation models in language and vision tasks. The discussion revolves around designing foundation agents to improve sample efficiency and generalization capabilities in complex decision-making scenarios.

Key Elements of Foundation Agents

Foundation agents are conceptualized as generally capable agents adept at handling diverse decision-making tasks across physical and virtual environments. The core attributes of foundation agents include:

  1. Unified Representation: A universal framework to represent variables within the decision process, which includes state-action spaces, feedback signals like rewards or goals, and environmental dynamics.
  2. Unified Policy Interface: A consistent policy framework applicable across disparate tasks and domains, including robotics, gameplay, and healthcare (see the interface sketch after this list).
  3. Interactive Decision-Making: The ability to reason about behaviors, address environment stochasticity and uncertainty, and navigate multi-agent competitive or cooperative scenarios.
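
To ground the first two attributes, here is a minimal Python sketch of what a unified token-based representation and a shared policy interface could look like. All names and fields are hypothetical illustrations, not an API from the paper.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class Timestep:
    """One interaction step, mapped into a shared discrete token space."""
    observation_tokens: Sequence[int]  # e.g., patchified image or state tokens
    action_tokens: Sequence[int]       # discretized action dimensions
    feedback_tokens: Sequence[int]     # reward, goal, or return-to-go tokens

class FoundationAgentPolicy:
    """A single policy interface reused across robotics, gameplay, healthcare, etc."""

    def act(self, history: Sequence[Timestep]) -> Sequence[int]:
        # A concrete agent would run a sequence model over the flattened
        # token stream; only the interface shape is fixed here.
        raise NotImplementedError
```

Because every domain is funneled through the same token space and the same act() signature, one pretrained model can, in principle, be queried identically across tasks.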

Roadmap to Foundation Agents

The paper delineates a strategic roadmap for developing foundation agents in three stages, with a toy pipeline sketched after the list:

  1. Large-Scale Data Collection: Interactive data can be accumulated from various sources like the internet (e.g., YouTube videos, tutorials) and real-world interactions.
  2. Self-Supervised Pretraining: Utilizing unsupervised learning techniques to pretrain models on large volumes of unannotated data.
  3. Alignment with LLMs: Integrating knowledge and values encapsulated within LLMs to enhance foundation agents' reasoning and generalization capabilities.
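
A runnable toy skeleton of the three stages is sketched below. Every function body is a placeholder standing in for the real machinery, so the staging rather than the internals is the point.

```python
from typing import Dict, List

def collect_interactive_data() -> List[List[int]]:
    # Stage 1: in practice, tokenized interaction streams from web videos,
    # tutorials, and real-world logs. Here: two dummy trajectories.
    return [[1, 2, 3, 4], [5, 6, 7, 8]]

def pretrain(trajectories: List[List[int]]) -> Dict:
    # Stage 2: self-supervised pretraining on unannotated trajectories
    # (autoregressive or masked prediction). Here: trivial token statistics.
    counts: Dict[int, int] = {}
    for traj in trajectories:
        for tok in traj:
            counts[tok] = counts.get(tok, 0) + 1
    return {"token_counts": counts}

def align_with_llm(agent: Dict) -> Dict:
    # Stage 3: inject knowledge and values from an LLM, e.g., via preference
    # optimization or LLM-shaped rewards. Here: just mark the agent.
    agent["aligned_with_llm"] = True
    return agent

agent = align_with_llm(pretrain(collect_interactive_data()))
print(agent)
```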

Self-Supervised Pretraining and Adaptation

Self-supervised learning is pivotal to building these agents. Pretraining involves two notable steps:

  1. Embedding Trajectories: This includes tokenizing trajectory sequences and utilizing various Transformer architectures for sequence modeling.
  2. Learning Objectives: The learning objectives encompass autoregressive or masked modeling techniques adapted from the language and vision domains; a table in the paper gives a comprehensive summary of these objectives, and a minimal autoregressive example is sketched below.
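
As a hedged illustration of the autoregressive variant, the PyTorch sketch below runs one training step of a tiny causal Transformer on random trajectory tokens with a next-token loss. The vocabulary size and model dimensions are arbitrary, a real agent would tokenize (return, state, action) streams rather than random integers, and a masked variant would hide random tokens and reconstruct them instead.

```python
import torch
import torch.nn as nn

VOCAB, D_MODEL, SEQ_LEN = 1024, 128, 64  # illustrative sizes only

embed = nn.Embedding(VOCAB, D_MODEL)
layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(D_MODEL, VOCAB)

tokens = torch.randint(0, VOCAB, (8, SEQ_LEN))  # stand-in trajectory batch

# Causal mask: position t may only attend to positions <= t.
causal = torch.triu(torch.full((SEQ_LEN, SEQ_LEN), float("-inf")), diagonal=1)

hidden = backbone(embed(tokens), mask=causal)
logits = head(hidden)[:, :-1]                   # predict token t+1 from <= t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1)
)
loss.backward()
print(f"next-token loss: {loss.item():.3f}")
```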

Challenges in Foundation Agents Development

Despite the promising capabilities, several challenges must be addressed:

  1. Unified or Compositional Models: There is an ongoing debate on whether a single, unified model should be pursued or whether a compositional approach that integrates existing foundation models is more practical.
  2. Optimization and Theoretical Foundations: The optimization of these agents through rigorous theoretical frameworks needs extensive research, particularly to ensure efficacy and robustness.
  3. Handling Open-Ended Tasks: Foundation agents must incorporate various strategies to handle open-ended tasks characterized by evolving objectives and environments.

Use Cases and Implications

The potential impact of foundation agents spans multiple domains:

  1. Autonomous Control: Including robotics and self-driving vehicles, where foundation agents can improve adaptability and robustness.
  2. Healthcare: Enhancing diagnostic accuracy and treatment personalization by leveraging vast medical data efficiently.
  3. Scientific Research: Accelerating discovery and experimentation processes, thereby expediting scientific advancements.

Conclusion

The paper posits that the integration of extensive interactive data, self-supervised learning, and the alignment with LLMs could significantly advance the development of foundation agents. Given the complexity and diversity of decision-making tasks, future research must navigate the challenges of unified modeling and optimization strategies, ensuring foundation agents' reliability and effectiveness in real-world applications. The evolution toward foundation agents potentially marks a significant shift in artificial intelligence, bringing us closer to achieving robust and versatile autonomous systems.

Authors (4)
  1. Xiaoqian Liu
  2. Xingzhou Lou
  3. Jianbin Jiao
  4. Junge Zhang