A little less conversation, a little more action, please: Investigating the physical common-sense of LLMs in a 3D embodied environment

Published 30 Oct 2024 in cs.AI (arXiv:2410.23242v2)

Abstract: As general-purpose tools, LLMs must often reason about everyday physical environments. In a question-and-answer capacity, understanding the interactions of physical objects may be necessary to give appropriate responses. Moreover, LLMs are increasingly used as reasoning engines in agentic systems, designing and controlling their action sequences. The vast majority of research has tackled this issue using static benchmarks comprising text- or image-based questions about the physical world. However, these benchmarks do not capture the complexity and nuance of real-life physical processes. Here we advocate a second, relatively unexplored approach: 'embodying' the LLMs by granting them control of an agent within a 3D environment. We present the first embodied and cognitively meaningful evaluation of physical common-sense reasoning in LLMs. Our framework allows direct comparison of LLMs with other embodied agents, such as those based on deep reinforcement learning, and with human and non-human animals. We employ the Animal-AI (AAI) environment, a simulated 3D virtual laboratory, to study physical common-sense reasoning in LLMs. For this, we use the AAI Testbed, a suite of experiments that replicate laboratory studies with non-human animals, to probe physical reasoning capabilities including distance estimation, tracking out-of-sight objects, and tool use. We demonstrate that state-of-the-art multimodal models with no fine-tuning can complete this style of task, allowing meaningful comparison to the entrants of the 2019 Animal-AI Olympics competition and to human children. Our results show that LLMs are currently outperformed by human children on these tasks. We argue that this approach allows the study of physical reasoning using ecologically valid experiments drawn directly from cognitive science, improving the predictability and reliability of LLMs.
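The embodied setup the abstract describes amounts to an observation-act loop: the environment renders a view, a multimodal model chooses a discrete action, and the loop repeats until the reward is reached or a step budget runs out. The minimal sketch below illustrates that loop only; `query_llm`, `ToyEnv`, and the action names are hypothetical stand-ins for the real model call and the Animal-AI API, not the paper's actual code.

```python
# Hypothetical sketch of an embodied LLM-evaluation loop.
# The model call and environment are toy stand-ins, not the AAI API.

ACTIONS = ["forward", "backward", "left", "right", "noop"]

def query_llm(observation_description: str) -> str:
    """Stand-in for a multimodal LLM call; here a trivial heuristic."""
    if "goal ahead" in observation_description:
        return "forward"
    return "left"  # search behaviour when no goal is visible

class ToyEnv:
    """Toy stand-in for a 3D arena with a single visible reward."""
    def __init__(self, goal_distance: int = 3):
        self.goal_distance = goal_distance
        self.facing_goal = False

    def observe(self) -> str:
        return "goal ahead" if self.facing_goal else "no goal visible"

    def step(self, action: str) -> bool:
        if action == "left":
            self.facing_goal = True          # turning reveals the goal
        elif action == "forward" and self.facing_goal:
            self.goal_distance -= 1
        return self.goal_distance == 0       # True once the reward is reached

def run_episode(env: ToyEnv, max_steps: int = 20) -> bool:
    """Run one episode; return whether the agent reached the reward."""
    for _ in range(max_steps):
        action = query_llm(env.observe())
        assert action in ACTIONS             # the LLM must pick a legal action
        if env.step(action):
            return True
    return False
```

The key design point this illustrates is that the LLM is evaluated through its behaviour over many steps, not through a single static question, which is what enables like-for-like comparison with RL agents and children run on the same arenas.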
