Brain in a Vat: On Missing Pieces Towards Artificial General Intelligence in Large Language Models (2307.03762v1)
Abstract: In this perspective paper, we first comprehensively review existing evaluations of LLMs using both standardized tests and ability-oriented benchmarks. We pinpoint several problems with current evaluation methods that tend to overstate the capabilities of LLMs. We then articulate what artificial general intelligence should encompass beyond the capabilities of LLMs. We propose four characteristics of generally intelligent agents: 1) they can perform unlimited tasks; 2) they can generate new tasks within a context; 3) they operate based on a value system that underpins task generation; and 4) they have a world model reflecting reality, which shapes their interaction with the world. Building on this viewpoint, we highlight the missing pieces of artificial general intelligence, that is, the unity of knowing and acting. We argue that active engagement with objects in the real world delivers more robust signals for forming conceptual representations. Additionally, knowledge acquisition is not solely reliant on passive input but requires repeated trial and error. We conclude by outlining promising future research directions in the field of artificial general intelligence.