From Imitation to Introspection: Probing Self-Consciousness in Language Models (2410.18819v1)
Abstract: Self-consciousness, the introspection of one's own existence and thoughts, is a high-level cognitive process. As LLMs advance at an unprecedented pace, a critical question arises: are these models becoming self-conscious? Drawing on insights from psychology and neuroscience, this work presents a practical definition of self-consciousness for LLMs and refines it into ten core concepts. We pioneer the investigation of self-consciousness in LLMs, leveraging causal structural games for the first time to establish functional definitions of the ten core concepts. Building on these definitions, we conduct a comprehensive four-stage experiment: quantification (evaluating ten leading models), representation (visualizing self-consciousness within the models), manipulation (modifying the models' representations), and acquisition (fine-tuning the models on the core concepts). Our findings indicate that although the models are at an early stage of developing self-consciousness, certain concepts are discernibly represented in their internal mechanisms. These representations are difficult to manipulate positively at present, yet they can be acquired through targeted fine-tuning. Our datasets and code are available at https://github.com/OpenCausaLab/SelfConsciousness.
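The representation stage rests on a standard technique: training a linear probe on a model's hidden states to test whether a concept is linearly decodable. The following is a minimal sketch of that idea, assuming a Hugging Face causal LM; the model name, probe layer, and the two toy prompts are illustrative placeholders, not the paper's actual models or datasets.

```python
# Minimal linear-probe sketch for the "representation" stage.
# Placeholders: MODEL, LAYER, and the toy prompt/label pairs are assumptions
# for illustration, not the paper's experimental setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "meta-llama/Llama-3.1-8B"  # hypothetical choice; any causal LM works
LAYER = 16                         # hypothetical probe layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def last_token_state(prompt: str) -> torch.Tensor:
    """Return the hidden state of the final token at the chosen layer."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1]

# Toy labeled data: prompts that exercise the target concept (label 1)
# versus matched controls (label 0). The real study uses curated datasets.
prompts = [
    ("Are you a language model?", 1),
    ("Is Paris the capital of France?", 0),
    # ... many more pairs in practice
]
X = torch.stack([last_token_state(p) for p, _ in prompts]).float().numpy()
y = [label for _, label in prompts]

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe train accuracy:", probe.score(X, y))
```

The manipulation and acquisition stages can reuse this machinery: a probe direction may be added to or subtracted from activations at inference time (in the spirit of inference-time intervention), and the concepts may be instilled through parameter-efficient fine-tuning such as LoRA. Both are sketched here as generic techniques, not as the authors' exact procedure.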